AES: The Server is found locked up due to hardware alarms.


Doc ID    SOLN290241
Version:    6.0
Status:    Published
Published date:    04 Jan 2024
Created Date:    29 May 2016
Author:    Jingui Zhang
 

Details

Seen with hardware: HP/ProLiant DL360p Gen8

•Dom0
[admin@AES6-DOM-DC1 ~]$ swversion
=======================================================================
System Platform Information
=======================================================================
Version 6.3.8.01002.0
Software UUID 3b5bedb3-ca63-4120-9333-c952d71e484c
SVN Revision 18924

=======================================================================
Kernel Version
=======================================================================
Linux 2.6.18-406.AV2.el5xen

• SSH on AES
[cust@AES6AICDC1 ~]$ swversion
***********************************************************************
Application Enablement Services
***********************************************************************
Version: 6.3.3.7.10-0
Server Type: OTHER
Offer Type: VIRTUAL_APPLIANCE_ON_SP
Virtual Machine Information - AES
Software Update Revision: 6.3.3.7.10-0
System Platform Version: 6.3.8.01002.0
***********************************************************************
Operating System Version: Linux 2.6.18-371.6.1.AV2.domU.el5xen

************* Patch Numbers Installed in this system are *************
5
7
***********************************************************************
Use "swversion [-a | --all]" to get a complete list of AE Services RPMS and Pathes/Updates

 

Problem Clarification

The System Platform server restarts by itself. AES goes down and freezes completely, stops responding, and needs a hard reset to recover.

Cause

The following hardware alarms were found in hplog:

0006 Critical       09:07  05/11/2016 09:07  05/11/2016 0001
LOG: Unrecoverable System Error (NMI) has occurred.  System Firmware will log additional details in a separate IML entry if possible
 
0007 Critical       09:07  05/11/2016 09:07  05/11/2016 0001
LOG: Uncorrectable PCI Express Error (Embedded device, Bus 0, Device 2, Function 2, Error status 0x00000000)
 
0008 Critical       09:06  05/11/2016 09:06  05/11/2016 0001
LOG: Drive Array Controller Failure (Slot 0)
 
0009 Caution        11:05  05/11/2016 11:05  05/11/2016 0001
LOG: POST Error: 1719 - A controller failure event
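
When the log is long, the Critical entries can be picked out with a small script. The following is a minimal sketch, not an Avaya or HP tool: it assumes the two-line layout seen in the excerpt above (a header line with ID, severity, two time/date pairs and a count, followed by a "LOG:" line), and the names parse_hplog and critical_entries are hypothetical.

```python
import re

# Header layout assumed from the excerpt above:
# ID  Severity  time  date  time  date  count
HEADER = re.compile(
    r"^(?P<id>\d{4})\s+(?P<severity>\w+)\s+"
    r"(?P<time>\d{2}:\d{2})\s+(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"\d{2}:\d{2}\s+\d{2}/\d{2}/\d{4}\s+(?P<count>\d+)$"
)

# Sample text copied from the alarms listed in this article.
SAMPLE = """\
0006 Critical       09:07  05/11/2016 09:07  05/11/2016 0001
LOG: Unrecoverable System Error (NMI) has occurred.  System Firmware will log additional details in a separate IML entry if possible

0007 Critical       09:07  05/11/2016 09:07  05/11/2016 0001
LOG: Uncorrectable PCI Express Error (Embedded device, Bus 0, Device 2, Function 2, Error status 0x00000000)

0008 Critical       09:06  05/11/2016 09:06  05/11/2016 0001
LOG: Drive Array Controller Failure (Slot 0)

0009 Caution        11:05  05/11/2016 11:05  05/11/2016 0001
LOG: POST Error: 1719 - A controller failure event
"""


def parse_hplog(text):
    """Return one dict per log entry (id, severity, time, date, count, message)."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            # A header line starts a new entry; its message follows on the next line.
            current = match.groupdict()
            current["message"] = ""
            entries.append(current)
        elif line.startswith("LOG:") and current is not None:
            current["message"] = line[len("LOG:"):].strip()
    return entries


def critical_entries(entries):
    """Keep only the entries whose severity is Critical."""
    return [e for e in entries if e["severity"].lower() == "critical"]


if __name__ == "__main__":
    for entry in critical_entries(parse_hplog(SAMPLE)):
        print(entry["id"], entry["date"], entry["time"], entry["message"])
```

Run against the sample, this lists entries 0006, 0007 and 0008 and leaves out the Caution entry 0009. Dates are kept as plain strings because the excerpt does not say whether they are month-first or day-first.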

Solution

 

The validate report shows no critical issues other than the hardware errors on 05/11/2016.

I searched the HP website for these error messages, and the issues in the threads below are quite similar to ours.

http://community.hpe.com/t5/ProLiant-Servers-ML-DL-SL/DL380p-Gen8-with-uncorrectabl-PCI-express-error/td-p/5995669
http://community.hpe.com/t5/ProLiant-Servers-ML-DL-SL/Proliant-DL380-G7-Fatal-PCI-Express-Device-Error-PCI-B00-D00-F00/td-p/5313433

As you can see in these two threads, HP does not have a definitive way to identify and resolve this error.
After the issue occurred, we hard-rebooted the server. If the issue does not recur after that (that is, no new hardware errors are logged in hplog), I think the problem was caused by a power outage.
 

ADDITIONAL INFORMATION:

This appears to be an HP iLO issue. Another solution is to upgrade the HP ProLiant DL360p Gen8 firmware to the latest version.

PSN027020u: HP DL360P G8 Firmware Update

https://downloads.avaya.com/css/P8/documents/101010876

From validateSP:

=========================================================================

Is BIOS firmware latest? Current: 20130301 Latest: 20140802 [ WARNING ]

=========================================================================
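
The same check can be read programmatically if the validateSP output has been saved to a file. This is a rough sketch only: it assumes the line layout shown above and that the two values are dates in YYYYMMDD form, which the output itself does not state. The name parse_firmware_check is hypothetical.

```python
import re
from datetime import datetime

# Layout assumed from the validateSP line quoted in this article:
# "<question>? Current: <8 digits> Latest: <8 digits> [ STATUS ]"
CHECK = re.compile(
    r"^(?P<check>.+?\?)\s+Current:\s*(?P<current>\d{8})\s+"
    r"Latest:\s*(?P<latest>\d{8})\s+\[\s*(?P<status>\w+)\s*\]$"
)

SAMPLE_LINE = "Is BIOS firmware latest? Current: 20130301 Latest: 20140802 [ WARNING ]"


def parse_firmware_check(line):
    """Return the parsed check as a dict, or None if the line does not match."""
    match = CHECK.match(line.strip())
    if match is None:
        return None
    # Assumption: both values are YYYYMMDD release dates.
    current = datetime.strptime(match["current"], "%Y%m%d").date()
    latest = datetime.strptime(match["latest"], "%Y%m%d").date()
    return {
        "check": match["check"],
        "current": current,
        "latest": latest,
        "status": match["status"],
        "outdated": current < latest,
    }


if __name__ == "__main__":
    result = parse_firmware_check(SAMPLE_LINE)
    if result and result["outdated"]:
        print(f"{result['check']} installed {result['current']}, "
              f"available {result['latest']} [{result['status']}]")
```

For the line above, the installed BIOS (20130301) is older than the latest listed one (20140802), so the check is flagged as outdated.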


Additional external information about this iLO issue:

http://www.linuxquestions.org/questions/linux-general-1/bl460c-g8-host-unexpectedly-reset-4175506736/

Make sure you download the firmware update from the Avaya Support page.
 


Avaya -- Proprietary. Use pursuant to the terms of your signed agreement or Avaya policy