On Rare Occasions, Sun Fire 12K/15K/20K/25K Servers May Experience an Unexpected "Dstop" During a Recovery POST

Category: Availability
Release Phase: Resolved
Product: Sun Fire 12K Server, Sun Fire E20K Server, Sun Fire 15K Server, Sun Fire E25K Server
Bug Id: 4943893
Date of Resolved Release: 02-JUN-2004
Impact
A domain on Sun Fire 12K/15K/20K/25K servers may experience a Domain Stop (Dstop) during a recovery POST (hpost(1M) -Q). This issue occurs after a Solaris domain has reset and SMS attempts to recover the domain. Only the domain that has reset is affected; its recovery takes longer because POST is then rerun without the "-Q" (quick) option.
Contributing Factors
This issue can occur in the following releases:
SPARC Platform
- Sun Fire 12K/15K with SMS 1.2
- Sun Fire 12K/15K with SMS 1.3 without patch 114608-09
- Sun Fire 12K/15K with SMS 1.4
- Sun Fire 12K/15K/20K/25K with SMS 1.4.1 without patch 117371-02
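The release list above can be expressed as a small lookup, which may be convenient when auditing several System Controllers. The following is a minimal sketch only: the function name `is_exposed` and the table `FIXED_PATCH` are hypothetical, and the list of installed patch IDs is assumed to have been gathered separately (for example from patch listing output on the SC). It treats a patch revision equal to or higher than the one named in this alert as containing the fix.

```python
import re

# Minimum patch revision containing the fix, per SMS release (values from this alert).
# None means no patch is listed for that release, so it is always treated as exposed.
FIXED_PATCH = {
    "1.2": None,
    "1.3": ("114608", 9),
    "1.4": None,
    "1.4.1": ("117371", 2),
}


def is_exposed(sms_version, installed_patches):
    """Return True if this SMS release / patch combination is listed as affected."""
    if sms_version not in FIXED_PATCH:
        raise ValueError(f"SMS release not covered by this alert: {sms_version}")
    fix = FIXED_PATCH[sms_version]
    if fix is None:
        return True
    base, min_rev = fix
    for patch in installed_patches:
        # Patch IDs are assumed to look like "114608-09" (six-digit base, two-digit revision).
        m = re.fullmatch(r"(\d{6})-(\d{2})", patch.strip())
        if m and m.group(1) == base and int(m.group(2)) >= min_rev:
            return False
    return True
```

For example, `is_exposed("1.3", ["114608-08"])` reports the system as exposed, while the same call with `"114608-09"` does not.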
Symptoms
Should the described issue occur, the POST log will show the Dstop and an "xcstate" file will be generated under the domain dump directory, similar to the following example:
DSTOP Detected for Slot SB17
No valid errors in any SDI master stop1 status in this dump.
Proceeding using local first-error status.
SDI EX17/S0 Master_Stop_Status0[31:0] = 10000047
MStop0[0]: All SDI logic is DStopped
MStop0[1]: Slot 0 port is DStopped
MStop0[2]: Slot 1 port is DStopped
MStop0[6]: L1 Slot0 error line detected
SDI EX17/S0 Dstop0[31:0] = 02008200
Dstop0[25]: D 1E AXQ requests all Dstop (M)
AXQ EX17 (17) Error_Flag_06[31:0] = 00800080 Mask = 7E00FFFF
Err6[23]: D Attempt to deallocate a deallocated Home WATransID
AXQ EX17 (17) Error_Flag_07[31:0] = 08000800 Mask = 63FF7D24
Err7[27]: D Home Agent counter underflow
FAIL EXB EX17: SDI EX17/S0 detected Error request from AXQ, no reason found.
Primary service FRU is EXB EX17.
FAIL ENTIRE SYSTEM: FAIL & system_red_any_fail (-Q) set
There is no FRU service action indicated for this failure.
System state dumped to
/var/opt/SUNWSMS/SMS1.3/adm/A/dump/xcstate.031016.1451.30
Boards in dump: master SC CPs/CSBs[1:0]: 3
EXB[17:0]: 20000
Slot0[17:0]: 20000
Slot1[17:0]: 00000
Recordstop analysis resulting in FAIL, handled as Dstop
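When reviewing many POST logs, the markers in the example above can be searched for mechanically. The sketch below is illustrative only: the function name `match_signature` and the choice of four markers (the DSTOP line, Err6[23], Err7[27], and the "(-Q) set" failure line) are assumptions drawn from this one example, and a real log for this issue may differ in slot, path, or detail.

```python
import re

# Marker patterns taken from the example POST log in this alert.
SIGNATURE = {
    "dstop": re.compile(r"DSTOP Detected for Slot (\S+)"),
    "err6_23": re.compile(r"Err6\[23\]"),
    "err7_27": re.compile(r"Err7\[27\]"),
    "quick_post": re.compile(r"\(-Q\) set"),
}


def match_signature(log_text):
    """Report which example markers appear in a POST log, plus slot and dump file if found."""
    found = {name: bool(rx.search(log_text)) for name, rx in SIGNATURE.items()}
    slot = SIGNATURE["dstop"].search(log_text)
    # The dump file name is assumed to follow the xcstate.<date>.<time>.<n> form shown above.
    dump = re.search(r"\S*xcstate\.\d+\.\d+\.\d+", log_text)
    return {
        "markers": found,
        "slot": slot.group(1) if slot else None,
        "xcstate": dump.group(0) if dump else None,
        "all_present": all(found.values()),
    }
```

A log in which `all_present` is true resembles the example closely; a partial match still warrants a manual comparison against the output shown above.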
Workaround
Normally, SMS will automatically rerun POST without "-Q" after this type of issue, and allowing that automatic rerun is the recommended method. To recover the domain manually, use the "setkeyswitch off" and "setkeyswitch on" commands to force a regular (i.e. not -Q) POST to run.
Resolution
This issue is addressed in the following releases:
SPARC Platform
- Sun Fire 12K/15K with SMS 1.3 patch 114608-09 or later
- Sun Fire 12K/15K/20K/25K with SMS 1.4.1 patch 117371-02 or later
For Sun Fire 12K/15K running SMS 1.2 and 1.4, an upgrade to SMS 1.4.1 is required.