Sun Fire 12K and Sun Fire 15K Domain May Panic or "Dstop" During the Reboot Sequence
Category: Availability
Release Phase: Resolved
Product: Sun Fire 12K Server, Sun Fire 15K Server
Bug Id: 4753686
Date of Workaround Release: 21-MAR-2003
Date of Resolved Release: 06-MAY-2003
Impact
A Sun Fire 12K/15K domain may panic or "Dstop" when it is rebooted with either reboot(1M) or the "init 6" command from the root user prompt. The domain may panic with a "safari bus error", "Dstop" with an "Intrupt MappedIn not seen" error, or both.
Contributing Factors
This issue can occur in the following releases:
SPARC Platform
- Sun Fire 12K/15K with System Management Software (SMS) 1.1
- Sun Fire 12K/15K with System Management Software (SMS) 1.2 without patch 112488-12
- Sun Fire 12K/15K with System Management Software (SMS) 1.3 without patch 114608-02
Note: The issue may occur if there are fewer CPU modules than IO controllers (2 per hPCI board, 1 per wPCI board).
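The condition in the note can be checked with simple arithmetic. The following is a minimal sketch, not a Sun-supplied tool: the function names and the idea of passing board counts by hand are assumptions made for illustration, and the only facts taken from this document are the controller counts per board type.

```python
def io_controller_count(hpci_boards: int, wpci_boards: int) -> int:
    """IO controllers in a domain: 2 per hPCI board, 1 per wPCI board."""
    return 2 * hpci_boards + wpci_boards


def is_exposed(cpu_modules: int, hpci_boards: int, wpci_boards: int) -> bool:
    """True when the domain has fewer CPU modules than IO controllers."""
    return cpu_modules < io_controller_count(hpci_boards, wpci_boards)
```

For example, a domain with 2 hPCI boards and 1 wPCI board has 5 IO controllers, so it would match the condition with 4 CPU modules but not with 5.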
Symptoms
The following are sample panic messages:
Mar 9 15:19:59 2002 ECC_Ctrl=e000000000000000
Mar 9 15:19:59 2002 UE_AFSR=00000000000fe1ff UE_AFAR=00000f8000000000
Mar 9 15:19:59 2002 CE_AFSR=00000000000fe1ff CE_AFAR=00000f8000000000
Mar 9 15:19:59 2002 panic[cpu0]/thread=10408000: Safari bus error: CSR=... ErrCtrl=fc00...
Sep 9 15:19:59 2002 IntrCtrl=8000000000000017 ErrLog=0000000000000010
Sep 9 15:19:59 2002 ECC_Ctrl=e000000000000000
The following will be seen in the corresponding "dsmd.dstop" state dump after a Dstop:
redxl> wfail
SDI EX00/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX01/S0: Slot 1 port is DStopped, SDI is RStopped, requested by DARB.
SDI EX02/S0: Slot 1 port is DStopped, SDI is RStopped, requested by DARB.
SDI EX03/S0: Slot 1 port is DStopped, SDI is RStopped, requested by DARB.
SDI EX04/S0 Master_Stop_Status0[31:0] = 9000008F
MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
SDI EX04/S0 Dstop0[31:0] = 2004A000
Dstop0[18]: D DARB texp requests Slot1 Dstop (M)
Dstop0[29]: D 1E Slot1 asserted Error, enabled to cause Dstop (M)
EPLD IO04 Err1_Dom1: Mask= B0 Err= 40 1stErr= 40
Err1[6]: 1E+ Error reported by BBC0
BBC IO04/BB0 Device_Err_Stat[31:0] = 80008100
DevErr[ 8]: 1E Port 0 Safari device asserted error
PCI IOC IO04/P0 Safari_Err_Log[63:0] = 80000000 00000210
ErrLog[ 4]: Intrupt MappedIn not seen for trans init'd by PCI IOC
ErrLog[ 9]: ErrOut Timeout on head of CI queue
ErrLog[63]: Error Out asserted (S_ERROR_L pin)
FAIL Port IO4/P0: Dstop detected by BBC IO4/BB0.
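When reviewing saved console logs or state dump output, the two message strings shown above can be searched for directly. The sketch below is illustrative only: the function and dictionary names are hypothetical, and it assumes the log text has already been read into a string. It does not parse the dump format.

```python
# Message fragments taken from the sample output in this document.
SIGNATURES = {
    "panic": "Safari bus error",
    "dstop": "Intrupt MappedIn not seen",
}


def match_signatures(log_text: str) -> set[str]:
    """Return the names of the signatures found in log_text (case-insensitive)."""
    lowered = log_text.lower()
    return {name for name, needle in SIGNATURES.items() if needle.lower() in lowered}
```

A match only indicates that the text resembles the samples in this document; the affected SMS release and patch level still need to be confirmed.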
Workaround
To work around the described issue, do not use reboot(1M) or the "init 6" command to reboot a failing domain.
Reboot the domain as follows:
- Shut down the domain as user root using shutdown(1M).
- As user "sms-svc" on the system controller, set the domain's keyswitch to off and then to on using setkeyswitch.
Alternatively, make the number of CPU modules greater than or equal to the number of IO controllers.
Or, modify the domain's power on self test operation by adding the following to the platform and/or domain specific ".postrc" file(s) as required (for example, to the "/etc/opt/SUNWSMS/SMS1.2/config/platform/.postrc" file):
dash_Q_fail
Note: The "dash_Q_fail" condition variable was introduced in SMS 1.2 with patch 112488-10, and is available in SMS 1.3. It is not available in SMS 1.1. For this reason one must upgrade to SMS 1.2 with patch 112488-10 or later, or upgrade to SMS 1.3 prior to using this workaround.
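Adding the directive can be done with any text editor. As an illustration, the sketch below appends "dash_Q_fail" to a ".postrc" file only if the line is not already present. It assumes the file holds one directive per line, and the helper name is hypothetical. The path is passed in by the caller rather than hard-coded.

```python
from pathlib import Path

DIRECTIVE = "dash_Q_fail"


def ensure_directive(postrc: Path, directive: str = DIRECTIVE) -> bool:
    """Append directive as its own line unless it is already there.

    Returns True if the file was changed, False if no change was needed.
    Assumes one directive per line (an assumption, not taken from this document).
    """
    lines = postrc.read_text().splitlines() if postrc.exists() else []
    if any(line.strip() == directive for line in lines):
        return False
    lines.append(directive)
    postrc.write_text("\n".join(lines) + "\n")
    return True
```

Running the helper a second time leaves the file unchanged, which avoids duplicate entries if the workaround is applied to both the platform and domain-specific files by script.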
Resolution
This issue is addressed in the following releases:
SPARC Platform
- Sun Fire 12K/15K with System Management Software (SMS) 1.2 with patch 112488-12 or later
- Sun Fire 12K/15K with System Management Software (SMS) 1.3 with patch 114608-02 or later
Note: SMS 1.1 will require an upgrade to SMS 1.2 or SMS 1.3 with the appropriate patch.
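The release and patch requirements above amount to a comparison of patch revision numbers. The following sketch shows one way to express that comparison, assuming patch identifiers follow the "base-revision" form used in this document (for example "112488-12"). The function names are hypothetical, and how the installed patch list is obtained is left to the reader.

```python
# Minimum patch revision that addresses the issue, per SMS release,
# as listed in the Resolution section.
FIXED_IN = {
    "1.2": ("112488", 12),
    "1.3": ("114608", 2),
}


def parse_patch(patch_id: str) -> tuple[str, int]:
    """Split an identifier such as '112488-12' into ('112488', 12)."""
    base, rev = patch_id.split("-")
    return base, int(rev)


def has_fix(sms_version: str, installed_patches: list[str]) -> bool:
    """True if the required patch (or a later revision) is installed.

    Releases not listed in FIXED_IN, such as SMS 1.1, are reported as
    not fixed, which is a deliberately conservative assumption.
    """
    required = FIXED_IN.get(sms_version)
    if required is None:
        return False
    base_needed, rev_needed = required
    for patch_id in installed_patches:
        base, rev = parse_patch(patch_id)
        if base == base_needed and rev >= rev_needed:
            return True
    return False
```

For example, SMS 1.2 with patch 112488-10 supports the "dash_Q_fail" workaround but would still be reported as not fixed, because the resolution requires revision 12 or later.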
Modification History
Date: 27-MAR-2003
- Minor modification to Relief/Workaround section
Date: 06-MAY-2003
- State: Resolved
- Updated Contributing Factors and Resolution sections
Attachments
This solution has no attachment.