Sun Fire 12K/15K Domain May "Dstop" When Another Domain Sharing an Expander is Performing a "setkeyswitch on/off" or a DR Operation
Category: Availability
Release Phase: Resolved
Product: Sun Fire 12K Server, Sun Fire 15K Server
Bug Id: 4943895
Date of Resolved Release: 13-JAN-2004
Impact
When "hpost" is interrupted on a Sun Fire 12K/15K domain, another domain that shares an expander with it anywhere in its configuration may suffer a Domain Stop (Dstop). "hpost" is run by either "setkeyswitch on" or a DR operation, so an interruption during either one can trigger the issue. It is possible that the issue may also occur during domain shutdown. Domains which do not share an expander are not susceptible to this issue.
Note: "hpost" can be interrupted either manually (e.g. "ctrl-c" of the executing command) or by setting the domain's keyswitch to off/standby while the domain is being recovered by "dsmd".
Contributing Factors
This issue can occur in the following releases:
SPARC Platform
- Sun Fire 12K/15K with System Management Software (SMS) 1.1
- Sun Fire 12K/15K with System Management Software (SMS) 1.2
- Sun Fire 12K/15K with System Management Software (SMS) 1.3 without patch 114640-09
To determine whether a system has a shared expander or not, run the following command:
% showboards
Location  Pwr  Type of Board  Board Status  Test Status  Domain
--------  ---  -------------  ------------  -----------  ------
SB12      On   CPU            Available     Unknown      A
IO12      On   HPCI           Available     Unknown      B
In the above example, SB12 and IO12 are the two boards on expander 12, yet SB12 is assigned to Domain A and IO12 is assigned to Domain B. Two domains therefore share that expander, which indicates a split expander configuration.
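The check described above can be automated. The following is a minimal sketch, not an SMS tool: it assumes the "showboards" output has been captured as text, that board locations look like SB12 or IO12, and that the domain is the last whitespace-separated field on each row, as in the sample. The function name shared_expanders is hypothetical, and the last column is compared as-is without interpreting its value.

```python
import re
from collections import defaultdict

# Board locations as they appear in the sample: SB<n> or IO<n>.
BOARD_RE = re.compile(r"^(SB|IO)(\d+)$")


def shared_expanders(showboards_text):
    """Return {expander: {location: domain}} for expanders whose
    boards are listed under more than one domain label."""
    by_expander = defaultdict(dict)
    for line in showboards_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        match = BOARD_RE.match(fields[0])
        if not match:
            continue  # header, separator, or other non-board line
        expander = int(match.group(2))
        # Assumption: the domain label is the last column.
        by_expander[expander][fields[0]] = fields[-1]
    return {
        expander: boards
        for expander, boards in by_expander.items()
        if len(set(boards.values())) > 1
    }


sample = """\
Location Pwr Type of Board Board Status Test Status Domain
-------- --- ------------- ------------ ----------- ------
SB12 On CPU Available Unknown A
IO12 On HPCI Available Unknown B
"""

print(shared_expanders(sample))  # {12: {'SB12': 'A', 'IO12': 'B'}}
```

An empty result means no expander in the captured output carries boards with differing domain labels.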
Symptoms
Should the described issue occur, "dstop wfail" strings similar to the following may be found in the HW state dump directory "/var/opt/SUNWSMS/adm/X/dump":
SDI EX06/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX07/S0 Master_Stop_Status0[31:0] = 0000010F
MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
MStop0[8]: L1 Slot0 Ecc error line detected
SDI EX07/S0 Slot[1:0][3:0] Ecc Error Count = 0 0
SDI EX07/S0 Dstop0[31:0] = 000C8008
Dstop0[18]: D DARB texp requests Slot1 Dstop (M)
Dstop0[19]: D 1E SDI internal core requested Dstop
SDI EX07/S0 Core_Error0[31:0] = 02008200 Mask = 0051FFFF
CoreErr0[25]: D 1E Command pool timeout, non-split exp (M)
valid_{slot_wr[1:0],read}_TO = 1 (rev 4+)
{cmd_pool_loc[5:0],cmd4io,retired,half_used} = 020
FAIL EXB EX7: Dstop/Rstop detected by SDI EX7/S0.
Note: In the dump directory path above, X is a reference to the domain.
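To compare captured analysis text against the strings shown above, a small matcher can help. This is a sketch under stated assumptions: the text has already been saved as plain text, the two patterns are taken only from the example in this document, and real output may differ in wording or spacing. The function name match_signature is hypothetical.

```python
import re

# Patterns copied from the example strings in this document.
FAIL_RE = re.compile(r"FAIL EXB EX(\d+): Dstop/Rstop detected by SDI")
TIMEOUT_RE = re.compile(r"CoreErr0\[25\]:.*Command pool timeout")


def match_signature(text):
    """Return the sorted expander numbers named on FAIL EXB lines,
    but only if the command pool timeout line is also present."""
    if not TIMEOUT_RE.search(text):
        return []
    return sorted({int(number) for number in FAIL_RE.findall(text)})


example = """\
SDI EX07/S0 Core_Error0[31:0] = 02008200 Mask = 0051FFFF
CoreErr0[25]: D 1E Command pool timeout, non-split exp (M)
FAIL EXB EX7: Dstop/Rstop detected by SDI EX7/S0.
"""

print(match_signature(example))  # [7]
```

A match only suggests that the output resembles the example; it does not confirm that this bug was the cause of the Dstop.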
Workaround
There is no workaround. Please see the "Resolution" section below.
Resolution
This issue is addressed in the following release:
SPARC Platform
- Sun Fire 12K/15K with System Management Software (SMS) 1.3 with patch 114640-09 or later
Note: SMS 1.1 and 1.2 require an upgrade to a later release.
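The "114640-09 or later" condition can be expressed as a revision comparison. The sketch below assumes patch IDs are given as strings of the form base-revision (for example "114640-09") and that the caller has already collected the list of installed patches; how that list is obtained is outside the sketch. The function name has_fix is hypothetical, and the check is only meaningful on SMS 1.3.

```python
def has_fix(installed_patches, base="114640", min_revision=9):
    """Return True if any installed patch has the given base ID
    and a revision number of at least min_revision."""
    for patch in installed_patches:
        patch_base, separator, revision = patch.strip().partition("-")
        if not separator or patch_base != base:
            continue
        # Compare revisions numerically so that "10" ranks above "09".
        if revision.isdigit() and int(revision) >= min_revision:
            return True
    return False


print(has_fix(["114640-08"]))               # False
print(has_fix(["114640-09"]))               # True
print(has_fix(["112233-01", "114640-12"]))  # True
```

Comparing the revision as an integer rather than as a string avoids misordering if the revision field ever differs in width.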