Sun Fire 12K/15K Domains May Experience Device Timeouts During SMS Startup Resulting in "Dstop"

Category: Availability
Release Phase: Resolved
Product: Sun Fire 12K Server, Sun Fire 15K Server
Bug Id: 4848931
Date of Workaround Release: 18-AUG-2003
Date of Resolved Release: 14-OCT-2003

Impact
In rare cases, Sun Fire 12K/15K domains may experience device timeouts during SMS startup, resulting in a domain stop ("Dstop"). When this issue occurs, all domains in the platform may be abruptly halted.
Contributing Factors
This issue can occur in the following releases:
SPARC Platform
- Sun Fire 12K/15K with SMS 1.1
- Sun Fire 12K/15K with SMS 1.2 without patch 112481-14
- Sun Fire 12K/15K with SMS 1.3 without patch 114640-06
Symptoms
This issue occurs when SMS is starting to assume the MAIN role. This can happen when the System Controller (SC) is rebooted and SMS first starts running, or it can happen when a failover occurs, causing the SC which was previously running as SPARE to assume the role of MAIN.
Messages indicating the presence of this issue will appear in the "/var/opt/SUNWSMS/adm/platform/messages" file on the SC which is becoming MAIN.
Under normal SMS startup conditions, SC console messages similar to the following will appear:
Apr 29 14:48:23 2003 sc1 ssd[311]: [0 99863813331 NOTICE SSDWorkArea.cc 38]
SMS software start-up initiated
Apr 29 14:48:23 2003 sc1 ssd[311]: [0 99886813296 NOTICE SSDWorkArea.cc 38]
SC POST results: 'CP1500 POST Passed; SSC POST v1.18 Passed'
Apr 29 14:48:28 2003 sc1 fomd[397]: [8599 105112577341 NOTICE FMHeartbeat.cc 232]
Checking for SC heartbeat interrupts (can take up to 15 seconds) ...
When SMS assumes the role of MAIN due to a failover, messages similar to the following will appear:
Jun 18 17:16:06 2003 sc1 fomd[406]: [8573 8860802215595 NOTICE FailoverMgr.cc 1983]
Taking over the main role because the remote SC (current Main) has a fault -
Forced Failover
This issue is present if either of the above sets of messages is followed immediately by device timeout messages similar to the following:
Jun 18 17:15:12 2003 sc1 hwad[349]: [1123 8807230392886 ERR I2cComm.cc 407]
I2c read time out - bus: 36, address: 27
Jun 18 17:15:14 2003 sc1 hwad[349]: [1123 8809053779576 ERR I2cComm.cc 407]
I2c read time out - bus: 52, address: 27
The mode of failure differs depending on which device times out. For example, when a centerplane support board (CSB) times out, messages similar to the following will appear:
Jul 11 13:15:59 2002 sc0 fomd[2413]: [6117 1446140187195 ERR L2PowerControl.cc 486]
Error (2) returned from attempting to get power converter readings on CSB at CS0
Jul 11 13:15:59 2002 sc0 fomd[2413]: [6202 1446143025381 ERR CSBPowerControl.cc 554]
Failed to determine if CSB at CS0 is on or off.
Jul 11 13:15:59 2002 sc0 fomd[2413]: [8504 1446143975373 ERR FailoverMgr.cc 3275]
Error getting power state for CSB at CS0 (ecode = 6202)
Jul 11 13:16:02 2002 sc0 fomd[2413]: [6117 1449240351512 ERR L2PowerControl.cc 486]
Error (2) returned from attempting to get power converter readings on EXB at EX16
When a MaxCPU (MCPU) board times out, messages similar to the following will appear:
Jun 18 17:15:50 2003 sc1 frad[400]: [9916 8844311239199 ERR SeepromInfoPro.cc 1956]
Seeprom Info HWAD proxy call failed on HPCI at IO4, ecode: 2
Jun 18 17:16:23 2003 sc1 hwad[349]: [1174 8877339642485 ERR PciComm.cc 232]
console bus device failed to respond correctly at address 31200000
Jun 18 17:16:23 2003 sc1 esmd[13577]: [1941 8877343422657 ERR DetectorS.cc 835]
Failed to read state point core_c_pwr_on0, located on MCPU at IO4: ecode=1174
There may be other indications that this issue has occurred with different devices.
Workaround
Normally, SMS reboots a domain automatically after it experiences a "Dstop". If the domain does not recover automatically, use the "setkeyswitch off" and "setkeyswitch on" commands to recover it. Disabling failover on the System Controllers can also serve as a temporary workaround.
Resolution
This issue is addressed in the following releases:
SPARC Platform
- Sun Fire 12K/15K with System Management Software (SMS) 1.2 with patch 112481-14 or later
- Sun Fire 12K/15K with System Management Software (SMS) 1.3 with patch 114640-06 or later
Note: SMS 1.1 requires an upgrade to a later release.
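As a rough aid for checking an SC against the patch levels listed above, the following sketch scans saved "showrev -p" output for the two patch IDs and compares revisions. The function name and the assumption that each installed patch appears on a line beginning with "Patch: <id>-<rev>" are illustrative. The sketch does not account for a later patch that obsoletes one of these IDs, and it does not check which SMS release is installed.

```python
import re

# Minimum revisions taken from the Resolution list above.
MIN_PATCH = {"112481": 14, "114640": 6}

# Only the ID that directly follows "Patch:" is considered, so IDs that
# appear later on the line (for example under "Obsoletes:") are ignored.
PATCH_LINE = re.compile(r"^Patch:\s+(\d{6})-(\d{2})\b", re.MULTILINE)


def has_fix(showrev_output):
    """Return True if either patch is installed at or above its minimum revision."""
    best = {}
    for match in PATCH_LINE.finditer(showrev_output):
        base, rev = match.group(1), int(match.group(2))
        if base in MIN_PATCH:
            best[base] = max(rev, best.get(base, 0))
    return any(rev >= MIN_PATCH[base] for base, rev in best.items())
```

The output of "showrev -p" on the SC can be captured to a file and its contents passed to has_fix() as a single string.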
Modification History
Date: 14-OCT-2003
- Change State to Resolved
- Updated Contributing Factors and Resolution sections