Sun Fire 12K/15K Domains May Experience Device Timeouts During SMS Startup Resulting in "Dstop"



Category: Availability
Release Phase: Resolved
Product: Sun Fire 12K Server, Sun Fire 15K Server
Bug Id: 4848931
Date of Workaround Release: 18-AUG-2003
Date of Resolved Release: 14-OCT-2003


Impact

In rare cases, Sun Fire 12K/15K domains may experience device timeouts during SMS startup, resulting in a "Dstop". When this issue occurs, all domains in the platform may be abruptly halted, a condition known as a "domain stop" (Dstop).


Contributing Factors

This issue can occur in the following releases:

SPARC Platform

  • Sun Fire 12K/15K with SMS 1.1
  • Sun Fire 12K/15K with SMS 1.2 without patch 112481-14
  • Sun Fire 12K/15K with SMS 1.3 without patch 114640-06
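
The release list above can be expressed as a small check. The following is a minimal sketch, not a supported tool: the function name and the input format (patch IDs as strings such as "112481-14", for example as collected from "showrev -p" output) are assumptions made for illustration.

```python
import re

# Minimum patch revision containing the fix, per SMS release (from this alert).
FIXED_PATCHES = {
    "1.2": ("112481", 14),
    "1.3": ("114640", 6),
}


def is_affected(sms_version, installed_patches):
    """Return True if this SMS release and patch set is exposed to the issue.

    sms_version: "1.1", "1.2" or "1.3" (other releases are not covered here).
    installed_patches: iterable of patch IDs such as "112481-14" (assumed format).
    """
    if sms_version == "1.1":
        return True  # no patch exists for SMS 1.1; an upgrade is required
    if sms_version not in FIXED_PATCHES:
        raise ValueError("SMS %s is not covered by this alert" % sms_version)
    base, min_rev = FIXED_PATCHES[sms_version]
    for patch in installed_patches:
        match = re.fullmatch(r"(\d{6})-(\d{2})", patch.strip())
        if match and match.group(1) == base and int(match.group(2)) >= min_rev:
            return False  # fix (or a later revision) is installed
    return True
```

For example, is_affected("1.2", ["112481-13"]) reports the system as exposed, while a revision of -14 or later does not.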

Symptoms

This issue occurs while SMS is starting up and assuming the MAIN role. This can happen when the System Controller (SC) is rebooted and SMS first starts running, or when a failover causes the SC that was previously running as SPARE to assume the MAIN role.

Messages indicating the presence of this issue will appear in the "/var/opt/SUNWSMS/adm/platform/messages" file on the SC which is becoming MAIN.

Under normal SMS startup conditions, SC console messages similar to the following will appear:

    Apr 29 14:48:23 2003 sc1 ssd[311]: [0 99863813331 NOTICE SSDWorkArea.cc 38]
    SMS software start-up initiated
    Apr 29 14:48:23 2003 sc1 ssd[311]: [0 99886813296 NOTICE SSDWorkArea.cc 38]
    SC POST results:  'CP1500 POST Passed; SSC POST v1.18 Passed'
    Apr 29 14:48:28 2003 sc1 fomd[397]: [8599 105112577341 NOTICE FMHeartbeat.cc 232]
    Checking for SC heartbeat interrupts (can take up to 15 seconds) ...

When SMS assumes the role of MAIN due to a failover, messages similar to the following will appear:

    Jun 18 17:16:06 2003 sc1 fomd[406]: [8573 8860802215595 NOTICE FailoverMgr.cc 1983]
    Taking over the main role because the remote SC (current Main) has a fault -
    Forced Failover

This issue is present if either of the above sets of messages is followed immediately by device timeout messages similar to the following:

    Jun 18 17:15:12 2003 sc1 hwad[349]: [1123 8807230392886 ERR I2cComm.cc 407]
    I2c read time out -  bus: 36, address: 27
    Jun 18 17:15:14 2003 sc1 hwad[349]: [1123 8809053779576 ERR I2cComm.cc 407]
    I2c read time out -  bus: 52, address: 27
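
The pattern described above (a startup or takeover message followed shortly by I2c read timeouts) can be searched for in the platform messages file. This is a minimal sketch under stated assumptions: the marker strings are taken from the excerpts in this alert, while the function name and the 20-line window used to approximate "followed immediately" are hypothetical choices, not values defined by SMS.

```python
# Marker strings copied from the message excerpts in this alert.
START_MARKERS = (
    "SMS software start-up initiated",
    "Taking over the main role",
)
TIMEOUT_MARKER = "I2c read time out"


def find_timeouts_after_startup(lines, window=20):
    """Return (start_index, timeout_index) pairs for suspicious timeouts.

    A timeout counts as suspicious when it appears within `window` lines
    after the most recent startup or takeover message. The window size is
    an assumption; tune it to the verbosity of the actual log.
    """
    hits = []
    last_start = None
    for index, line in enumerate(lines):
        if any(marker in line for marker in START_MARKERS):
            last_start = index
        elif (TIMEOUT_MARKER in line and last_start is not None
              and index - last_start <= window):
            hits.append((last_start, index))
    return hits
```

A possible use is to read "/var/opt/SUNWSMS/adm/platform/messages" on the SC that is becoming MAIN, pass its lines to the function, and inspect any returned index pairs by hand.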

The mode of failure differs depending on which device happens to time out. For example, when a centerplane support board (CSB) times out, messages similar to the following will appear:

    Jul 11 13:15:59 2002 sc0 fomd[2413]: [6117 1446140187195 ERR L2PowerControl.cc 486]
    Error (2) returned from attempting to get power converter readings on CSB at CS0
    Jul 11 13:15:59 2002 sc0 fomd[2413]: [6202 1446143025381 ERR CSBPowerControl.cc 554]
    Failed to determine if CSB at CS0 is on or off.
    Jul 11 13:15:59 2002 sc0 fomd[2413]: [8504 1446143975373 ERR FailoverMgr.cc 3275]
    Error getting power state for CSB at CS0 (ecode = 6202)
    Jul 11 13:16:02 2002 sc0 fomd[2413]: [6117 1449240351512 ERR L2PowerControl.cc 486]
    Error (2) returned from attempting to get power converter readings on EXB at EX16

When a MaxCPU (MCPU) board times out, messages similar to the following will appear:

    Jun 18 17:15:50 2003 sc1 frad[400]: [9916 8844311239199 ERR SeepromInfoPro.cc 1956]
    Seeprom Info HWAD proxy call failed on HPCI at IO4, ecode: 2
    Jun 18 17:16:23 2003 sc1 hwad[349]: [1174 8877339642485 ERR PciComm.cc 232]
    console bus device failed to respond correctly at address 31200000
    Jun 18 17:16:23 2003 sc1 esmd[13577]: [1941 8877343422657 ERR DetectorS.cc 835]
    Failed to read state point core_c_pwr_on0, located on MCPU at IO4: ecode=1174

There may be other indications that this issue has occurred with different devices.


Workaround

Normally, SMS reboots a domain automatically after it experiences a "Dstop". If the domain does not recover automatically, use the "setkeyswitch off" and "setkeyswitch on" commands to recover it. Disabling failover on the System Controllers can also serve as a temporary workaround.
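
The manual recovery sequence can be written out per domain. The sketch below only builds the command strings and does not run anything. It assumes the "-d <domain>" form of setkeyswitch and single-letter domain IDs A through R; both should be checked against the setkeyswitch man page for the installed SMS release. The helper name is hypothetical.

```python
# Single-letter domain IDs (assumption: A through R, the Sun Fire 15K maximum).
VALID_DOMAINS = "ABCDEFGHIJKLMNOPQR"


def recovery_commands(domain_id):
    """Return the setkeyswitch commands, in order, to power-cycle one domain."""
    domain = domain_id.upper()
    if len(domain) != 1 or domain not in VALID_DOMAINS:
        raise ValueError("unexpected domain ID: %r" % domain_id)
    return [
        "setkeyswitch -d %s off" % domain,  # stop the halted domain
        "setkeyswitch -d %s on" % domain,   # bring it back up
    ]


if __name__ == "__main__":
    for command in recovery_commands("a"):
        print(command)
```

Printing the commands rather than executing them keeps the operator in control of when each domain is cycled.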


Resolution

This issue is addressed in the following releases:

SPARC Platform

  • Sun Fire 12K/15K with System Management Software (SMS) 1.2 with patch 112481-14 or later
  • Sun Fire 12K/15K with System Management Software (SMS) 1.3 with patch 114640-06 or later

Note: SMS 1.1 requires an upgrade to a later release.




Modification History


Date: 14-OCT-2003
  • Change State to Resolved
  • Updated Contributing Factors and Resolution sections



Attachments
This solution has no attachment

 
 

Article Details
Article ID : 201458
Article Type : Sun Alert
Last reviewed : 2003-10-14
Audience : PUBLIC