Sun Fire 12K/15K May Experience I2C Timeouts When SMS is Simultaneously Started on Both System Controllers



Category : Availability
Release Phase : Resolved
Product : Sun Fire 12K Server
          Sun Fire 15K Server
          System Management Services 1.3 Software
          System Management Services 1.4.1 Software
Bug Id : 4870812, 5028844
Date of Workaround Release : 01-APR-2004
Date of Resolved Release : 15-APR-2004


Impact

If SMS is started on both system controllers (SCs) at the same time, the main SC cannot obtain control of the I2C bus after SSC POST completes. SMS continues to function normally until the failover state changes from ACTIVE to FAILED, at which point SMS (esmd) shuts down the platform because it can no longer read the power library state correctly.


Contributing Factors

This issue can occur in the following releases:

SPARC Platform

  • Sun Fire 12K/15K with SMS 1.3 (for Solaris 8 and 9) without patch 114640-13
  • Sun Fire 12K/15K with SMS 1.4.1 (for Solaris 8 and 9) without patch 117866-05

Note: SMS 1.1, 1.2 and 1.4 are not affected by this issue.

This issue occurs when System Management Services (SMS) startup is initiated at the same time on both system controllers (SC0 and SC1), either manually from the command line ("/etc/init.d/sms stop" followed by "/etc/init.d/sms start") or automatically as a result of an SC reset or reboot. The simultaneous SMS startup disables both the I2C bus and the console bus, and dsmd then resets the domains because of a perceived loss of power.


Symptoms

If this issue occurs, messages similar to the following appear in the "/var/opt/SUNWSMS/adm/platform/messages" file on the MAIN System Controller:

SMS 1.4.1 events:

Oct 25 19:51:20 2004 sl57v00-sc0 fomd[571]: [8569 112844283028 NOTICE FailoverMgr.cc 1279]
The I2 network test FAILED
Oct 25 19:51:22 2004 sl57v00-sc0 hwad[526]: [1123 115024455829 ERR I2cComm.cc 716]
I2c read time out - bus: 3, address: 20
Oct 25 19:51:22 2004 sl57v00-sc0 hwad[526]: [1123 115809322267 ERR I2cComm.cc 716]
I2c read time out - bus: 4, address: 20
Oct 25 19:51:23 2004 sl57v00-sc0 hwad[526]: [1123 116310497838 ERR I2cComm.cc 716]
I2c read time out -  bus: 5, address: 20

Oct 25 19:52:00 2004 sl57v00-sc0 fomd[571]: [8570 153131011057 NOTICE FailoverMgr.cc 2839] 
Reset the remote SC

Oct 25 19:52:15 2004 sl57v00-sc0 fomd[571]: [6117 168328989841
ERR L2PowerControl.cc 464] Error (1123) returned from attempting to get power converter
readings on CSB at CS0
Oct 25 19:52:15 2004 sl57v00-sc0 fomd[571]: [6202 168338772005
ERR CSBPowerControl.cc 543] Failed to determine if CSB at CS0 is on or off.
Oct 25 19:52:15 2004 sl57v00-sc0 fomd[571]: [8504 168339797190
ERR FailoverMgr.cc 3360] Error getting power state for CSB at CS0 (ecode = 6202)
Oct 25 19:52:16 2004 sl57v00-sc0 hwad[526]: [1123 168844882433
ERR I2cComm.cc 410] I2c read time out -  bus: 21, address: 22
Oct 25 19:52:16 2004 sl57v00-sc0 fomd[571]: [6117 168846086283
ERR L2PowerControl.cc 464] Error (1123) returned from attempting to get power
converter readings on CSB at CS1
Oct 25 19:52:16 2004 sl57v00-sc0 fomd[571]: [6202 168855445252
ERR CSBPowerControl.cc 543] Failed to determine if CSB at CS1 is on or off.
Oct 25 19:52:16 2004 sl57v00-sc0 fomd[571]: [8504 168856506650 ERR
FailoverMgr.cc 3360] Error getting power state for CSB at CS1 (ecode = 6202)
Oct 25 19:52:16 2004 sl57v00-sc0 hwad[526]: [1123 169361539121
ERR I2cComm.cc 410] I2c read time out -  bus: 0, address: 22
Oct 25 19:52:16 2004 sl57v00-sc0 fomd[571]: [6117 169362808566 ERR
L2PowerControl.cc 464] Error (1123) returned from attempting to get power 
converter readings on EXB at EX0
Oct 25 19:52:16 2004 sl57v00-sc0 fomd[571]: [6202 169364788890
ERR EXBPowerControl.cc 878] Failed to determine if EXB at EX0 is on or off.

Oct 25 19:52:29 2004 sl57v00-sc0 esmd[1672]: [6199 182142394363 WARNING
CSBPowerControl.cc 562] Component CSB at CS0 is ON but the Console Bus is not 
enabled.

Oct 25 19:53:15 2004 sl57v00-sc0 dsmd[1661]-A(): [2559 228033758983
NOTICE DomainMon.cc 382] Domain A was in the OS running state when a power 
failure happened
Oct 25 19:53:15 2004 sl57v00-sc0 dsmd[1661]-A(): [2566 228034971576 NOTICE 
DomainMon.cc 400] CS0 power is off
Oct 25 19:53:15 2004 sl57v00-sc0 dsmd[1661]-A(): [2566 228036013383
NOTICE DomainMon.cc 400] CS1 power is off
Oct 25 19:53:26 2004 sl57v00-sc0 dsmd[1661]-A(): [2554 238920133621
NOTICE DomainsPatrol.cc 61] Domain A is no longer running; attempting to 
restore it to the OS running state

SMS 1.3 events:

Mar 11 15:57:30 2004 <host-name> ssd[19831]: [1304 11160492621807820 NOTICE StartupManager.cc 2700] 
software component start-up initiated: name=fomd
Mar 11 15:57:38 2004 <host-name> fomd[19854]: [8599 11160500519702156 NOTICE FMHeartbeat.cc 217] 
Checking for SC heartbeat interrupts (can take up to 30 seconds) ...
Mar 11 15:58:23 2004 <host-name> hwad[19839]: [1123 11160545082921286 ERR I2cComm.cc 410] I2c read 
time out -  bus: 22, address: 25
Mar 11 15:58:23 2004 <host-name> hwad[19839]: [1123 11160545857277268 ERR I2cComm.cc 410] I2c read
time out - bus: 22, address: 25 ...
Mar 11 16:13:41 2004 <host-name> esmd[20092]: [1981 11161463782380791 ERR DetectorV.cc 626] Failed to 
read 5.0VDC, located on SCPER at SCPER0 too many times. Sensor will no longer be monitored.
Mar 11 16:13:41 2004 <host-name> esmd[20092]: [1981 11161463848280646 ERR DetectorV.cc 626] Failed to 
read +12VDC, located on SCPER at SCPER0 too many times. Sensor will no longer be monitored.
Mar 11 16:13:41 2004 <host-name> esmd[20092]: [1981 11161463855862737 ERR DetectorV.cc 626] Failed to 
read 3.3VHK, located on SCPER at SCPER0 too many times. Sensor will no longer be monitored.
Mar 11 16:13:41 2004 <host-name> esmd[20092]: [1982 11161463861798461 ERR SysControl.cc 1244] Too many 
sensors are bad. SCPER at SCPER0 is being deconfigured and powered off. Check all hardware for the
cause.
Mar 11 16:13:41 2004 <host-name> esmd[20092]: [1964 11161463869142203 CRIT esmdUtls.cc 131] Power off 
of SC at SC0 is required, but the failover state is "disabled"; all domains will be shutdown and
the system powered off.
Mar 11 16:13:42 2004 <host-name> esmd[20092]: [1979 11161464020961869 NOTICE Cabinet.cc 1268] Sending
domain shutdown request for all domains.
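
The indicators in the sample messages above could be checked for with a small script. The following is a minimal sketch, not a supported tool: the function names are hypothetical, the patterns are derived only from the sample messages in this alert, and whitespace is matched loosely because log entries may wrap across lines.

```python
import re
from collections import Counter

# Patterns derived from the sample messages in this alert (an assumption:
# other SMS releases may word these messages differently).
I2C_TIMEOUT = re.compile(
    r"I2c\s+read\s+time\s+out\s*-\s*bus:\s*(\d+),\s*address:\s*(\d+)"
)
CONSOLE_BUS = re.compile(
    r"is\s+ON\s+but\s+the\s+Console\s+Bus\s+is\s+not\s+enabled"
)


def scan_platform_log(text):
    """Summarize I2C read timeouts and Console Bus warnings in log text.

    Returns a dict with the number of timeouts per (bus, address) pair
    and a flag that is True if a Console Bus warning was seen.
    """
    timeouts = Counter(
        (int(bus), int(addr)) for bus, addr in I2C_TIMEOUT.findall(text)
    )
    return {
        "i2c_timeouts": dict(timeouts),
        "console_bus_warning": bool(CONSOLE_BUS.search(text)),
    }


def scan_platform_log_file(path="/var/opt/SUNWSMS/adm/platform/messages"):
    """Hypothetical helper: read the platform messages file and scan it."""
    with open(path, errors="replace") as handle:
        return scan_platform_log(handle.read())
```

A non-empty "i2c_timeouts" result only suggests that the symptoms described here may be present; the full message sequence should still be reviewed by hand.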

 


Workaround

To avoid this issue, do not start SMS on both SCs at the same time. If SMS must be started on both SCs, or if a reboot or reset of the SCs is required, wait until one SC completes startup and becomes the main SC before starting the other. The likelihood of this issue can also be reduced by verifying that the eeprom settings of the system controllers are as follows:

    auto-boot?    true
    diag-switch?  true
    On SC0:       diag-level pmax-epmax
    On SC1:       diag-level pmax-epvmax

"pmax-epmax" and "pmax-epvmax" perform the same diagnostic functions; "pmax-epvmax" additionally provides verbose messages. The values are set differently for timing reasons: with "diag-level pmax-epvmax", SC1 takes longer to reboot, which allows SC0 to complete startup first.

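The eeprom settings listed in the Workaround can be compared against an SC's actual values programmatically. The sketch below is illustrative only: it assumes the settings have already been captured as text in "name=value" form (the form the Solaris eeprom command is assumed to print), and the names RECOMMENDED, parse_eeprom and eeprom_mismatches are hypothetical.

```python
# Recommended values from this alert. The diag-level differs between
# SC0 and SC1 on purpose, so that SC1 takes longer to reboot.
RECOMMENDED = {
    "sc0": {"auto-boot?": "true", "diag-switch?": "true",
            "diag-level": "pmax-epmax"},
    "sc1": {"auto-boot?": "true", "diag-switch?": "true",
            "diag-level": "pmax-epvmax"},
}


def parse_eeprom(text):
    """Parse lines of the assumed form name=value into a dict."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep:  # skip lines that do not contain '='
            settings[name.strip()] = value.strip()
    return settings


def eeprom_mismatches(sc, text):
    """Return {name: (expected, actual)} for settings that differ on sc.

    sc is "sc0" or "sc1"; actual is None when the setting is missing.
    """
    actual = parse_eeprom(text)
    return {
        name: (expected, actual.get(name))
        for name, expected in RECOMMENDED[sc].items()
        if actual.get(name) != expected
    }
```

An empty result means the captured settings match the values recommended above for that SC.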

Resolution

This issue is addressed in the following releases:

SPARC Platform

  • Sun Fire 12K/15K with SMS 1.3 (for Solaris 8 and 9) with patch 114640-13 or later
  • Sun Fire 12K/15K with SMS 1.4.1 (for Solaris 8 and 9) with patch 117866-05 or later
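
The "or later" patch rule above amounts to a comparison of patch revisions. A minimal sketch follows, assuming the installed patch IDs have already been collected as strings of the form "114640-13" (for example from patch listing output on the SC); the names REQUIRED, parse_patch and has_fix are hypothetical.

```python
# Minimum patch levels from the Resolution section of this alert.
REQUIRED = {"1.3": "114640-13", "1.4.1": "117866-05"}


def parse_patch(patch_id):
    """Split a patch ID such as '114640-13' into (base, revision) integers."""
    base, _, revision = patch_id.strip().partition("-")
    return int(base), int(revision)


def has_fix(sms_version, installed_patches):
    """Check whether the required patch, or a later revision, is installed.

    Returns True or False for SMS 1.3 and 1.4.1, and None for any SMS
    version that this alert does not list a patch for.
    """
    required = REQUIRED.get(sms_version)
    if required is None:
        return None
    req_base, req_rev = parse_patch(required)
    return any(
        base == req_base and rev >= req_rev
        for base, rev in map(parse_patch, installed_patches)
    )
```

For example, has_fix("1.3", ["114640-12"]) is False, while has_fix("1.3", ["114640-14"]) is True because revision 14 is later than the required revision 13.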



Modification History


Date: 25-OCT-2004
  • Clarification on "Impact" and "Symptoms" statements

Date: 15-APR-2004
  • Updated Contributing Factors and Resolution sections for patch release
  • Re-release as Resolved

Date: 21-JUN-2005
  • Updated Impact, Contributing Factors, Workaround, and Resolution sections



Attachments
This solution has no attachment

 
 

Article Details
Article ID : 228538
Article Type : Sun Alert
Last reviewed : 2005-06-21
Audience : PUBLIC