CPU0/CPU1 May Be Disabled on Sun Fire 12K/15K System Boards Resulting in Domain Interruption
Category: Availability
Release Phase: Resolved
Product: Sun Fire 12K Server, Sun Fire 15K Server
Bug Id: 4830870, 4865526
Date of Workaround Release: 18-APR-2003
Date of Resolved Release: 08-JAN-2004
Impact
System Management Software (SMS) may disable the CPU0/CPU1 pair on Sun Fire 12K/15K System Boards due to a false over-voltage reading for CPU1. There will be a domain interruption while SMS brings down the domain, removes the CPU0/CPU1 pair from service, and brings the domain back up.
Contributing Factors
This issue can occur in the following releases:
SPARC Platform
- Sun Fire 12K/15K with SMS 1.1
- Sun Fire 12K/15K with SMS 1.2 without patch 112481-15
- Sun Fire 12K/15K with SMS 1.3 without patch 114640-08
Symptoms
When this issue occurs, SMS disables the CPU0/CPU1 pair on one of the System Boards. The SMS "showcomponent" command then shows CPU0/CPU1 (PP0) as blacklisted:
% showcomponent -a
Component PROCPAIR at SB6/PP0 is disabled in specified blacklist:
# ESMD High-Maximum Voltage 0313.1238.15
The following type of message will appear in the SMS log file "/var/opt/SUNWSMS/adm/platform/messages":
Apr 8 18:49:49 2003 ... esmd[630]: [...] A high voltage has been
detected on Core1, located on CPU at SB6. The voltage detected is
1.45v; should be 1.31v to 1.44v. PROCPAIR at SB6/PP0 is being
removed from the domain and powered off. Check all hardware for
the cause.
Apr 8 18:49:50 2003 ... esmd[630]: [...] Component PROCPAIR at
SB6/PP0 has been blacklisted
Note: This issue only occurs for CPU1, due to its position on the System Board.
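Because these log entries wrap across several lines, they are easier to search once each entry is joined back together. The following Python sketch shows one way to extract the board, core, PROCPAIR, and voltages from such text. The function name `find_high_voltage_events` and the regular expression are illustrative assumptions modelled only on the sample message shown here; they are not part of SMS.

```python
import re

# Pattern modelled on the sample esmd message in this document (an assumption,
# not an official SMS log grammar).
HIGH_VOLTAGE = re.compile(
    r"A high voltage has been detected on Core(?P<core>\d+), "
    r"located on CPU at (?P<board>SB\d+)\. "
    r"The voltage detected is (?P<detected>\d+\.\d+)v; "
    r"should be (?P<low>\d+\.\d+)v to (?P<high>\d+\.\d+)v\. "
    r"PROCPAIR at (?P<procpair>SB\d+/PP\d+)"
)


def find_high_voltage_events(log_text):
    """Return one dict per high-voltage message found in log_text."""
    # Entries wrap across lines, so collapse all whitespace before matching.
    flat = " ".join(log_text.split())
    events = []
    for m in HIGH_VOLTAGE.finditer(flat):
        events.append({
            "board": m.group("board"),
            "core": int(m.group("core")),
            "procpair": m.group("procpair"),
            "detected": float(m.group("detected")),
            "low": float(m.group("low")),
            "high": float(m.group("high")),
        })
    return events


# Sample text copied from the log excerpt above.
sample = """Apr 8 18:49:49 2003 ... esmd[630]: [...] A high voltage has been
detected on Core1, located on CPU at SB6. The voltage detected is
1.45v; should be 1.31v to 1.44v. PROCPAIR at SB6/PP0 is being
removed from the domain and powered off. Check all hardware for
the cause."""

for event in find_high_voltage_events(sample):
    print(event)
```

In practice the text would be read from the platform messages file rather than from an inline string.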
Workaround
If CPU0/CPU1 have been disabled, they can be re-enabled with:
# enablecomponent -a SB6/PP0
Note: Use appropriate System Board and processor pair values.
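Where several boards are affected, the board and processor pair values can be taken from the "showcomponent -a" output. The Python sketch below builds the corresponding "enablecomponent" command lines as text only; it does not run them. The function name `enable_commands` is hypothetical, the parsing is based solely on the sample output in this document, and the second entry in the sample input (SB3/PP1 and its reason line) is invented to show that entries blacklisted for other reasons are skipped.

```python
import re

# Matches the "disabled" line format seen in the sample showcomponent output
# in this document (an assumption about the general format).
DISABLED = re.compile(r"Component PROCPAIR at (SB\d+/PP\d+) is disabled")


def enable_commands(showcomponent_output):
    """Build enablecomponent command lines for ESMD voltage blacklistings."""
    commands = []
    pending = None  # most recent disabled PROCPAIR awaiting its reason line
    for line in showcomponent_output.splitlines():
        match = DISABLED.search(line)
        if match:
            pending = match.group(1)
        elif pending and line.lstrip().startswith("#"):
            # Only act on entries whose reason names an ESMD voltage event.
            if "ESMD" in line and "Voltage" in line:
                commands.append("enablecomponent -a " + pending)
            pending = None
    return commands


# First entry is from this document; the second is a made-up example.
sample = """Component PROCPAIR at SB6/PP0 is disabled in specified blacklist:
# ESMD High-Maximum Voltage 0313.1238.15
Component PROCPAIR at SB3/PP1 is disabled in specified blacklist:
# hypothetical unrelated reason
"""

for command in enable_commands(sample):
    print(command)
```

Printing the commands instead of executing them leaves the decision to re-enable each component with the administrator.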
Resolution
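This issue is addressed in the following releases:
SPARC Platform
- Sun Fire 12K/15K with SMS 1.2 with patch 112481-15
- Sun Fire 12K/15K with SMS 1.3 with patch 114640-08
Note: SMS 1.1 requires an upgrade to a later release with the appropriate patches.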
The above resolution changes the behavior of SMS. When a processor core voltage is detected above or below a warning threshold, a message is logged to the SMS platform messages file (/var/opt/SUNWSMS/adm/platform/messages). The board continues to run and the domain is unaffected.
An example warning message is shown below. Such messages are most easily found by searching the platform messages file for the string "Core".
Dec 17 16:41:45 2003 platform-sc0 esmd[564]: [0 108891482234403
ERR DetectorV.cc 645] A high voltage has been detected on Core3,
located on CPU at SB7. The voltage detected is 1.48v; should be 1.31v
to 1.47v. PROCPAIR at SB7/PP1 will be removed from the domain and
powered off if it rises above 1.65v.
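The warning message carries three figures of interest: the detected voltage, the expected range, and the voltage above which the PROCPAIR will be removed. The Python sketch below extracts them and reports how far the detected voltage is from the removal limit. The function name `warning_headroom` and the regular expression are assumptions based only on the high-voltage example above; the wording of a low-voltage warning is not shown in this document, so it is not handled.

```python
import re

# Pattern modelled on the sample warning message in this document (an
# assumption, not an official SMS log grammar). High-voltage warnings only.
WARNING = re.compile(
    r"detected on Core(?P<core>\d+), located on CPU at (?P<board>SB\d+)\. "
    r"The voltage detected is (?P<detected>\d+\.\d+)v; "
    r"should be (?P<low>\d+\.\d+)v to (?P<high>\d+\.\d+)v\. "
    r"PROCPAIR at (?P<procpair>SB\d+/PP\d+) will be removed from the domain "
    r"and powered off if it rises above (?P<limit>\d+\.\d+)v"
)


def warning_headroom(log_text):
    """Return (procpair, detected, limit, headroom) for each warning found."""
    # Entries wrap across lines, so collapse all whitespace before matching.
    flat = " ".join(log_text.split())
    results = []
    for m in WARNING.finditer(flat):
        detected = float(m.group("detected"))
        limit = float(m.group("limit"))
        # Round to the two decimal places used in the log message.
        headroom = round(limit - detected, 2)
        results.append((m.group("procpair"), detected, limit, headroom))
    return results


# Sample text copied from the warning message above.
sample = """Dec 17 16:41:45 2003 platform-sc0 esmd[564]: [0 108891482234403
ERR DetectorV.cc 645] A high voltage has been detected on Core3,
located on CPU at SB7. The voltage detected is 1.48v; should be 1.31v
to 1.47v. PROCPAIR at SB7/PP1 will be removed from the domain and
powered off if it rises above 1.65v."""

for procpair, detected, limit, headroom in warning_headroom(sample):
    print(procpair, detected, limit, headroom)
```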
The warning condition can also be observed with the SMS "showenvironment" command. For example:
% showenvironment -p volts | grep Core
CPU at SB6 pcf8591 Core 0 Volt 1.65 V 55.8 sec OK
CPU at SB6 pcf8591 Core 1 Volt 1.65 V 55.8 sec OK
CPU at SB6 pcf8591 Core 2 Volt 1.63 V 55.8 sec OK
CPU at SB6 pcf8591 Core 3 Volt 1.63 V 55.8 sec OK
CPU at SB7 pcf8591 Core 0 Volt 1.40 V 45.6 sec OK
CPU at SB7 pcf8591 Core 1 Volt 1.43 V 45.6 sec OK
CPU at SB7 pcf8591 Core 2 Volt 1.41 V 45.6 sec OK
CPU at SB7 pcf8591 Core 3 Volt 1.48 V 45.6 sec HIGH_WARN <---Note the high warning
CPU at SB8 pcf8591 Core 0 Volt 1.66 V 55.1 sec OK
CPU at SB8 pcf8591 Core 1 Volt 1.64 V 55.1 sec OK
CPU at SB8 pcf8591 Core 2 Volt 1.64 V 55.1 sec OK
CPU at SB8 pcf8591 Core 3 Volt 1.63 V 55.1 sec OK
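On a platform with many System Boards, a single warning row is easy to miss in this output. The Python sketch below filters saved "showenvironment -p volts" output for core voltage rows whose status is anything other than OK. The function name `cores_not_ok` is hypothetical, and the row layout is inferred only from the sample output in this document, so real output may need a different pattern.

```python
import re

# Row layout inferred from the sample output in this document (an assumption).
ROW = re.compile(
    r"CPU at (?P<board>SB\d+)\s+\S+\s+Core\s+(?P<core>\d+)\s+Volt\s+"
    r"(?P<volts>\d+\.\d+)\s+V\s+\S+\s+sec\s+(?P<status>\S+)"
)


def cores_not_ok(showenvironment_output):
    """Return (board, core, volts, status) for each row whose status is not OK."""
    flagged = []
    for line in showenvironment_output.splitlines():
        m = ROW.search(line)
        if m and m.group("status") != "OK":
            flagged.append((m.group("board"), int(m.group("core")),
                            float(m.group("volts")), m.group("status")))
    return flagged


# Three rows taken from the sample output above.
sample = """\
CPU at SB7 pcf8591 Core 2 Volt 1.41 V 45.6 sec OK
CPU at SB7 pcf8591 Core 3 Volt 1.48 V 45.6 sec HIGH_WARN
CPU at SB8 pcf8591 Core 0 Volt 1.66 V 55.1 sec OK
"""

print(cores_not_ok(sample))
```

The filter tests for "not OK" instead of naming specific status strings, because only HIGH_WARN appears in this document.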
When either a high or low warning condition is found, contact your authorized Sun Services Representative as soon as the condition is detected to have the System Board replaced.
Modification History
Date: 30-APR-2003
- Updated Impact
- Updated Symptoms
- Updated Synopsis
Date: 14-MAY-2003
- Updated Relief/Workaround section
Date: 15-MAY-2003
- Updated Relief/Workaround section
Date: 04-JUN-2003
- Updated Relief/Workaround section
Date: 08-JAN-2004
- Updated Avoidance
- Updated State: Resolved
- Updated Contributing Factors, Relief/Workaround and Resolution sections
Date: 15-JAN-2004