Sun Systems Equipped With Schizo ASICs Version 2.3 or Higher May Experience Either Domain Stop (Dstop), Domain Pause or FATAL RESET Under Heavy I/O

Category: Availability
Release Phase: Resolved
Product: Solaris 9 Operating System, Solaris 8 Operating System
Bug Id: 4897386
Date of Workaround Release: 13-APR-2004
Date of Resolved Release: 12-JUL-2004
Impact
Sun systems with Schizo ASIC version 2.3 or higher may encounter a Domain Stop (Dstop) on high-end servers (Sun Fire 12K/15K/20K/25K), a Domain Pause on Sun Fire 6800 servers, or a FATAL RESET on desktop servers (Sun Fire 280R, V480/V490/V880/V890, and Netra 20) during heavy I/O activity.
Contributing Factors
This issue can occur in the following releases:
SPARC Platform
on the following systems which have Schizo ASIC 2.3 or higher:
- Sun Fire 12K/15K/20K/25K
- Sun Fire 6800 with Sun Fire Link
- Sun Fire V480/V490
- Sun Fire 280R
- Sun Fire V880/V890
- Netra 20
- Sun Blade 1000/2000
The described issue may occur on Schizo versions 2.3 or higher.
To determine whether a system is equipped with a Schizo ASIC, and which version, use the command for the relevant platform below:
For Sun Fire 12K/15K/20K/25K servers:
% redx -x "shioc 5 1 0" | grep Component
PCI IOC IO05/P0 Component ID = 107BE06D TO_2.3
In the "shioc 5 1 0" argument of the command, "5" is the expander, "1" is the slot (always 1 for I/O boards), and "0" is the Schizo. In the output, "TO_2.3" indicates Schizo version 2.3.
Note: Using "redx" on a production system needs to be done carefully. Incorrect usage can cause a loss of availability.
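When "redx" output has been captured for several expanders, the version check can be scripted. The following is a minimal Python sketch, assuming every Component line follows the same "TO_<major>.<minor>" layout as the single sample line above. The function names are illustrative and are not part of any Sun tool.

```python
import re

# Pattern assumed from the one sample line in this SunAlert:
#   PCI IOC IO05/P0 Component ID = 107BE06D TO_2.3
COMPONENT_RE = re.compile(r"Component ID\s*=\s*[0-9A-Fa-f]+\s+TO_(\d+)\.(\d+)")

def schizo_version_from_redx(line):
    """Return the Schizo version as a (major, minor) tuple, or None if absent."""
    match = COMPONENT_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def is_affected(version):
    """This issue applies to Schizo version 2.3 or higher."""
    return version is not None and version >= (2, 3)

sample = "PCI IOC IO05/P0 Component ID = 107BE06D TO_2.3"
version = schizo_version_from_redx(sample)
print(version, is_affected(version))
```

The sketch only reads text that has already been captured; it does not run "redx" itself, so the caution in the note above still applies to collecting the data.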
For Sun Fire 6800 systems:
# /usr/platform/sun4u/sbin/prtdiag -v
                              Port
FRU Name     Model            ID    Status   Version
/N0/IB6/P0   SUNW,schizo      24    ok       4   <--Schizo 2.2
/N0/IB6/P1   SUNW,schizo      25    ok       4
/N0/IB9/P0   SUNW,schizo      30    ok       5   <--Schizo 2.3
/N0/IB6/P0   SUNW,sgsbbc      24    ok       2
/N0/IB9/P0   SUNW,sgsbbc      30    ok
Use the following reference to determine the Schizo ASIC version:
Version Schizo Rev
5 2.3
4 2.2
Note 1: Only Sun Fire 6800 systems with Sun Fire Link utilize Schizo ASIC version 2.3.
Note 2: The above command must be run in each domain.
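The lookup from the prtdiag "Version" column to the Schizo revision can be sketched in Python as follows. This assumes the row layout shown in the sample output above (FRU name, model, port ID, status, version, separated by whitespace). The names are illustrative only.

```python
# Version column -> Schizo revision, taken from the reference table above.
SCHIZO_REV_BY_VERSION = {4: "2.2", 5: "2.3"}

def schizo_revs(prtdiag_text):
    """Map each SUNW,schizo FRU name to its Schizo revision string."""
    revs = {}
    for line in prtdiag_text.splitlines():
        fields = line.split()
        # Keep only rows whose Model column is SUNW,schizo.
        if len(fields) < 5 or fields[1] != "SUNW,schizo":
            continue
        fru, version = fields[0], fields[4]
        if version.isdigit():
            revs[fru] = SCHIZO_REV_BY_VERSION.get(int(version), "unknown")
    return revs

sample = """\
/N0/IB6/P0 SUNW,schizo 24 ok 4
/N0/IB6/P1 SUNW,schizo 25 ok 4
/N0/IB9/P0 SUNW,schizo 30 ok 5
/N0/IB6/P0 SUNW,sgsbbc 24 ok 2
"""
for fru, rev in schizo_revs(sample).items():
    print(fru, rev)
```

Any FRU reported with revision 2.3 falls within the scope of this issue. Versions outside the reference table are reported as "unknown" rather than guessed.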
For Sun Fire 280R, V480/V490, V880/V890 and Netra 20 systems:
# /usr/platform/sun4u/sbin/prtdiag -v
IO ASIC revisions:
                    Port
Brd     Model       ID    Status   Version
IB-1    unknown     8     ok       4
IB-1    unknown     9     ok       4
Use the following reference to determine the Schizo ASIC version:
Version Schizo Rev
7 2.5
6 2.4
5 2.3
4 2.2
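On these systems the Model column reads "unknown", so rows have to be located by the "IO ASIC revisions:" heading rather than by model name. The Python sketch below assumes the five-column layout of the sample above and assumes that the section ends at the first blank line after its data rows. Both are assumptions about prtdiag output, not documented behavior.

```python
# Version column -> Schizo revision, taken from the reference table above.
REV_BY_VERSION = {4: "2.2", 5: "2.3", 6: "2.4", 7: "2.5"}

def io_asic_revs(prtdiag_text):
    """Return (board, port_id, revision, affected) for each IO ASIC row."""
    rows = []
    in_section = False
    for line in prtdiag_text.splitlines():
        if line.strip().startswith("IO ASIC revisions"):
            in_section = True
            continue
        if not in_section:
            continue
        if not line.strip():
            if rows:
                break  # assumed end of the section
            continue
        fields = line.split()
        # Data rows: Brd, Model, Port ID, Status, Version. Header rows are skipped
        # because their third and fifth fields are not numeric.
        if len(fields) == 5 and fields[2].isdigit() and fields[4].isdigit():
            version = int(fields[4])
            rev = REV_BY_VERSION.get(version, "unknown")
            # Version 5 corresponds to Schizo 2.3, the lowest affected revision.
            rows.append((fields[0], int(fields[2]), rev, version >= 5))
    return rows

sample = """\
IO ASIC revisions:
Port
Brd Model ID Status Version
IB-1 unknown 8 ok 4
IB-1 unknown 9 ok 4
"""
for row in io_asic_revs(sample):
    print(row)
```

For the sample output above, both ports report version 4 (Schizo 2.2), so neither is flagged as affected.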
Symptoms
Should the described issue occur, Sun Fire 280R, V480, V880 and Netra 20 systems may encounter a FATAL RESET. The reset output can only be seen by attaching console logging to ttya. The only indication the OS may give of a FATAL RESET is a one-line message on reboot, "System booting after fatal error FATAL", which is logged in the "/var/adm/messages" file.
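A quick way to check whether a system has logged this message is to search the messages file for the exact string. The following Python sketch does this on text that has already been read. The log lines in the sample are hypothetical and only illustrate the search; real entries will carry the system's own timestamp and host name.

```python
FATAL_MARKER = "System booting after fatal error FATAL"

def fatal_reset_lines(log_text):
    """Return every log line that contains the FATAL RESET reboot message."""
    return [line for line in log_text.splitlines() if FATAL_MARKER in line]

# Hypothetical sample lines, for illustration only.
sample = """\
Jan 17 02:25:10 host1 unix: System booting after fatal error FATAL
Jan 17 02:25:11 host1 unix: some unrelated message
"""
for line in fatal_reset_lines(sample):
    print(line)
```

To apply this to a live system, read the contents of "/var/adm/messages" and pass them to `fatal_reset_lines`. A match shows only that a FATAL RESET occurred, not that this issue caused it.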
The example below is the partial console output of a FATAL RESET. It has been edited (reduced) for the purpose of fitting within this SunAlert.
ERROR: System "FATAL RESET" from CPU0 CPU2 CPU3 CPU5 CPU6 CPU7
System State (CPU7 reporting)
CPU1 Config/Control/Status registers:
CPUVersion: 003e.0015.2200.0507
SafConfig: 0caa.01bc.0003.0002
SafBaseAdr: 0000.0400.0080.0000
DCacheCtl: 0000.0000.0000.0000
ECacheCtl: 0000.0000.0329.4400
ECErrEnable: 0000.0000.0000.000b
AFAR: 0000.00d3.fe82.f990
AFSR: 0008.0000.0000.0000 PERR
AFAR 2: 0000.00d3.fe82.f990
AFSR 2: 0008.0000.0000.0000 PERR
DMMU SFAR: 0000.0003.810b.0e70
DMMU SFSR: 0000.0000.0080.8000 TM
IMMU SFSR: 0000.0000.0080.8000 TM
When a FATAL RESET occurs, gather the following information: the system configuration, what the system was doing at the time of the failure (this issue has not been seen on an idle system), and the console log, which is needed to decode the FATAL RESET output and help determine the cause.
Note: FATAL RESETs have many possible causes, in both hardware and software, and it should not be assumed that every FATAL RESET is caused by this issue. This issue must be confirmed or ruled out before proceeding.
A Domain Pause on an affected Sun Fire 6800 would look similar to the following in the showlogs output:
Sat Jan 17 02:17:51 noname-sc0 Domain-B.SC: [ID 920266 local0.crit]
ErrorMonitor: Domain A has a SYSTEM ERROR
Sat Jan 17 02:17:51 noname-sc0 Domain-B.SC: [ID 920266 local0.crit]
ErrorMonitor: Domain A has a SYSTEM ERROR
Sat Jan 17 02:17:51 noname-sc0 Domain-B.SC: [ID 137548 local0.error]
/N0/SB0 encountered the first error
Sat Jan 17 02:17:51 noname-sc0 Domain-B.SC: [ID 329171 local0.error]
RepeaterSbbcAsic reported first error on /N0/SB0
Sat Jan 17 02:17:51 noname-sc0 Domain-B.SC: [ID 172596 local0.error]
/SB0/bbcGroup1/sbbc1:
>>> ErrorStatus[0x80] : 0x80008200
FE [15:15] : 0x1
ErrSum [31:31] : 0x1
SafErr [09:08] : 0x2 Fireplane device asserted an error
Sat Jan 17 02:17:51 noname-sc0 Domain-B.SC: [ID 885793 local0.error]
/SB0/bbcGroup1/cpuCD/cpusafariagent1:
AFAR (high)[0x531] : 0x0000000c
AFAR [42:32] [10:00] : 0xc
AFAR (low)[0x541] : 0x06000c00
AFAR_2 (high)[0x571] : 0x0000000c
AFAR_2 [42:32] [10:00] : 0xc
AFAR_2 (low)[0x581] : 0x06000c00
AFSR (high)[0x551] : 0x00080000
PERR [19:19] : 0x1
AFSR_2 (high)[0x591] : 0x00080000
PERR [19:19] : 0x1
EMU A[0x501] : 0x20000000
UDT [29:29] : 0x1 Undefined DTransI
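The bracketed ranges in the showlogs output, such as "[09:08]" or "[19:19]", are bit positions within the register value printed on the line above them. The Python sketch below reproduces the decoded fields from the raw values in this example. The helper name `bits` is illustrative.

```python
def bits(value, high, low):
    """Extract the inclusive bit field value[high:low]."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)

error_status = 0x80008200   # ErrorStatus[0x80]
afsr_high = 0x00080000      # AFSR (high)[0x551]
emu_a = 0x20000000          # EMU A[0x501]

print("FE     [15:15] =", hex(bits(error_status, 15, 15)))
print("ErrSum [31:31] =", hex(bits(error_status, 31, 31)))
print("SafErr [09:08] =", hex(bits(error_status, 9, 8)))
print("PERR   [19:19] =", hex(bits(afsr_high, 19, 19)))
print("UDT    [29:29] =", hex(bits(emu_a, 29, 29)))
```

Each value matches the corresponding line in the log: FE, ErrSum, PERR and UDT are 0x1, and SafErr is 0x2.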
Sun Fire 12K/15K systems may Dstop. An examination of the resulting "dsmd.dstop.yymmdd.hhmm.ss" file in the "/var/opt/SUNWSMS/adm/X/dump" directory (where "X" is the domain letter) using the "redx" command will produce output similar to the following example:
redx> wfail
All master SDIs in this dump indicating valid error info [1CC50]
indicate the first error was Dstop for all of EXB EX16.
SDI EX16/S0 Master_Stop_Status0[31:0] = B004008F
MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
SDI EX16/S0 Dstop0[31:0] = 20098008
Dstop0[16]: D DARB texp requests all Dstop (M)
Dstop0[19]: D 1E SDI internal core requested Dstop
Dstop0[29]: D Slot1 asserted Error, enabled to cause Dstop (M)
SDI EX16/S0 Core_Error0[31:0] = 02008200 Mask = 0051FFFF
CoreErr0[25]: D 1E Command pool timeout, non-split exp (M)
valid_{slot_wr[1:0],read}_TO = 1 (rev 4+)
{cmd_pool_loc[5:0],cmd4io,retired,half_used} = 024
FAIL EXB EX16: Dstop/Rstop detected by SDI EX16/S0.
The FRU for this failure cannot be identified from the available information.
This error is not diagnosable. The FAIL action is just a guess to
satisfy the POST design requirement that something must be
deconfigured after a stop to guarantee that the process terminates.
The FAILed component is no more suspect than any other hardware
in the domain.
DARB C0: enabled ports (expanders) [17:0]: 3FFFF
DARB C0: other darb req Dstop+Rstop for exps[17:0]: 10000
DARB C1: enabled ports (expanders) [17:0]: 3FFFF
DARB C1: other darb req Dstop+Rstop for exps[17:0]: 10000
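The register values in the "wfail" output can be cross-checked by listing which bits are set. The Python sketch below does this for the values in the example. Reading the DARB "exps[17:0]" value as one bit per expander is an assumption based on the label, though it agrees with the EX16 reported in the output.

```python
def set_bits(value):
    """Return the positions of all bits set in value, lowest first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]

dstop0 = 0x20098008        # SDI EX16/S0 Dstop0[31:0]
core_error0 = 0x02008200   # SDI EX16/S0 Core_Error0[31:0]
darb_dstop_req = 0x10000   # other darb req Dstop+Rstop for exps[17:0]

print("Dstop0 bits:     ", set_bits(dstop0))
print("Core_Error0 bits:", set_bits(core_error0))
print("DARB expanders:  ", set_bits(darb_dstop_req))
```

Bits 16, 19 and 29 of Dstop0 and bit 25 of Core_Error0 are the ones described in the output above. The other set bits are not described in this excerpt. The single bit set in the DARB value is bit 16, which corresponds to expander 16.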
Workaround
There is no workaround. Please see the "Resolution" section below.
Resolution
This issue is addressed in the following releases:
SPARC Platform
Modification History
Date: 28-MAY-2004
- Updated Contributing Factors and Relief/Workaround sections
Date: 08-JUN-2004
- Updated Contributing Factors section
Date: 17-JUN-2004
- Updated Contributing Factors and Relief/Workaround sections
Date: 12-JUL-2004
- State: Resolved
- Updated Contributing Factors and Resolution sections
Date: 10-NOV-2005
- Updated Contributing Factors section
Attachments
This solution has no attachment