Sun Systems Equipped With Schizo ASICs Version 2.3 or Higher May Experience Either Domain Stop (Dstop), Domain Pause or FATAL RESET Under Heavy I/O



Category : Availability
Release Phase : Resolved
Product : Solaris 9 Operating System, Solaris 8 Operating System
Bug Id : 4897386
Date of Workaround Release : 13-APR-2004
Date of Resolved Release : 12-JUL-2004


Impact

Sun systems with Schizo ASIC version 2.3 or higher may encounter one of the following during heavy I/O activity: a Domain Stop (Dstop) on high-end servers (Sun Fire 12K/15K/20K/25K), a Domain Pause on Sun Fire 6800, or a FATAL RESET on desktop servers (Sun Fire 280R, V480/V490/V880/V890, and Netra 20).


Contributing Factors

This issue can occur in the following releases:

SPARC Platform

on the following systems which have Schizo ASIC 2.3 or higher:

  • Sun Fire 12K/15K/20K/25K
  • Sun Fire 6800 with Sun Fire Link
  • Sun Fire V480/V490
  • Sun Fire 280R
  • Sun Fire V880/V890
  • Netra 20
  • Sun Blade 1000/2000

The described issue may occur on Schizo versions 2.3 or higher.

To determine whether a system is equipped with a Schizo ASIC, and which version, use the command for the relevant platform below:

For Sun Fire 12K/15K/20K/25K servers:

    % redx -x "shioc 5 1 0" | grep Component    
   PCI IOC IO05/P0   Component ID = 107BE06D    TO_2.3

In the "shioc 5 1 0" argument of the command, "5" is the expander, "1" is the slot (always 1 for I/O boards), and "0" is the Schizo. In the output, "TO_2.3" indicates Schizo version 2.3.
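
The version token can also be extracted from the "Component" line with a small filter. The following is a minimal sketch, assuming the version always appears as a "TO_<major>.<minor>" token as in the sample output above. The function name "schizo_rev_from_redx" is hypothetical and is not part of "redx".

```shell
# Hypothetical helper (not part of redx): read "shioc" output on stdin and
# print the Schizo revision taken from the "TO_<rev>" token on each
# Component line. Assumes the token format shown in this alert.
schizo_rev_from_redx() {
    sed -n '/Component/s/.*TO_\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Example using the sample line from this alert (redx itself is not run):
printf '%s\n' '   PCI IOC IO05/P0   Component ID = 107BE06D    TO_2.3' |
    schizo_rev_from_redx
```

With the sample line, the filter prints "2.3". Lines without a "TO_" token produce no output.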

Note: Use "redx" on a production system with care. Incorrect usage can cause a loss of availability.

For Sun Fire 6800 systems:

    # /usr/platform/sun4u/sbin/prtdiag -v
                                 Port
    FRU Name    Model            ID  Status Version
    /N0/IB6/P0  SUNW,schizo      24   ok     4 <--Schizo 2.2
    /N0/IB6/P1  SUNW,schizo      25   ok     4
    /N0/IB9/P0  SUNW,schizo      30   ok     5 <--Schizo 2.3
    /N0/IB6/P0  SUNW,sgsbbc      24   ok     2
    /N0/IB9/P0  SUNW,sgsbbc      30   ok 

Use the following reference to determine the Schizo ASIC version:

    Version        Schizo Rev
       5             2.3
       4             2.2

Note 1: Only Sun Fire 6800 systems with Sun Fire Link utilize Schizo ASIC version 2.3.

Note 2: The above command must be executed in each domain.
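
The lookup against the reference table can be scripted. The following is a minimal sketch, assuming the "prtdiag -v" column layout shown in the sample above (FRU name, model, port ID, status, version). The function name "schizo_rev_6800" is hypothetical, and only the two version codes listed in the table are mapped.

```shell
# Hypothetical helper: read "prtdiag -v" output on stdin and, for every
# SUNW,schizo line, print the FRU name and the Schizo revision looked up
# from the Version column (4 -> 2.2, 5 -> 2.3, per the table above).
schizo_rev_6800() {
    awk '$2 == "SUNW,schizo" {
        if ($5 == 4)      rev = "2.2"
        else if ($5 == 5) rev = "2.3"
        else              rev = "unknown"   # code not listed in this alert
        print $1, rev
    }'
}
```

A possible invocation in each domain is "/usr/platform/sun4u/sbin/prtdiag -v | schizo_rev_6800". Lines for other models, such as SUNW,sgsbbc, are ignored.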

For Sun Fire 280R, V480/V490, V880/V890 and Netra 20 systems:

    # /usr/platform/sun4u/sbin/prtdiag -v
  
    IO ASIC revisions:
                             Port
    Brd  Model            ID  Status Version
    IB-1 unknown           8    ok     4
    IB-1 unknown           9    ok     4

Use the following reference to determine the Schizo ASIC version:

    Version        Schizo Rev
       7             2.5
       6             2.4
       5             2.3
       4             2.2
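
On these systems the model column may read "unknown", so the entries have to be located by section rather than by model name. The following is a minimal sketch, assuming the "IO ASIC revisions:" layout shown in the sample above and that the section ends at the next blank line. The function name "schizo_check_io_asics" is hypothetical.

```shell
# Hypothetical helper: read "prtdiag -v" output on stdin, look only at the
# "IO ASIC revisions:" section, and report the Schizo revision of each entry.
# Codes 4..7 map to 2.2..2.5 per the table above.
schizo_check_io_asics() {
    awk '
        /IO ASIC revisions:/ { in_section = 1; next }
        in_section && NF == 0 { in_section = 0 }    # assumed end of section
        in_section && NF == 5 && $5 ~ /^[0-9]+$/ {
            code = $5 + 0
            if (code == 4) {
                rev = "2.2"; note = "below 2.3"
            } else if (code >= 5 && code <= 7) {
                rev = "2." (code - 2); note = "2.3 or higher"
            } else {
                rev = "unknown"; note = "code not listed in this alert"
            }
            print $1 " port " $3 ": Schizo " rev " (" note ")"
        }'
}
```

A possible invocation is "/usr/platform/sun4u/sbin/prtdiag -v | schizo_check_io_asics". Entries reported as "2.3 or higher" fall within the range described in this alert.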

Symptoms

Should the described issue occur, Sun Fire 280R, V480, V880 and Netra 20 systems may encounter a FATAL RESET. The reset output can be seen only by attaching console logging to ttya. The only indication the OS may retain of a FATAL RESET is a one-line message logged on reboot in the "/var/adm/messages" file: "System booting after fatal error FATAL".
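
A quick check for that reboot message can be scripted. The following is a minimal sketch; the function name "count_fatal_reset_boots" is hypothetical, and the search string is the message text quoted above.

```shell
# Hypothetical helper: count "fatal error FATAL" boot messages in a log file.
# The default path is the one named in this alert; pass another path to
# inspect a saved copy of the log.
count_fatal_reset_boots() {
    grep -c 'System booting after fatal error FATAL' "${1:-/var/adm/messages}"
}
```

A non-zero count indicates only that a FATAL RESET occurred; it does not by itself identify this issue as the cause.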

The example below is the partial console output of a FATAL RESET. It has been edited (reduced) for the purpose of fitting within this SunAlert.

    ERROR: System "FATAL RESET" from  CPU0 CPU2 CPU3 CPU5 CPU6 CPU7
    System State (CPU7 reporting)

    CPU1 Config/Control/Status registers: 

    CPUVersion:  003e.0015.2200.0507
    SafConfig:   0caa.01bc.0003.0002
    SafBaseAdr:  0000.0400.0080.0000
    DCacheCtl:   0000.0000.0000.0000
    ECacheCtl:   0000.0000.0329.4400
    ECErrEnable: 0000.0000.0000.000b

    AFAR:        0000.00d3.fe82.f990
    AFSR:        0008.0000.0000.0000 PERR
    AFAR 2:      0000.00d3.fe82.f990
    AFSR 2:      0008.0000.0000.0000 PERR

    DMMU SFAR:   0000.0003.810b.0e70
    DMMU SFSR:   0000.0000.0080.8000 TM
    IMMU SFSR:   0000.0000.0080.8000 TM

When a FATAL RESET occurs, gather the following information: the system configuration, what the system was doing at the time of the failure (this issue has not been seen on an idle system), and the console log, which is needed to decode the FATAL RESET output and help determine the cause.

Note: FATAL RESETs have many possible causes, from hardware to software, and it should not be assumed that every FATAL RESET is caused by this issue. This issue must be confirmed or ruled out before proceeding.

A Domain Pause on an affected Sun Fire 6800 would look similar to the following in the showlogs output:

    Sat Jan 17 02:17:51 noname-sc0 Domain-B.SC: [ID 920266 local0.crit] 
    ErrorMonitor: Domain A has a SYSTEM ERROR
    Sat Jan 17 02:17:51 noname-sc0 Domain-B.SC: [ID 920266 local0.crit] 
    ErrorMonitor: Domain A has a SYSTEM ERROR
    Sat Jan 17 02:17:51 noname-sc0 Domain-B.SC: [ID 137548 local0.error] 
    /N0/SB0 encountered the first error
    Sat Jan 17 02:17:51 noname-sc0 Domain-B.SC: [ID 329171 local0.error] 
    RepeaterSbbcAsic reported first error on /N0/SB0
    Sat Jan 17 02:17:51 noname-sc0 Domain-B.SC: [ID 172596 local0.error]
    /SB0/bbcGroup1/sbbc1:
    >>> ErrorStatus[0x80] : 0x80008200
                   FE [15:15] : 0x1
                   ErrSum [31:31] : 0x1
                   SafErr [09:08] : 0x2 Fireplane device asserted an error

    Sat Jan 17 02:17:51 noname-sc0 Domain-B.SC: [ID 885793 local0.error]
    /SB0/bbcGroup1/cpuCD/cpusafariagent1:
         AFAR (high)[0x531] : 0x0000000c
             AFAR [42:32] [10:00] : 0xc
         AFAR (low)[0x541] : 0x06000c00
         AFAR_2 (high)[0x571] : 0x0000000c
           AFAR_2 [42:32] [10:00] : 0xc
         AFAR_2 (low)[0x581] : 0x06000c00
         AFSR (high)[0x551] : 0x00080000
             PERR [19:19] : 0x1
         AFSR_2 (high)[0x591] : 0x00080000
             PERR [19:19] : 0x1
         EMU A[0x501] : 0x20000000
             UDT [29:29] : 0x1 Undefined DTransI
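
To see which domains reported a system error in a saved copy of the showlogs output, a filter such as the following may help. This is a minimal sketch, assuming the "ErrorMonitor" message format shown in the sample above; the function name "domains_with_system_error" is hypothetical.

```shell
# Hypothetical helper: read showlogs output on stdin and print each domain
# letter that appears in an "ErrorMonitor: Domain X has a SYSTEM ERROR"
# message, one per line, without duplicates.
domains_with_system_error() {
    sed -n 's/.*ErrorMonitor: Domain \([A-Z]\) has a SYSTEM ERROR.*/\1/p' |
        sort -u
}
```

For the sample above, the filter prints "A".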

Sun Fire 12K/15K systems may Dstop. An examination of the resulting "dsmd.dstop.yymmdd.hhmm.ss" file in the "/var/opt/SUNWSMS/adm/X/dump" directory (where "X" is the domain letter) using the "redx" command will produce output similar to the following example:

    redx> wfail
    All master SDIs in this dump indicating valid error info [1CC50]
    indicate the first error was Dstop for all of EXB EX16.
    SDI EX16/S0  Master_Stop_Status0[31:0] = B004008F
        MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
    SDI EX16/S0  Dstop0[31:0] = 20098008
        Dstop0[16]: D    DARB texp requests all Dstop (M)
        Dstop0[19]: D 1E SDI internal core requested Dstop
        Dstop0[29]: D    Slot1 asserted Error, enabled to cause Dstop (M)
    SDI EX16/S0  Core_Error0[31:0]  = 02008200  Mask = 0051FFFF
        CoreErr0[25]: D 1E Command pool timeout, non-split exp (M)
            valid_{slot_wr[1:0],read}_TO = 1 (rev 4+)
            {cmd_pool_loc[5:0],cmd4io,retired,half_used} = 024
    FAIL EXB EX16:  Dstop/Rstop detected by SDI EX16/S0.
    The FRU for this failure cannot be identified from the available information.
        This error is not diagnosable. The FAIL action is just a guess to
        satisfy the POST design requirement that something must be
        deconfigured after a stop to guarantee that the process terminates.
        The FAILed component is no more suspect than any other hardware
        in the domain.
    DARB C0: enabled ports (expanders)          [17:0]: 3FFFF
    DARB C0: other darb req Dstop+Rstop for exps[17:0]: 10000
    DARB C1: enabled ports (expanders)          [17:0]: 3FFFF
    DARB C1: other darb req Dstop+Rstop for exps[17:0]: 10000
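
The dump files for a domain can be listed before they are examined with "redx". The following is a minimal sketch, assuming the directory layout and the "dsmd.dstop.*" file naming given above; the function name "list_dstop_dumps" is hypothetical.

```shell
# Hypothetical helper: list Dstop dump files for one domain, newest first.
# $1 = domain letter
# $2 = SMS adm directory (defaults to the path named in this alert)
list_dstop_dumps() {
    ls -t "${2:-/var/opt/SUNWSMS/adm}/$1"/dump/dsmd.dstop.* 2>/dev/null
}
```

For example, "list_dstop_dumps A" lists the Dstop dumps for domain A. No output means that no matching dump files were found.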

Workaround

There is no workaround. Please see the "Resolution" section below.


Resolution

This issue is addressed in the following releases:

SPARC Platform




Modification History


Date: 28-MAY-2004
  • Updated Contributing Factors and Relief/Workaround sections

Date: 08-JUN-2004
  • Updated Contributing Factors section

Date: 17-JUN-2004
  • Updated Contributing Factors and Relief/Workaround sections

Date: 12-JUL-2004
  • State: Resolved
  • Updated Contributing Factors and Resolution sections

Date: 10-NOV-2005
  • Updated Contributing Factors section



Attachments
This solution has no attachment

 
 

Article Details
Article ID : 200502
Article Type : Sun Alert
Last reviewed : 2005-11-10
Audience : PUBLIC