Sun Cluster 3.x Nodes May Panic Upon Controller Failure/Replacement Within Sun StorEdge 3510/3511 Arrays



Category: Availability
Release Phase: Resolved
Product: Sun StorageTek 3510 FC Array, Sun StorageTek 3511 SATA Array
Bug Id: 6321239, 6365819
Date of Workaround Release: 01-DEC-2005
Date of Resolved Release: 15-JUN-2006


Impact

When a controller in a Sun StorEdge 3510/3511 array fails or is replaced, all nodes connected to the array in a Sun Cluster 3.x environment may panic.


Contributing Factors

This issue can occur on the following platforms:

SPARC Platform

  • Sun StorEdge 3510 FC array with firmware version 4.11I/4.13C (as delivered in patch 113723-10/113723-11) and without firmware 4.15F (as delivered in patch 113723-15)
  • Sun StorEdge 3511 SATA array with firmware version 4.11I/4.13C (as delivered in patch 113724-04/113724-05) and without firmware 4.15F (as delivered in patch 113724-09)

This issue occurs only when LUN filtering is enabled, and only in cluster configurations that issue SCSI-2 reservations (for example, two-node clusters), including:

  • Sun Cluster 3.x on Solaris 8, Solaris 9 and Solaris 10
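
The contributing factors above can be combined into a simple check. The following is a minimal sketch, assuming the firmware version is available as a plain string such as "4.13C"; the function and parameter names are hypothetical and are not part of any Sun tool.

```python
# Firmware versions this alert names as affected.
# Assumption: the version is compared as an exact string.
AFFECTED_FIRMWARE = {"4.11I", "4.13C"}


def may_be_affected(firmware: str, lun_filtering: bool,
                    scsi2_reservations: bool) -> bool:
    """Return True only when every contributing factor is present."""
    return (firmware in AFFECTED_FIRMWARE
            and lun_filtering
            and scsi2_reservations)


# Two-node cluster (SCSI-2 reservations) with LUN filtering on 4.13C.
print(may_be_affected("4.13C", lun_filtering=True, scsi2_reservations=True))
# Same array after LUN filtering is disabled.
print(may_be_affected("4.13C", lun_filtering=False, scsi2_reservations=True))
```

A firmware string not listed in this alert returns False here, which means only that the alert does not name it, not that the version has been verified as safe.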


Symptoms

Should the described issue occur, all the nodes of the cluster will panic with a Reservation Conflict similar to the following:

    sun50-node:/var/crash/sun50-node #
    panic[cpu19]/thread=2a1001e5d20: Reservation Conflict

    000002a1001e57d0 ssd:ssd_mhd_watch_cb+c4 (3000c318808, 0, 7600000042, 300172f5578, 300172f55a8, 0)
      %l0-3: 000000007829c740 00000300164f3d98 000003000081f2c8 000003000727e320
      %l4-7: 0000030005b9b5e8 0000000000000000 0000000000000002 00000000ff08dffc
    000002a1001e5880 scsi:scsi_watch_request_intr+140 (0, 0, 30015f873c0, 300164f3d98, 0, 300172f5530)
      %l0-3: 000000001034aadc 000003000081f2c8 0000030005b9b5e8 000003002cdc0f30
      %l4-7: 000003000727e320 00000000782a81ac 00000300172f55a8 000003002b73e000
    000002a1001e5950 qlc:qlc_task_thread+698 (300008207f0, 300008207e8, ff00, 300008207f2, 783c3240, 783c3250)
      %l0-3: 000000007829920c 00000000783c3260 00000000783c3270 000000000001ff80
      %l4-7: 0000030000820ac0 000003100b5ef180 00000300008207e8 00000300008207c8
    000002a1001e5a60 qlc:qlc_task_daemon+70 (300008207e8, 300008207c8, 300008207f2, 104640c0, 30000820af8, 30000820afa)
      %l0-3: 0000030000820ae0 00000310002fdb20 0000000000000000 0000000010408000
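
When reviewing console logs or crash output from several nodes, the panic string can be matched mechanically. The following is a minimal sketch, assuming the panic line has the same shape as in the excerpt above; the pattern and function name are illustrative only.

```python
import re

# Assumption: the panic line looks like the one in the excerpt above,
# e.g. "panic[cpu19]/thread=2a1001e5d20: Reservation Conflict".
PANIC_RE = re.compile(
    r"panic\[cpu\d+\]/thread=[0-9a-f]+: Reservation Conflict")


def has_reservation_conflict_panic(text: str) -> bool:
    """Return True if the text contains a Reservation Conflict panic line."""
    return PANIC_RE.search(text) is not None


sample = "panic[cpu19]/thread=2a1001e5d20: Reservation Conflict"
print(has_reservation_conflict_panic(sample))
```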

Note: After the nodes boot following the panic, they will not be able to see the LUNs from the Sun StorEdge 3510/3511 array. Both nodes will show the drives as "<drive not available: reserved>" when using format(1M).

Only after the Sun StorEdge 3510/3511 array is reset and the nodes are rebooted will everything return to normal.

The following example shows how the controller firmware, by handling the reservation of a nexus (controller, target, LUN) at the LUN level, can cause a reservation conflict when LUN filtering is in use.

Example from a "show map" output:

    Ch Tgt LUN   ld/lv   ID-Partition  Assigned  Filter Map
    -------------------------------------------------------------------
    0  40     0   ld0    24A193C9-00   Primary   210000E08B13AC6F {HBA-1}
    0  40     1   ld0    24A193C9-02   Primary   210000E08B13AC6F {HBA-1}  <--
    0  40     1   ld0    24A193C9-02   Primary   210000E08B133FC2 {HBA-2}  <--
    0  40     2   ld0    24A193C9-03   Primary   210000E08B13AC63 {HBA-3}  <--
    0  40     2   ld0    24A193C9-05   Primary   210000E08B133FC4 {HBA-4}  <--

Note that LUN 1 presents the same partition, 24A193C9-02, to two different initiators/HBAs, {HBA-1} and {HBA-2}.

Note that LUN 2 presents two different partitions, 24A193C9-03 and 24A193C9-05, to two different initiators/HBAs, {HBA-3} and {HBA-4} respectively.

During a controller failure/reset, a reservation held on one nexus can be asserted on another nexus that has the same LUN number.
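
To find mappings where one LUN number is presented to more than one initiator, the "show map" output can be grouped by channel, target and LUN. The following is a minimal sketch, assuming the column layout matches the excerpt above (real output may differ); the function name is hypothetical. A flagged LUN is only a candidate for the conflict described here, not proof of a fault.

```python
from collections import defaultdict

# Excerpt from this alert, including the header rows.
SHOW_MAP = """\
Ch Tgt LUN   ld/lv   ID-Partition  Assigned  Filter Map
-------------------------------------------------------------------
0  40     0   ld0    24A193C9-00   Primary   210000E08B13AC6F {HBA-1}
0  40     1   ld0    24A193C9-02   Primary   210000E08B13AC6F {HBA-1}
0  40     1   ld0    24A193C9-02   Primary   210000E08B133FC2 {HBA-2}
0  40     2   ld0    24A193C9-03   Primary   210000E08B13AC63 {HBA-3}
0  40     2   ld0    24A193C9-05   Primary   210000E08B133FC4 {HBA-4}
"""


def shared_luns(show_map_text):
    """Return {(ch, tgt, lun): [(wwn, partition), ...]} for every
    channel/target/LUN that is filtered to more than one initiator WWN."""
    mappings = defaultdict(set)
    for line in show_map_text.splitlines():
        fields = line.split()
        # Skip the header, the separator and any short or blank lines.
        if len(fields) < 7 or not all(f.isdigit() for f in fields[:3]):
            continue
        ch, tgt, lun, _ld, partition, _assigned, wwn = fields[:7]
        mappings[(int(ch), int(tgt), int(lun))].add((wwn, partition))
    return {key: sorted(rows) for key, rows in mappings.items()
            if len({wwn for wwn, _ in rows}) > 1}


for key in sorted(shared_luns(SHOW_MAP)):
    print(key)
```

For the excerpt above, this reports LUN 1 and LUN 2 on channel 0, target 40, which are the two cases called out in the notes.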

A few cases have been reported in which partitioning or repartitioning a logical drive caused the reservation panic. While the controller failure/reset issue is understood, the root cause of the partition/repartition issue is still under investigation.


Workaround

To work around the described issue, disable LUN filtering and use switch zoning. Instructions for LUN filtering can be found in the following documents:

Sun StorEdge 3000 Family CLI 2.x User's Guide at: http://www.sun.com/products-n-solutions/hardware/docs/html/817-4951-14

Sun StorEdge 3000 Family RAID Firmware 4.1x User's Guide at: http://www.sun.com/products-n-solutions/hardware/docs/html/817-3711-14

For switch zoning, consult the switch manufacturer's documentation.


Resolution

This issue is addressed on the following platforms:

  • Sun StorEdge 3510 FC array with firmware 4.15F (as delivered in patch 113723-15 or later)
  • Sun StorEdge 3511 SATA array with firmware 4.15F (as delivered in patch 113724-09 or later)
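
Whether an installed patch revision carries the fix can be checked by comparing its revision number with the minimum listed above. The following is a minimal sketch, assuming patch IDs are given in the usual "base-revision" form; the names are hypothetical, and only the two patches named in this alert are covered.

```python
# Minimum patch revision that delivers firmware 4.15F, per this alert.
FIXED_PATCH_REV = {
    "113723": 15,  # Sun StorEdge 3510 FC array
    "113724": 9,   # Sun StorEdge 3511 SATA array
}


def patch_has_fix(patch_id: str) -> bool:
    """Return True if patch_id (e.g. "113723-15") is at or above the
    revision that this alert lists as resolving the issue."""
    base, _, rev = patch_id.partition("-")
    if base not in FIXED_PATCH_REV or not rev.isdigit():
        raise ValueError(f"not a patch covered by this alert: {patch_id!r}")
    return int(rev) >= FIXED_PATCH_REV[base]


print(patch_has_fix("113723-11"))  # revision listed as affected
print(patch_has_fix("113724-09"))  # revision listed as resolved
```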



Modification History


Date: 12-JAN-2006
  • Updated Contributing Factors and Relief/Workaround sections

Date: 25-APR-2006
  • Updated Contributing Factors and Resolution sections

Date: 15-JUN-2006
  • State: Resolved
  • Updated Contributing Factors and Resolution sections



Attachments
This solution has no attachment

 
 

Article Details
Article ID: 201561
Article Type: Sun Alert
Last reviewed: 2006-06-20
Audience: PUBLIC