Sun Cluster 3.x Nodes May Panic Upon Controller Failure/Replacement Within Sun StorEdge 3510/3511 Arrays
Category: Availability
Release Phase: Resolved
Product: Sun StorageTek 3510 FC Array, Sun StorageTek 3511 SATA Array
Bug Id: 6321239, 6365819
Date of Workaround Release: 01-DEC-2005
Date of Resolved Release: 15-JUN-2006
Impact
Upon a controller failure/replacement on a Sun StorEdge 3510/3511 array, all the nodes connected in a Sun Cluster 3.x environment may panic.
Contributing Factors
This issue can occur on the following platforms:
SPARC Platform
- Sun StorEdge 3510 FC array with firmware version 4.11I/4.13C (as delivered in patch 113723-10/113723-11) and without firmware 4.15F (as delivered in patch 113723-15)
- Sun StorEdge 3511 SATA array with firmware version 4.11I/4.13C (as delivered in patch 113724-04/113724-05) and without firmware 4.15F (as delivered in patch 113724-09)
This issue occurs only when LUN filtering is enabled, and only in cluster configurations that issue SCSI-2 reservations (for example, two-node clusters), including:
- Sun Cluster 3.x on Solaris 8, Solaris 9 and Solaris 10
Symptoms
Should the described issue occur, all the nodes of the cluster will panic with a Reservation Conflict similar to the following:
sun50-node:/var/crash/sun50-node #
panic[cpu19]/thread=2a1001e5d20: Reservation Conflict
000002a1001e57d0 ssd:ssd_mhd_watch_cb+c4 (3000c318808, 0, 7600000042, 300172f5578, 300172f55a8, 0)
%l0-3: 000000007829c740 00000300164f3d98 000003000081f2c8 000003000727e320
%l4-7: 0000030005b9b5e8 0000000000000000 0000000000000002 00000000ff08dffc
000002a1001e5880 scsi:scsi_watch_request_intr+140 (0, 0, 30015f873c0, 300164f3d98, 0, 300172f5530)
%l0-3: 000000001034aadc 000003000081f2c8 0000030005b9b5e8 000003002cdc0f30
%l4-7: 000003000727e320 00000000782a81ac 00000300172f55a8 000003002b73e000
000002a1001e5950 qlc:qlc_task_thread+698 (300008207f0, 300008207e8, ff00, 300008207f2, 783c3240, 783c3250)
%l0-3: 000000007829920c 00000000783c3260 00000000783c3270 000000000001ff80
%l4-7: 0000030000820ac0 000003100b5ef180 00000300008207e8 00000300008207c8
000002a1001e5a60 qlc:qlc_task_daemon+70 (300008207e8, 300008207c8, 300008207f2, 104640c0, 30000820af8, 30000820afa)
%l0-3: 0000030000820ae0 00000310002fdb20 0000000000000000 0000000010408000
Note: After the nodes boot following a panic, they will not be able to see the LUNs from the Sun StorEdge 3510/3511 array. Both nodes will show the drives as "<drive not available: reserved>" when using format(1M).
Everything returns to normal only after the Sun StorEdge 3510/3511 array is reset and the nodes are rebooted.
The following example shows how the controller firmware's handling of a reservation on a nexus (controller, target, LUN) at the LUN level can cause the reservation conflict when LUN filtering is in use.
Example from a "show map" output:
Ch  Tgt  LUN  ld/lv  ID-Partition  Assigned  Filter Map
-------------------------------------------------------------------
0   40   0    ld0    24A193C9-00   Primary   210000E08B13AC6F {HBA-1}
0   40   1    ld0    24A193C9-02   Primary   210000E08B13AC6F {HBA-1} <--
0   40   1    ld0    24A193C9-02   Primary   210000E08B133FC2 {HBA-2} <--
0   40   2    ld0    24A193C9-03   Primary   210000E08B13AC63 {HBA-3} <--
0   40   2    ld0    24A193C9-05   Primary   210000E08B133FC4 {HBA-4} <--
Note that LUN #1 presents the same partition, 24A193C9-02, to two different initiators/HBAs, {HBA-1} and {HBA-2}.
Note that LUN #2 presents two different partitions, 24A193C9-03 and 24A193C9-05, to two different initiators/HBAs, {HBA-3} and {HBA-4}.
During a controller failure/reset, a reservation held on one nexus can be asserted on another nexus that has the same LUN number.
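The condition described above (one channel/target/LUN number filtered to more than one initiator) can be spotted mechanically. The following is a minimal Python sketch, not a Sun-supplied tool: it assumes the "show map" text has the same column layout as the example in this document, and the function name shared_luns is hypothetical. Real CLI output may differ and would need the parsing adjusted.

```python
from collections import defaultdict

# Sample rows copied from the "show map" example in this document.
SHOW_MAP = """\
Ch Tgt LUN ld/lv ID-Partition Assigned Filter Map
-------------------------------------------------------------------
0 40 0 ld0 24A193C9-00 Primary 210000E08B13AC6F {HBA-1}
0 40 1 ld0 24A193C9-02 Primary 210000E08B13AC6F {HBA-1}
0 40 1 ld0 24A193C9-02 Primary 210000E08B133FC2 {HBA-2}
0 40 2 ld0 24A193C9-03 Primary 210000E08B13AC63 {HBA-3}
0 40 2 ld0 24A193C9-05 Primary 210000E08B133FC4 {HBA-4}
"""


def shared_luns(show_map_text):
    """Return {(ch, tgt, lun): [(partition, wwn), ...]} for every
    channel/target/LUN that is filtered to more than one initiator WWN.

    Assumes whitespace-separated columns in the order shown in the
    example: Ch, Tgt, LUN, ld/lv, ID-Partition, Assigned, Filter WWN.
    """
    by_nexus = defaultdict(list)
    for line in show_map_text.splitlines():
        fields = line.split()
        # Skip the header, the separator, and any row without a filter WWN.
        if len(fields) < 7 or not all(f.isdigit() for f in fields[:3]):
            continue
        ch, tgt, lun, _ld, partition, _assigned, wwn = fields[:7]
        by_nexus[(int(ch), int(tgt), int(lun))].append((partition, wwn))
    # Keep only the nexus entries that more than one distinct WWN can reach.
    return {
        key: rows
        for key, rows in by_nexus.items()
        if len({wwn for _, wwn in rows}) > 1
    }


if __name__ == "__main__":
    for (ch, tgt, lun), rows in sorted(shared_luns(SHOW_MAP).items()):
        print(f"Ch {ch} Tgt {tgt} LUN {lun} is filtered to {len(rows)} initiators:")
        for partition, wwn in rows:
            print(f"  partition {partition} -> WWN {wwn}")
```

Run against the example map, the sketch reports LUN 1 and LUN 2 on channel 0, target 40, which are the rows marked with "<--". It only flags candidates for review; it does not query the array or the cluster.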
A few cases have been reported in which partitioning or repartitioning a logical drive caused the reservation panic. While the issue with controller failure/reset is understood, root-cause analysis of the partition/repartition issue is still in progress.
Workaround
To work around the described issue, disable LUN filtering and use switch zoning instead. Instructions for LUN filtering can be found in:
- Sun StorEdge 3000 Family CLI 2.x User's Guide at: http://www.sun.com/products-n-solutions/hardware/docs/html/817-4951-14
- Sun StorEdge 3000 Family RAID Firmware 4.1x User's Guide at: http://www.sun.com/products-n-solutions/hardware/docs/html/817-3711-14
For switch zoning, consult the switch manufacturer's documentation.
Resolution
This issue is addressed on the following platforms:
- Sun StorEdge 3510 FC array with firmware 4.15F (as delivered in patch 113723-15 or later)
- Sun StorEdge 3511 SATA array with firmware 4.15F (as delivered in patch 113724-09 or later)
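The "or later" comparison above can be expressed as a short check. This is an illustrative Python sketch only: the patch bases and minimum revisions are taken from the list above, while the names FIXED_PATCHES and patch_has_fix are hypothetical, and how the installed patch ID is obtained from the array is outside its scope.

```python
# Minimum patch revisions that deliver firmware 4.15F, per the Resolution list:
# 113723 is the 3510 FC array patch, 113724 is the 3511 SATA array patch.
FIXED_PATCHES = {"113723": 15, "113724": 9}


def patch_has_fix(patch_id):
    """Return True if patch_id (for example "113723-15") is at or above
    the revision that delivers firmware 4.15F.

    Raises ValueError for malformed IDs or patch bases this alert
    does not cover.
    """
    base, _, rev = patch_id.strip().partition("-")
    if base not in FIXED_PATCHES or not rev.isdigit():
        raise ValueError(f"unrecognized patch id: {patch_id!r}")
    # Compare revisions numerically so that "09" < "15" behaves as expected.
    return int(rev) >= FIXED_PATCHES[base]


if __name__ == "__main__":
    for pid in ("113723-10", "113723-15", "113724-05", "113724-09"):
        state = "contains the fix" if patch_has_fix(pid) else "is affected"
        print(f"{pid} {state}")
```

The revisions are compared as integers rather than strings so that zero-padded values such as "09" sort correctly.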
Modification History
Date: 12-JAN-2006
- Updated Contributing Factors, Relief/Workaround
Date: 25-APR-2006
- Updated Contributing Factors and Resolution sections
Date: 15-JUN-2006
- State: Resolved
- Updated Contributing Factors and Resolution sections