Insufficient Information for Recovery From Double Drive Failure for Sun StorEdge 33x0/35xx Arrays

Category: Data Loss
Release Phase: Resolved
Product: Sun StorageTek 3310 SCSI Array, Sun StorageTek 3510 FC Array, Sun StorageTek 3320 SCSI Array, Sun StorageTek 3511 SATA Array
Bug Id: 6364526
Date of Workaround Release: 12-JAN-2006
Date of Resolved Release: 26-APR-2006

Impact
In the "Troubleshooting" section (8.5) of the "Sun StorEdge 3000 Family Installation, Operation, and Service Manual" (part number 816-7300-17), the instructions for "Recovering From Fatal Drive Failure" are incomplete. Failure to use the correct procedure for this condition (as outlined in the "Workaround" section, step 6) may result in data integrity issues.
Note: Please also see related Sun Alert 102126 - "Recovery Behavior From Fatal Drive Failure May Lead to Data Integrity Issues"
Contributing Factors
This issue can occur on the following platforms, for all current releases of controller firmware:
- Sun StorEdge 3310 SCSI array
- Sun StorEdge 3320 SCSI array
- Sun StorEdge 3510 FC array
- Sun StorEdge 3511 SATA array
Existing documentation for "Recovering From Fatal Drive Failure" (section 8.5) is incomplete.
This issue refers to incomplete reconstruction of logical drives and logical drives in a "dead" state, or where the logical drive status is "FATAL FAIL," indicating that more than one drive is bad.
Symptoms
If a reset is issued to the array before the correct drive(s) are pulled from a dead logical drive, the failed drive(s) may assume the role of valid disk(s) in the logical drive after the reset. (These drives may contain stale data, depending on I/O activity to the array.)
Section 8.5 of the "Troubleshooting" section of service manual 816-7300-17 describes "Recovering From Fatal Drive Failure" with the following steps:
1. Discontinue all I/O activity immediately.
2. Cancel the beeping alarm from the controller firmware's "Main Menu" by choosing "System Functions - Mute Beeper".
3. Physically check that all drives are firmly seated in the array and that none have been partially or completely removed.
4. Look for Status: "FATAL FAIL" (two or more failed drives).
In the firmware "Main Menu", choose "View and Edit Logical drives," and look for:
Status: FAILED DRV (one failed drive)
Status: FATAL FAIL (two or more failed drives)
5. Highlight the logical drive, press Return, and choose "View scsi Drives".
If two physical drives have an issue, one drive has a BAD status and one drive has a MISSING status. The MISSING status is a reminder that one of the drives might be a "false" failure. (The status does not tell you which drive might be a false failure.)
In the next step (6), the controller should be shut down to flush cache followed by a reset of the controller.
Note: The workaround in this Sun Alert details how to identify/pull drives based on the RAID level. These steps are not mentioned in the current documentation.
Before proceeding with Step 6, it is recommended that you pull one or more drives. Which drives to pull depends on the RAID level of the logical drive and on whether the first drive failure can be determined.
6. From the "Main Menu", choose "System Functions - Shutdown Controller" and then choose "Yes" to confirm that you want to shut down the controller.
The controller can then be reset.
Workaround
A) For a logical drive configured for RAID 3 or 5, if the order of drive failure can be determined in the case of a double drive failure, then use the following steps:
1. Pull only the original/first failed drive (the first failure can be determined via the controller event log). Also note the location of the other bad drive(s).
2. Reset the controller. Use the shutdown controller menu option and choose "yes" when the "Reset Controller?" prompt is displayed. When the system comes back up, view the logical drives and verify that the status has changed from "FATAL FAIL" to "degraded".
3. If the logical drive has changed to "degraded," run fsck(1M) or equivalent.
(If the status is still "FATAL FAIL", you might have lost all data on the logical drive, and it might be necessary to re-create the logical drive. Follow step 8 in the Troubleshooting section 8.5 for this case).
4. After the fsck(1M) completes successfully, reinsert the pulled drive OR replace with a new (good) drive if the event log indicates the drive should be replaced.
5. Rebuild the logical drive.
B) For a logical drive configured for RAID 1, if there is only one bad drive in a paired set, pull that drive and proceed to step 2. If both drives in a paired set are failed, then follow these steps:
1. Pull the original/first failed drive in each failed RAID 1 pair (the first failure can be determined via the controller event log). Also note the location of the other bad drive(s).
2. Reset the controller. Use the shutdown controller menu option and choose "yes" when the "Reset Controller?" prompt is displayed. When the system comes back up, view the logical drives and verify that the status has changed from "FATAL FAIL" to "degraded".
3. If the logical drive has changed to "degraded," run fsck(1M) or equivalent.
If the status is still "FATAL FAIL", you might have lost all data on the logical drive, and it might be necessary to re-create the logical drive. Follow step 8 in the Troubleshooting section 8.5 for this case.
4. After the fsck(1M) completes successfully, reinsert the pulled drive OR replace with a new (good) drive if the event log indicates the drive should be replaced.
5. Rebuild the logical drive.
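The drive-selection rule in step 1 of cases A and B can be sketched in code. This is only an illustration of the decision logic, not a tool that talks to the array: the `Failure` record, the `log_order` field, the slot numbers, and the `drives_to_pull` function are all hypothetical names introduced here, and the failure order is assumed to have been read manually from the controller event log.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Failure:
    slot: int                  # hypothetical physical slot number of the failed drive
    log_order: Optional[int]   # position in the event log (lower = earlier); None if unknown


def first_failed(failures: Sequence[Failure]) -> Failure:
    """Return the earliest failure, or raise if the order is not known."""
    if not failures:
        raise ValueError("no failed drives given")
    if any(f.log_order is None for f in failures):
        raise ValueError("failure order cannot be determined")
    return min(failures, key=lambda f: f.log_order)


def drives_to_pull(raid_level: int,
                   failures: Sequence[Failure],
                   pairs: Sequence[Tuple[int, int]] = ()) -> List[int]:
    """Slots to pull before resetting the controller (cases A and B only)."""
    if raid_level in (3, 5):
        # Case A: pull only the original/first failed drive.
        return [first_failed(failures).slot]
    if raid_level == 1:
        to_pull = []
        for pair in pairs:
            failed_in_pair = [f for f in failures if f.slot in pair]
            if len(failed_in_pair) == 1:
                # Only one bad drive in this pair: pull that drive.
                to_pull.append(failed_in_pair[0].slot)
            elif len(failed_in_pair) == 2:
                # Both drives failed: pull the first failed drive of the pair.
                to_pull.append(first_failed(failed_in_pair).slot)
        return to_pull
    raise ValueError(f"RAID level {raid_level} is not covered by this sketch")
```

For example, `drives_to_pull(5, [Failure(3, 2), Failure(7, 1)])` selects slot 7, because its failure appears first in the (assumed) event log ordering. The sketch raises an error rather than guessing when the failure order is unknown, since cases A and B apply only when the first failure can be determined.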
C) For a logical drive configured for RAID 5, if there is one missing drive, or multiple failures with only one failure in a paired set, then that drive can be replaced; otherwise, only the first missing drive should be replaced.
If it cannot be determined which drive failed first, then a file system check should be run on the array after the reset, as there may be data inconsistencies.
Note: It is important that you check your recovered data using the application or host-based tools following a "FATAL FAIL" recovery.
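The post-reset check in steps 2 and 3 of cases A and B amounts to a small decision on the logical drive status. A minimal sketch follows, assuming the status has been read manually from the firmware menu and typed in as a string; the function name `next_action` and the returned wording are hypothetical and only paraphrase the steps above.

```python
def next_action(status_after_reset: str) -> str:
    """Map the logical drive status seen after the reset to the next step."""
    status = status_after_reset.strip().upper()
    if status == "DEGRADED":
        # Steps 3 to 5: check the file system, then restore the drive and rebuild.
        return ("run fsck(1M) or equivalent; then reinsert or replace "
                "the pulled drive and rebuild the logical drive")
    if status == "FATAL FAIL":
        # Data on the logical drive might be lost.
        return ("logical drive may need to be re-created; "
                "follow step 8 in Troubleshooting section 8.5")
    return "status not covered by this sketch; inspect the logical drive manually"
```

Whatever the outcome, the recovered data should still be verified with the application or host-based tools, as the note above states.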
Resolution
A final resolution has been completed with the updated instructions in the "Sun StorEdge 3000 Family Installation, Operation, and Service Manual" (PN: 816-7300-19 or later) at http://docs.sun.com/app/docs?q=7300-19.
Modification History
Date: 26-APR-2006
26-Apr-2006:
- Updated Contributing Factors and Resolution Sections
Attachments
This solution has no attachment