Insufficient Information for Recovery From Double Drive Failure for Sun StorEdge 33x0/35xx Arrays



Category: Data Loss
Release Phase: Resolved
Product: Sun StorageTek 3310 SCSI Array
         Sun StorageTek 3510 FC Array
         Sun StorageTek 3320 SCSI Array
         Sun StorageTek 3511 SATA Array
Bug Id: 6364526
Date of Workaround Release: 12-JAN-2006
Date of Resolved Release: 26-APR-2006


Impact

In the "Troubleshooting" section of the "Sun StorEdge 3000 Family Installation, Operation, and Service Manual" (part number 816-7300-17), the instructions for "Recovering From Fatal Drive Failure" (section 8.5) are incomplete. Failure to use the correct procedure for this condition (as outlined in the "Workaround" section, step 6) may result in data integrity issues.

Note: Please also see related Sun Alert 102126 - "Recovery Behavior From Fatal Drive Failure May Lead to Data Integrity Issues"


Contributing Factors

This issue can occur on the following platforms:

  • Sun StorEdge 3310 SCSI array
  • Sun StorEdge 3320 SCSI array
  • Sun StorEdge 3510 FC array
  • Sun StorEdge 3511 SATA array

for all current releases of controller firmware.

Existing documentation for "Recovering From Fatal Drive Failure" (section 8.5) is incomplete.

This issue applies to logical drives whose reconstruction is incomplete, to logical drives in a "dead" state, and to logical drives whose status is "FATAL FAIL," which indicates that more than one drive is bad.


Symptoms

If a reset is issued to the array before the correct drive(s) in a dead logical drive are pulled, those drive(s) may assume the role of valid disk(s) in the logical drive after the reset. (These drives may contain stale data, depending on I/O activity to the array.)

Section 8.5 of the "Troubleshooting" section of service manual 816-7300-17 describes "Recovering From Fatal Drive Failure" with the following steps:

1. Discontinue all I/O activity immediately.

2. Cancel the beeping alarm: from the controller firmware's "Main Menu", choose "System Functions - Mute Beeper".

3. Physically check that all drives are firmly seated in the array and that none have been partially or completely removed.

4. Look for Status: "FATAL FAIL" (two or more failed drives).

In the firmware "Main Menu", choose "View and Edit Logical drives," and look for:

    Status: FAILED DRV (one failed drive)
    Status: FATAL FAIL (two or more failed drives)

5. Highlight the logical drive, press Return, and choose "View scsi Drives".

If two physical drives have an issue, one drive has a BAD status and one drive has a MISSING status. The MISSING status is a reminder that one of the drives might be a "false" failure. (The status does not tell you which drive might be a false failure.)

In the next step (6), the controller should be shut down to flush cache followed by a reset of the controller.

Note: The workaround in this Sun Alert details how to identify/pull drives based on the RAID level. These steps are not mentioned in the current documentation.

Before proceeding with Step 6, it is recommended that you pull one or more drives depending on the type of RAID the logical drive is configured with, and if the first drive failure can be determined.

6. From the "Main Menu", choose "System Functions - Shutdown Controller" and then choose "Yes" to confirm that you want to shut down the controller.

The controller can then be reset.
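
As a minimal sketch of the status check in step 4 and the recommendation to pull drives before step 6, the following Python models the two status strings quoted above. The function and dictionary names are hypothetical and are not part of the controller firmware or any Sun tool; the status text is assumed to have been read by hand from the firmware menu.

```python
# Illustrative only: every name here is an assumption, not a firmware or Sun CLI API.

# Meaning of the two status strings listed in step 4.
STATUS_MEANING = {
    "FAILED DRV": "one failed drive",
    "FATAL FAIL": "two or more failed drives",
}


def describe_status(status: str) -> str:
    """Return the meaning of a logical drive status, or 'unknown'."""
    return STATUS_MEANING.get(status.strip().upper(), "unknown")


def pull_before_reset_recommended(status: str) -> bool:
    """True when the status is FATAL FAIL, the case this alert covers."""
    return status.strip().upper() == "FATAL FAIL"


if __name__ == "__main__":
    for s in ("FAILED DRV", "FATAL FAIL"):
        print(s, "->", describe_status(s),
              "| pull drive(s) before reset:", pull_before_reset_recommended(s))
```

The point of the second function is ordering: for a "FATAL FAIL" logical drive, the drive pull described in the Workaround comes before the shutdown and reset in step 6, not after.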


Workaround

A) For a logical drive configured for RAID 3 or 5, if the order of drive failure can be determined in the case of a double drive failure, then use the following steps:

1. Pull the original/first failed drive only. (The first failure can be determined via the controller event log.) Also note the location of the other bad drive(s).

2. Reset the controller. Use the "Shutdown Controller" menu option and choose "Yes" when the "Reset Controller?" prompt is displayed. When the system comes back up, view the logical drives and verify that the "FATAL FAIL" status has changed to "degraded".

3. If the logical drive has changed to "degraded," run fsck(1M) or equivalent.

(If the status is still "FATAL FAIL", you might have lost all data on the logical drive, and it might be necessary to re-create the logical drive. Follow step 8 in the Troubleshooting section 8.5 for this case).

4. After the fsck(1M) completes successfully, reinsert the pulled drive OR replace with a new (good) drive if the event log indicates the drive should be replaced.

5. Rebuild the logical drive.
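
Step 1 of procedure A depends on knowing which drive failed first. The sketch below shows one way to reason about that ordering once the failure times have been copied out of the controller event log by hand. The `DriveFailure` record, its fields, and the function name are hypothetical; the sketch does not read or parse the event log itself.

```python
from datetime import datetime
from typing import List, NamedTuple, Tuple


class DriveFailure(NamedTuple):
    """Hypothetical record transcribed manually from the controller event log."""
    when: datetime   # time the failure was logged
    channel: int     # drive channel (assumed field)
    target_id: int   # drive ID on that channel (assumed field)


def plan_raid3_or_5_pull(
    failures: List[DriveFailure],
) -> Tuple[DriveFailure, List[DriveFailure]]:
    """Return (drive to pull, other bad drives whose location should be noted)."""
    if len(failures) < 2:
        raise ValueError("procedure A applies to a double (or multiple) drive failure")
    ordered = sorted(failures, key=lambda f: f.when)
    if ordered[0].when == ordered[1].when:
        # Matches the case where the first failure cannot be determined.
        raise ValueError("failure order cannot be determined from these times")
    return ordered[0], ordered[1:]


if __name__ == "__main__":
    log = [
        DriveFailure(datetime(2006, 1, 10, 14, 5), channel=2, target_id=4),
        DriveFailure(datetime(2006, 1, 10, 9, 30), channel=2, target_id=1),
    ]
    pull, note = plan_raid3_or_5_pull(log)
    print("Pull:", pull)
    print("Note location of:", note)
```

Only the earliest failure is returned as the drive to pull; the later failures are returned separately because step 1 asks only that their location be noted.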

B) For a logical drive configured for RAID 1, if there is only one bad drive in a paired set, pull that drive and proceed to step 2. If both drives in a paired set are failed, then follow these steps:

1. Pull the original/first failed drive in each failed RAID 1 pair. (The first failure can be determined via the controller event log.) Also note the location of the other bad drive(s).

2. Reset the controller. Use the "Shutdown Controller" menu option and choose "Yes" when the "Reset Controller?" prompt is displayed. When the system comes back up, view the logical drives and verify that the "FATAL FAIL" status has changed to "degraded."

3. If the logical drive has changed to "degraded," run fsck(1M) or equivalent.

If the status is still "FATAL FAIL", you might have lost all data on the logical drive, and it might be necessary to re-create the logical drive. Follow step 8 in the Troubleshooting section 8.5 for this case.

4. After the fsck(1M) completes successfully, reinsert the pulled drive OR replace with a new (good) drive if the event log indicates the drive should be replaced.

5. Rebuild the logical drive.
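
For RAID 1, the selection in procedure B is made per mirrored pair. As a rough illustration, the sketch below picks the earliest failed drive in each pair that has at least one failure, which covers both the single-failure case and the both-drives-failed case described above. The `PairFailure` record and the pair numbering are assumptions made for the example, with failure times transcribed by hand from the controller event log.

```python
from datetime import datetime
from typing import Dict, List, NamedTuple


class PairFailure(NamedTuple):
    """Hypothetical record: one failed drive within a RAID 1 mirrored pair."""
    pair: int        # index of the mirrored pair (assumed numbering)
    drive: str       # label for the physical drive, e.g. "ch2-id1" (assumed)
    when: datetime   # failure time taken from the controller event log


def plan_raid1_pull(failures: List[PairFailure]) -> List[PairFailure]:
    """Return, for each pair with a failure, the drive that failed first."""
    earliest: Dict[int, PairFailure] = {}
    for f in failures:
        current = earliest.get(f.pair)
        if current is None or f.when < current.when:
            earliest[f.pair] = f
    # Sorted by pair index so the result is stable and easy to read.
    return [earliest[p] for p in sorted(earliest)]


if __name__ == "__main__":
    log = [
        PairFailure(0, "ch2-id0", datetime(2006, 1, 10, 11, 0)),
        PairFailure(0, "ch2-id1", datetime(2006, 1, 10, 8, 0)),
        PairFailure(1, "ch2-id2", datetime(2006, 1, 10, 9, 0)),
    ]
    for f in plan_raid1_pull(log):
        print("Pull from pair", f.pair, "->", f.drive)
```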

C) For a logical drive configured for RAID 5, if there is one missing drive, or multiple failures with only one failure in a paired set, then that drive can be replaced; otherwise, the first missing drive should be replaced only.

If it cannot be determined which drive failed first, run a file system check on the array after the reset, as there may be data inconsistencies.

Note: It is important that you check your recovered data using the application or host-based tools following a "FATAL FAIL" recovery.
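
The branch taken after the reset (steps 2 through 5 of procedures A and B) can be summarized as a small decision function. This is a sketch only: the function name is hypothetical, and the status string is assumed to have been read manually from the firmware's logical drive view.

```python
def next_action(status_after_reset: str) -> str:
    """Map the logical drive status seen after the reset to the next step."""
    status = status_after_reset.strip().upper()
    if status == "DEGRADED":
        # Steps 3 to 5: check the file system, then restore redundancy.
        return ("run fsck(1M) or equivalent; after it completes successfully, "
                "reinsert or replace the pulled drive and rebuild the logical drive")
    if status == "FATAL FAIL":
        # Data on the logical drive might be lost.
        return ("data might be lost; follow step 8 in Troubleshooting "
                "section 8.5 (re-create the logical drive if necessary)")
    return "status not covered by this procedure"


if __name__ == "__main__":
    for s in ("degraded", "FATAL FAIL"):
        print(s, "->", next_action(s))
```

In either recovered case, the note above still applies: verify the recovered data with the application or host-based tools.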


Resolution

A final resolution has been completed with the updated instructions in the "Sun StorEdge 3000 Family Installation, Operation, and Service Manual" (PN: 816-7300-19 or later) at http://docs.sun.com/app/docs?q=7300-19.




Modification History


Date: 26-APR-2006

26-Apr-2006:

  • Updated Contributing Factors and Resolution Sections




 
 
Article Details
Article ID : 200491
Article Type : Sun Alert
Last reviewed : 2006-05-02
Audience : PUBLIC