Recovery Behavior From Fatal Drive Failure May Lead to Data Integrity Issues



Category: Data Loss
Release Phase: Resolved
Product: Sun StorageTek 3310 SCSI Array
Sun StorageTek 3510 FC Array
Sun StorageTek 3320 SCSI Array
Sun StorageTek 3511 SATA Array
Bug Id: 5095223
Date of Workaround Release: 12-JAN-2006
Date of Resolved Release: 15-JUN-2006


Impact

The "Sun StorEdge 3000 Family Installation, Operation, and Service Manual - Sun StorEdge 3510 FC Array" states in Section 8.5 "Recovering From Fatal Drive Failure" that you can recover from a "Status: FATAL FAIL" (two or more failed drives) by simply resetting the controller or powering off the array. Following this procedure can lead to data integrity issues: due to the current internal resource handling, all cached data (including uncommitted write data) for a logical drive is discarded when the logical drive enters a "FATAL FAIL" state.

In the event of a fatally failed logical drive (two or more drive failures in a RAID 3 or RAID 5 logical drive), the current recovery process is to reset the controller. The reset causes one of the failed drives to be included back into the logical drive, changing the logical drive state to "Degraded". If a global spare is assigned, the logical drive will rebuild. If a global spare is not assigned, the user can assign a spare and rebuild the logical drive. If there were incomplete write operations at the time of a drive failure, this procedure could create inconsistent data.

The "Sun StorEdge 3000 Family Installation, Operation, and Service Manual" (part number 816-7300-17) can be found on docs.sun.com at http://docs.sun.com/app/docs?q=7300-17

Note: Please also see related Sun Alert 102098 - "Insufficient Information for Recovery From Double Drive Failure for Sun StorEdge 33x0/35xx Arrays"


Contributing Factors

This issue can occur on the following platforms, for all controller firmware releases that do not include the fix:

  • Sun StorEdge 3310 SCSI array without firmware 4.15F (as delivered in patch 113722-15)
  • Sun StorEdge 3320 SCSI array without firmware 4.15G (as delivered in patch 113730-01)
  • Sun StorEdge 3510 FC array without firmware 4.15F (as delivered in patch 113723-15)
  • Sun StorEdge 3511 SATA array without firmware 4.15F (as delivered in patch 113724-09)
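
The platform list above can be turned into a simple lookup. The following is a minimal sketch, assuming the installed patch ID is already known as a string such as "113723-14"; how that ID is obtained from the array or host is not covered here. The names FIXED_PATCHES and patch_includes_fix are hypothetical and are not part of any Sun tool.

```python
# Patch base number and minimum revision that deliver the fixed firmware,
# taken from the platform list in this alert.
FIXED_PATCHES = {
    "3310": ("113722", 15),  # firmware 4.15F
    "3320": ("113730", 1),   # firmware 4.15G
    "3510": ("113723", 15),  # firmware 4.15F
    "3511": ("113724", 9),   # firmware 4.15F
}


def patch_includes_fix(model: str, installed_patch: str) -> bool:
    """Return True if installed_patch is at or beyond the fixed revision.

    model is one of the keys of FIXED_PATCHES (hypothetical naming);
    installed_patch has the form "<base>-<revision>".
    """
    fixed_base, fixed_rev = FIXED_PATCHES[model]
    base, _, rev = installed_patch.partition("-")
    if base != fixed_base:
        raise ValueError(
            f"patch {installed_patch} is not the firmware patch for model {model}"
        )
    return int(rev) >= fixed_rev
```

For example, patch_includes_fix("3510", "113723-14") returns False, indicating that the array may still be exposed to this issue.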


Symptoms

The first drive to fail in a logical drive is persistently marked as BAD, while subsequent drives that fail (before the first drive has been fully reconstructed) are temporarily marked as MISSING. If the failed drives are members of the same parity group, the owning logical drive is marked as "FATAL FAIL," and any existing uncommitted write data is discarded in order to recover data cache resources.

Upon reset, the array will attempt to recover MISSING drives automatically, and if possible, will restore the logical drive to "Degraded" status.

The logical drive is restored, if possible, whether or not any uncommitted write data was discarded. The exposure window is mainly whole-site power outages that occur after the second drive failure, because user applications may then restart automatically at the same time as the array resets. This situation increases the probability that the server or application will not notice that the logical drive went away and then returned with stale data.
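
The sequence described in this section can be illustrated with a small model. This is a toy sketch for illustration only and is not the controller firmware logic. It assumes a single parity group, assumes the reset always succeeds in recovering MISSING drives, and uses "ONLINE" and "Good" as made-up names for healthy states.

```python
from dataclasses import dataclass


@dataclass
class LogicalDrive:
    """Toy model of one RAID 3/5 logical drive (single parity group)."""
    drives: dict                 # drive name -> "ONLINE", "BAD" or "MISSING"
    status: str = "Good"         # "Good" is an assumed name for the healthy state
    uncommitted_writes: int = 0  # writes still held in controller cache
    discarded_writes: int = 0    # writes dropped on entering FATAL FAIL

    def fail_drive(self, name: str) -> None:
        already_failed = any(s != "ONLINE" for s in self.drives.values())
        if not already_failed:
            # First failure: marked BAD persistently, logical drive degrades.
            self.drives[name] = "BAD"
            self.status = "Degraded"
        else:
            # Later failure before reconstruction: marked MISSING temporarily.
            self.drives[name] = "MISSING"
            self.status = "FATAL FAIL"
            # Cached write data is discarded to recover cache resources.
            self.discarded_writes += self.uncommitted_writes
            self.uncommitted_writes = 0

    def reset(self) -> None:
        # Assumption: every MISSING drive can be recovered on reset.
        for name, state in self.drives.items():
            if state == "MISSING":
                self.drives[name] = "ONLINE"
        if self.status == "FATAL FAIL":
            self.status = "Degraded"


ld = LogicalDrive({"d0": "ONLINE", "d1": "ONLINE", "d2": "ONLINE"},
                  uncommitted_writes=3)
ld.fail_drive("d0")
ld.fail_drive("d1")
ld.reset()
print(ld.status, ld.drives, ld.discarded_writes)
```

After the reset the model reports "Degraded" again, yet the discarded write count stays non-zero. That is the condition in which a host may resume I/O against stale data.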


Workaround

The risk of data loss can be minimized by ensuring that an unused hot spare is available and/or that the first failed drive is replaced as soon as possible. This allows the rebuild process to start and finish as early as possible, which shortens the exposure window.

Unmapping the logical drive while it is in the "FATAL FAIL" state should prevent any hosts from attempting to make use of the logical drive automatically after a reset.

If the logical drive is recovered from a "FATAL FAIL" state, it is recommended that the appropriate data integrity verification utility (e.g. fsck, chkdsk) be run before any application makes use of the logical drive.

Note: A clean file system check only verifies the file system structure and does NOT guarantee user data validity.

The proper use of data integrity features offered by modern databases, file systems and other applications will help ensure that user applications catch any potential data loss and can initiate higher-level recovery procedures, thereby minimizing the effects.
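
Where an application has no built-in integrity feature, a checksum manifest is one way to detect changed or missing user data after a recovery. The following is a minimal sketch, assuming a manifest was recorded before the failure and is stored somewhere other than the affected logical drive. The function names are hypothetical, and files legitimately modified after the manifest was taken will also be reported.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (relative path) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_mismatches(root, manifest):
    """Return the relative paths that are missing or whose digest differs."""
    root = Path(root)
    bad = []
    for rel, expected in sorted(manifest.items()):
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad
```

A typical use is to call build_manifest() on a schedule while the logical drive is healthy, and to call find_mismatches() after recovering from "FATAL FAIL" and running the file system check, before applications are restarted.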


Resolution

This issue is addressed on the following platforms:

  • Sun StorEdge 3310 SCSI array with firmware 4.15F (as delivered in patch 113722-15 or later)
  • Sun StorEdge 3320 SCSI array with firmware 4.15G (as delivered in patch 113730-01 or later)
  • Sun StorEdge 3510 FC array with firmware 4.15F (as delivered in patch 113723-15 or later)
  • Sun StorEdge 3511 SATA array with firmware 4.15F (as delivered in patch 113724-09 or later)



Modification History


25-Apr-2006:

  • Updated Contributing Factors and Resolution sections

15-Jun-2006:

  • Updated Contributing Factors and Resolution sections



Attachments
This solution has no attachment

 
 
Article Details
Article ID : 200021
Article Type : Sun Alert
Last reviewed : 2006-06-15
Audience : PUBLIC