Disks May be Marked as Bad Without Explanation After "Drive Failure," "Media Scan Failed" or "Clone Failed" Events



Category: Data Loss
Release Phase: Resolved
Product: Sun StorageTek 3310 SCSI Array, Sun StorageTek 3510 FC Array, Sun StorageTek 3320 SCSI Array, Sun StorageTek 3511 SATA Array
Bug Id: 6346306
Date of Workaround Release: 12-JAN-2006
Date of Resolved Release: 15-JUN-2006


Impact

On rare occasions, Sun StorEdge 3310, 3320, 3510 or 3511 arrays running current 4.x firmware versions may report one (or, exceptionally, more than one) disk drive as "bad" with no warning or explanation other than a "Drive Failure," "Media Scan Failed" or "Clone Failed" message. If the system administrator does not notice such an event on a single disk drive quickly (especially in array configurations without a "hot spare" disk drive assigned), the array is susceptible to data loss if any other disk drive is similarly reported as "bad" by the array controller.

In exceptionally rare cases, more than one disk drive might be reported as "bad" within a short space of time. Most RAID configurations cannot cope with the loss of more than one disk from an LD (Logical Drive) within a short space of time (i.e. before reconstruction to a hot spare has completed). More than one disk drive being marked as "bad" within a single LD could therefore lead to data loss if that LD transitions to a status of "Fatal Fail."


Contributing Factors

This issue can occur on the following platforms running any current 4.x release of controller firmware:

  • Sun StorEdge 3310 SCSI array without firmware 4.15F (as delivered in patch 113722-15)
  • Sun StorEdge 3320 SCSI array without firmware 4.15G (as delivered in patch 113730-01)
  • Sun StorEdge 3510 FC array without firmware 4.15F (as delivered in patch 113723-15)
  • Sun StorEdge 3511 SATA array without firmware 4.15F (as delivered in patch 113724-09)

Note: This behavior has not been seen with earlier 3.x firmware revisions.


Symptoms

Any array controller event that shows "Drive Failure," "Media Scan Failed" or "Clone Failed," with no immediately preceding warning messages for the disk drive named in that event, is an occurrence of this issue, as in the following example:

    Sun Dec 11 16:10:17 2005
    [Primary]       Notification
    NOTICE: Media Scan of CHL:2 ID:8 Completed

    Tue Dec 13 01:27:42 2005
    [Primary]       Alert
    LG:1 Logical Drive ALERT: CHL:2 ID:7  Drive Failure

    Tue Dec 13 01:27:45 2005
    [Primary]       Notification
    LG:1 Logical Drive NOTICE: Starting Rebuild

    Tue Dec 13 10:04:21 2005
    [Primary]       Notification
    Rebuild of Logical Drive 1 Completed

In the above example, no warning messages were logged between Sun Dec 11 16:10:17 and Tue Dec 13 01:27:42, when the disk drive at ID:7 was reported as "Drive Failure," and nothing in the log explains why the controller marked the drive as "bad." In this case (as in most cases), only a single disk was reported as "Drive Failure." This array had a hot spare disk drive configured, so the rebuild started immediately and completed some hours later, as expected.

In the worst case, a series of events like the following example may be seen, leading to data loss:

    Wed Dec  7 09:30:09 2005
    [Primary]       Notification
    On-Line Initialization of Logical Drive 1 Completed

    Thu Dec 15 13:59:48 2005
    [Primary]       Alert
    LG:0 Logical Drive ALERT: CHL:2 ID:0  Drive Failure

    Thu Dec 15 13:59:51 2005
    [Primary]       Notification
    LG:0 Logical Drive NOTICE: Starting Rebuild

    Thu Dec 15 14:00:29 2005
    [Primary]       Alert
    LG:0 Logical Drive ALERT: CHL:2 ID:2  Drive Failure

    Thu Dec 15 14:00:29 2005
    [Primary]       Alert
    LG:0 Logical Drive ALERT: Rebuild Failed

Again, several days (8 days in this case) had passed without error messages. Then, at Thu Dec 15 13:59:48, the disk drive at ID:0 was reported as "Drive Failure" without further explanation. The array had a "hot spare" disk drive configured, and the rebuild started as expected. However, at Thu Dec 15 14:00:29, a second disk drive at ID:2 was also reported as "Drive Failure." Because the rebuild onto the "hot spare" disk drive had not yet completed when the second disk was reported as "Drive Failure," LD:0 had lost 2 disk drives (IDs 0 and 2) from its RAID-5 configuration and no longer had enough redundancy to continue. Host access to that LD was lost.
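
The timing behind this worst case can be checked with a little arithmetic on the timestamps in the examples above. The following Python sketch is illustrative only: the helper names (ts, second_failure_is_fatal) are hypothetical, the rebuild duration is taken from the first example (which may come from a different array), and the rebuild is assumed to start at the moment of the first failure.

```python
from datetime import datetime

FMT = "%b %d %H:%M:%S %Y"  # weekday name dropped; month names assume an English locale


def ts(text):
    """Parse a controller timestamp such as 'Dec 15 13:59:48 2005'."""
    return datetime.strptime(text, FMT)


def second_failure_is_fatal(first_failure, second_failure, rebuild_time):
    """True if a second drive in the same LD fails before the rebuild can finish.

    Assumes a single-redundancy LD (such as RAID-5) and a rebuild that starts
    at the moment of the first failure; both are simplifications.
    """
    return second_failure - first_failure < rebuild_time


# Rebuild duration observed in the first example (start and completion events).
rebuild_time = ts("Dec 13 10:04:21 2005") - ts("Dec 13 01:27:45 2005")

# Gap between the two "Drive Failure" alerts in the worst-case example.
first = ts("Dec 15 13:59:48 2005")
second = ts("Dec 15 14:00:29 2005")
gap = second - first

print("rebuild time:", rebuild_time)
print("gap between failures:", gap)
print("fatal:", second_failure_is_fatal(first, second, rebuild_time))
```

With a rebuild that takes hours and a second failure that arrives 41 seconds after the first, the hot spare cannot restore redundancy in time.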

For comparison, the following "Drive Failure" event is not an example of this issue, because the events logged immediately before it clearly show that the disk drive had a genuine fault:

    Sun Dec  6 17:20:19 2005
    [Primary]     Warning
    CHL:2 ID:0  Drive ALERT: Drive HW Error (4C4)

    Sun Dec  6 17:20:19 2005
    [Primary]     Warning
    CHL:2 ID:0  Drive ALERT: Drive HW Error (4C4)

    Sun Dec  6 17:20:19 2005
    [Primary]       Alert
    LG:1 Logical Drive ALERT: CHL:2 ID:0  Drive Failure
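
The distinction drawn by these examples can be sketched as a small log filter. The Python below is a minimal sketch, not a Sun-supplied tool: it assumes the three-line event layout shown in the excerpts above, an English locale for month names, and a hypothetical one-hour window as the meaning of "immediately preceding" (this document does not define one). The function names are illustrative.

```python
import re
from datetime import datetime, timedelta

TRIGGERS = ("Drive Failure", "Media Scan Failed", "Clone Failed")
DRIVE_RE = re.compile(r"CHL:(\d+) ID:(\d+)")


def drive_of(message):
    """Return the (channel, id) pair named in a message, or None."""
    match = DRIVE_RE.search(message)
    return match.groups() if match else None


def parse_events(log_text):
    """Split an event log into (timestamp, severity, message) tuples."""
    events = []
    for block in re.split(r"\n\s*\n", log_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 3:
            continue  # not a complete three-line entry
        # Drop the weekday name and collapse padded day numbers ("Dec  7").
        stamp_text = " ".join(lines[0].split()[1:])
        stamp = datetime.strptime(stamp_text, "%b %d %H:%M:%S %Y")
        severity = lines[1].split()[-1]  # "Alert", "Warning" or "Notification"
        events.append((stamp, severity, " ".join(lines[2:])))
    return events


def unexplained_failures(events, window=timedelta(hours=1)):
    """Return trigger alerts with no Warning for the same drive within `window`.

    `window` is an assumed definition of "immediately preceding".
    Events are expected in chronological order.
    """
    flagged = []
    for i, (stamp, severity, message) in enumerate(events):
        drive = drive_of(message)
        if severity != "Alert" or drive is None:
            continue
        if not any(trigger in message for trigger in TRIGGERS):
            continue
        explained = any(
            prev_sev == "Warning"
            and drive_of(prev_msg) == drive
            and timedelta(0) <= stamp - prev_stamp <= window
            for prev_stamp, prev_sev, prev_msg in events[:i]
        )
        if not explained:
            flagged.append((stamp, message))
    return flagged


UNEXPLAINED = """
    Sun Dec 11 16:10:17 2005
    [Primary]       Notification
    NOTICE: Media Scan of CHL:2 ID:8 Completed

    Tue Dec 13 01:27:42 2005
    [Primary]       Alert
    LG:1 Logical Drive ALERT: CHL:2 ID:7  Drive Failure
"""

GENUINE = """
    Sun Dec  6 17:20:19 2005
    [Primary]     Warning
    CHL:2 ID:0  Drive ALERT: Drive HW Error (4C4)

    Sun Dec  6 17:20:19 2005
    [Primary]       Alert
    LG:1 Logical Drive ALERT: CHL:2 ID:0  Drive Failure
"""

for name, text in (("unexplained", UNEXPLAINED), ("genuine", GENUINE)):
    print(name, unexplained_failures(parse_events(text)))
```

Under these assumptions, the ID:7 failure from the first example is flagged, while the ID:0 failure preceded by "Drive HW Error" warnings is not. Flagged events should still be handled as genuine drive failures, as the Workaround section states.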

Workaround

There is currently no way to predict or prevent the array controller marking a disk drive as "bad" with a "Drive Failure," "Media Scan Failed" or "Clone Failed" message and no immediately preceding messages to explain why. Until further notice, therefore, treat these events as genuine disk drive failures (some of them probably are, although others may not be).


Resolution

This issue is addressed on the following platforms:

  • Sun StorEdge 3310 SCSI array with firmware 4.15F (as delivered in patch 113722-15 or later)
  • Sun StorEdge 3320 SCSI array with firmware 4.15G (as delivered in patch 113730-01 or later)
  • Sun StorEdge 3510 FC array with firmware 4.15F (as delivered in patch 113723-15 or later)
  • Sun StorEdge 3511 SATA array with firmware 4.15F (as delivered in patch 113724-09 or later)  
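
The list above can be expressed as a small lookup for checking whether an installed patch is at or beyond the fixed revision. This Python sketch is illustrative: the names FIXED and has_fix are hypothetical, and it only compares patch revision numbers. It does not confirm that the firmware from the patch has actually been loaded onto the array controller.

```python
# Model -> (fixed firmware, patch base ID, minimum patch revision), from the list above.
FIXED = {
    "3310": ("4.15F", "113722", 15),
    "3320": ("4.15G", "113730", 1),
    "3510": ("4.15F", "113723", 15),
    "3511": ("4.15F", "113724", 9),
}


def has_fix(model, patch_id):
    """Return True if `patch_id` (e.g. '113723-15') is the fixed revision or later."""
    _firmware, base, min_rev = FIXED[model]
    got_base, sep, got_rev = patch_id.partition("-")
    if not sep or got_base != base:
        raise ValueError(f"{patch_id!r} is not a firmware patch for the {model}")
    return int(got_rev) >= min_rev


print(has_fix("3510", "113723-15"))
print(has_fix("3511", "113724-08"))
```

A mismatched patch base raises ValueError rather than returning False, so that a patch for the wrong array model is not mistaken for an outdated one.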



Modification History


25-Apr-2006:

  • Updated Contributing Factors and Resolution sections

15-Jun-2006:

  • Updated Contributing Factors and Resolution sections



Attachments
This solution has no attachment

 
 
Article Details
Article ID : 201137
Article Type : Sun Alert
Last reviewed : 2006-06-15
Audience : PUBLIC