Disks May be Marked as Bad Without Explanation After "Drive Failure," "Media Scan Failed" or "Clone Failed" Events

Category: Data Loss
Release Phase: Resolved
Product: Sun StorageTek 3310 SCSI Array, Sun StorageTek 3510 FC Array, Sun StorageTek 3320 SCSI Array, Sun StorageTek 3511 SATA Array
Bug Id: 6346306
Date of Workaround Release: 12-JAN-2006
Date of Resolved Release: 15-JUN-2006

Impact
On rare occasions, Sun StorEdge 3310, 3320, 3510 or 3511 arrays running current 4.x firmware versions may report one (or, exceptionally, more than one) disk drive as "bad" with no warning or explanation other than a "Drive Failure," "Media Scan Failed" or "Clone Failed" message. If the system administrator does not notice such an event promptly, especially in array configurations with no "hot spare" disk drive assigned, the array is susceptible to data loss should any other disk drive be similarly reported as "bad" by the array controller.
In exceptionally rare cases, more than one disk drive might be reported as "bad" within a short space of time. Most RAID configurations cannot tolerate the loss of more than one disk from an LD (Logical Drive) before reconstruction to a hot spare has completed. If more than one disk drive within a single LD is marked as "bad" in that interval, the LD may transition to a status of "Fatal Fail," resulting in data loss.
Contributing Factors
This issue can occur on the following platforms, for all current 4.x releases of controller firmware:
- Sun StorEdge 3310 SCSI array without firmware 4.15F (as delivered in patch 113722-15)
- Sun StorEdge 3320 SCSI array without firmware 4.15G (as delivered in patch 113730-01)
- Sun StorEdge 3510 FC array without firmware 4.15F (as delivered in patch 113723-15)
- Sun StorEdge 3511 SATA array without firmware 4.15F (as delivered in patch 113724-09)
Note: This behavior has not been seen with earlier 3.x firmware revisions.
Symptoms
Any array controller event that shows "Drive Failure," "Media Scan Failed" or "Clone Failed" without any immediately preceding warning messages for the disk drive named in that event is an occurrence of this issue, as in the following example:
Sun Dec 11 16:10:17 2005
[Primary] Notification
NOTICE: Media Scan of CHL:2 ID:8 Completed
Tue Dec 13 01:27:42 2005
[Primary] Alert
LG:1 Logical Drive ALERT: CHL:2 ID:7 Drive Failure
Tue Dec 13 01:27:45 2005
[Primary] Notification
LG:1 Logical Drive NOTICE: Starting Rebuild
Tue Dec 13 10:04:21 2005
[Primary] Notification
Rebuild of Logical Drive 1 Completed
In the above example, there were no warning messages between Sun Dec 11 16:10:17 and Tue Dec 13 01:27:42, when the disk drive at ID:7 was reported as "Drive Failure," and there is no explanation of why the controller marked the drive as "bad." In this case (as in most cases), only a single disk was reported as "Drive Failure." This array had a hot spare disk drive configured, so the rebuild started immediately and completed some hours later, as expected.
In the worst case, a series of events like the following example may be seen, leading to data loss:
Wed Dec 7 09:30:09 2005
[Primary] Notification
On-Line Initialization of Logical Drive 1 Completed
Thu Dec 15 13:59:48 2005
[Primary] Alert
LG:0 Logical Drive ALERT: CHL:2 ID:0 Drive Failure
Thu Dec 15 13:59:51 2005
[Primary] Notification
LG:0 Logical Drive NOTICE: Starting Rebuild
Thu Dec 15 14:00:29 2005
[Primary] Alert
LG:0 Logical Drive ALERT: CHL:2 ID:2 Drive Failure
Thu Dec 15 14:00:29 2005
[Primary] Alert
LG:0 Logical Drive ALERT: Rebuild Failed
Again, there had been several days (8 days in this case) without error messages. Then, at Thu Dec 15 13:59:48, the disk drive at ID:0 was reported as "Drive Failure" without further explanation. The array had a "hot spare" disk drive configured, and the rebuild started as expected. However, at Thu Dec 15 14:00:29, a second disk drive at ID:2 was also reported as "Drive Failure." Because the rebuild onto the "hot spare" disk drive had not yet completed when the second disk was reported, Logical Drive 0 (LG:0 in the log) had lost 2 disk drives (IDs 0 and 2) from its RAID-5 configuration and no longer had enough redundancy to continue. Host access to that LD was lost.
For comparison, the following "Drive Failure" event is not an example of this issue, because the events logged before it clearly show that the disk drive had a genuine fault:
Sun Dec 6 17:20:19 2005
[Primary] Warning
CHL:2 ID:0 Drive ALERT: Drive HW Error (4C4)
Sun Dec 6 17:20:19 2005
[Primary] Warning
CHL:2 ID:0 Drive ALERT: Drive HW Error (4C4)
Sun Dec 6 17:20:19 2005
[Primary] Alert
LG:1 Logical Drive ALERT: CHL:2 ID:0 Drive Failure
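The distinction drawn above, between a failure event with no preceding warning for the same drive and one that follows genuine hardware errors, can be checked mechanically against a saved copy of the controller event log. The following Python sketch is illustrative only and is not a Sun-supplied tool. It assumes the three-line record layout shown in the examples (timestamp, controller and severity, message). The function names and the 5-minute window used to approximate "immediately preceding" are assumptions, as is the choice to treat only "Warning" events as an explanation.

```python
import re
from datetime import datetime, timedelta

# Messages that mark a drive as "bad", as listed in the Symptoms section.
FAILURE_KEYWORDS = ("Drive Failure", "Media Scan Failed", "Clone Failed")
DRIVE_RE = re.compile(r"CHL:(\d+) ID:(\d+)")


def parse_events(text):
    """Parse three-line records: timestamp, [controller] severity, message."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    events = []
    for i in range(0, len(lines) - 2, 3):
        # Skip the weekday token and parse the rest, e.g. "Dec 13 01:27:42 2005".
        stamp = " ".join(lines[i].split()[1:])
        when = datetime.strptime(stamp, "%b %d %H:%M:%S %Y")
        severity = lines[i + 1].split("]", 1)[1].strip()
        events.append((when, severity, lines[i + 2]))
    return events


def drive_of(message):
    """Return (channel, id) named in a message, or None if there is none."""
    match = DRIVE_RE.search(message)
    return match.groups() if match else None


def unexplained_failures(events, window=timedelta(minutes=5)):
    """Return failure events with no Warning for the same drive within `window`.

    The window length is an assumption; the alert only says "immediately
    preceding" and does not define an interval.
    """
    flagged = []
    for idx, (when, _severity, message) in enumerate(events):
        drive = drive_of(message)
        if drive is None or not any(k in message for k in FAILURE_KEYWORDS):
            continue
        explained = any(
            prev_sev == "Warning"
            and drive_of(prev_msg) == drive
            and when - prev_when <= window
            for prev_when, prev_sev, prev_msg in events[:idx]
        )
        if not explained:
            flagged.append((when, message))
    return flagged


if __name__ == "__main__":
    sample = """
    Sun Dec 11 16:10:17 2005
    [Primary] Notification
    NOTICE: Media Scan of CHL:2 ID:8 Completed
    Tue Dec 13 01:27:42 2005
    [Primary] Alert
    LG:1 Logical Drive ALERT: CHL:2 ID:7 Drive Failure
    """
    for when, message in unexplained_failures(parse_events(sample)):
        print(when, message)
```

Run against the first example, the sketch flags the CHL:2 ID:7 event. Run against the comparison example, it flags nothing, because the "Drive HW Error" warnings name the same drive and share the failure's timestamp.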
Workaround
There is currently no way to predict or prevent a disk drive being marked as "bad" by the array controller with one of the messages "Drive Failure," "Media Scan Failed" or "Clone Failed" and with no immediately preceding messages to explain why. Therefore, until further notice, treat these events as genuine disk drive failures (as some of them probably are, although others may not be).
Resolution
This issue is addressed on the following platforms:
- Sun StorEdge 3310 SCSI array with firmware 4.15F (as delivered in patch 113722-15 or later)
- Sun StorEdge 3320 SCSI array with firmware 4.15G (as delivered in patch 113730-01 or later)
- Sun StorEdge 3510 FC array with firmware 4.15F (as delivered in patch 113723-15 or later)
- Sun StorEdge 3511 SATA array with firmware 4.15F (as delivered in patch 113724-09 or later)
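For sites tracking several arrays, the platform list above can be expressed as a small lookup. The Python sketch below is illustrative only. The model keys, function names and version-ordering rule (major number, minor number, then trailing letter) are assumptions made for the example; the firmware and patch values are those listed above. Firmware 3.x is treated as unaffected because the behavior has not been seen with those revisions.

```python
import re

# Fixed firmware release and delivering patch per model, from the list above.
FIXED_FIRMWARE = {
    "3310": ("4.15F", "113722-15"),
    "3320": ("4.15G", "113730-01"),
    "3510": ("4.15F", "113723-15"),
    "3511": ("4.15F", "113724-09"),
}

VERSION_RE = re.compile(r"^(\d+)\.(\d+)([A-Z]?)$")


def version_key(version):
    """Split a string such as '4.15F' into a sortable (4, 15, 'F') tuple.

    Ordering by (major, minor, letter) is an assumption of this sketch.
    """
    match = VERSION_RE.match(version.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized firmware version: {version!r}")
    return (int(match.group(1)), int(match.group(2)), match.group(3))


def is_exposed(model, firmware):
    """Return True if a 4.x firmware predates the fixed release for `model`."""
    fixed, _patch = FIXED_FIRMWARE[model]
    key = version_key(firmware)
    return key[0] == 4 and key < version_key(fixed)


def required_patch(model):
    """Return the minimum patch revision listed for `model`."""
    return FIXED_FIRMWARE[model][1]
```

For example, `is_exposed("3320", "4.15F")` returns True under these assumptions, because the 3320 fix is delivered in 4.15G, whereas `is_exposed("3510", "4.15F")` returns False.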
Modification History
25-Apr-2006:
- Updated Contributing Factors and Resolution sections
15-Jun-2006:
- Updated Contributing Factors and Resolution sections
Attachments
This solution has no attachment