hard drive fail^2, and how Drobo will save me. (I hope)
So a couple years ago, I had a hard drive fail in my home RAID array. The array is a "7 disk" array, with one of those 7 being a cold spare. There's no hot spare; the remaining 6 drives are RAID5'd together. This isn't a business-critical array, so I'm OK with the small risk and downtime that leaves.
At the time I did the same thing I always do when this type of failure happens... I shut down the array immediately upon seeing a dead drive, replaced the dead drive with the cold spare as soon as I could (the following weekend, I think), and brought the array back up to let it rebuild onto the new drive.
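The swap itself is the usual software-RAID dance; on a Linux md array it looks roughly like this (device names are just examples, not the real ones):

    # mark the dead drive failed and drop it from the array
    mdadm /dev/md0 --fail /dev/hde1 --remove /dev/hde1
    # after physically swapping in the spare, add it and let the rebuild run
    mdadm /dev/md0 --add /dev/hde1
    # watch the rebuild progress
    cat /proc/mdstat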
It turns out that the failed drive was still under warranty! So I did the Hitachi RMA dance and got a replacement. When I first got the replacement a month later, I checked the serial number for warranty coverage and was told it had expired. Hitachi fixed that up so that I got a 2-year warranty from the date they sent it out to me.
Said warranty ended November 2009.
Said hard drive sat, unopened, in its shipping container next to the server containing the RAID from then until last week, when another Hitachi DeskStar died in the array. What should have been a 45-minute exercise to get the RAID back went south about 40 minutes into it. The "new" drive... the cold spare... dead. For all I know it was dead when it first got here. I see no reason to think it wasn't.
Now I'm sure that if I got in touch with Hitachi Global Storage, they would eventually, grudgingly, send me another Death Star. But frankly I don't *want* any more Hitachi drives. Sorry guys. (I know a good number of folks from the team that got sold off to Hitachi along with IBM's HDD business.)
So there are two routes to rebuilding this... I could buy yet another 300ish gig PATA drive to replace the one that failed, or I could buy a nice shiny new 2TB SATA drive and start the migration to Drobo, one drive at a time. It turns out the 2TB drive isn't all that much more expensive than the 320GB PATA drives I've been using as replacements for the last 3 failed drives, so that was a pretty easy answer. (The server the RAID is in happens to have a pair of SATA ports on the mobo that I'm not using... since the server OS loads from USB flash drives and the RAID array is all PATA... so there's zero additional cost associated with that idea.)
I grabbed a nice 2TB WD drive, and even splurged the extra $20 for double the onboard cache. I'll carve a 300GB partition out of it and put that into the RAID set as the replacement element, then use the remaining 1.7TB (more than the entire RAID array!) as scratch space for testing another filesystem, maybe for migration. (ext4? btrfs? I'd consider ZFS if it weren't dead. :( )
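The carve-up itself is nothing exotic; on a Linux box it would go roughly like this (device names and the md device are examples, so don't run this against anything you care about without double-checking):

    # partition the new 2TB drive: a ~300GB slice for the RAID set, the rest as scratch
    parted -s /dev/sdb mklabel gpt
    parted -s /dev/sdb mkpart primary 1MiB 300GiB
    parted -s /dev/sdb mkpart primary 300GiB 100%
    # add the small slice to the degraded array and let it rebuild
    mdadm /dev/md0 --add /dev/sdb1
    # format the big slice for filesystem testing (ext4 as one candidate)
    mkfs.ext4 -L scratch /dev/sdb2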
So now I've got a disk that a Drobo can use... eventually I'll buy a second one (probably when another drive in the array dies). Then I'll buy a Drobo, move all the data from the array to one of these drives (into the 1.7TB partition not used by the array), pop the second drive into the Drobo, move the data over to it, then pop the other drive into the Drobo and let it do its thing.
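The data shuffle at each step is just a straight copy; something like this, with made-up mount points:

    # copy everything from the old array onto the 1.7TB scratch partition,
    # preserving hard links, ACLs, and extended attributes
    rsync -aHAX --progress /mnt/raid/ /mnt/scratch/
    # later, repeat in the other direction onto the Drobo volume
    rsync -aHAX --progress /mnt/scratch/ /mnt/drobo/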
Unless someone wants to buy a really nice 6-bay server case with a low-power system in it and a great fan setup for cooling hard drives. :) Then I'll just get the Drobo and a DroboShare sooner. (About all that's left of the server is storage. I've retired all the other things this server did, or moved them to the cloud. Its sole purpose these days is storage... which it does pretty well.)
Now hopefully by then, Drobo will have a newer DroboShare that's compatible with the bigger arrays like the S... and connects via eSATA or FW800. And hopefully they'll also have a larger enclosure (6-8 drives) that doesn't have all the extra enterprise features... or has the DroboShare built in.
2 Comments:
I've had nearly the same thing happen to me. Had a 72GB WD Raptor replaced under warranty, and got a "recertified" (i.e. known to have been broken at some point) drive back. Started using it on another system months later and it died HARD within 2 weeks. As in, it wouldn't even probe on the bus.
Lesson learned: Don't assume that any drives are actually working. New or replaced.
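A quick burn-in before trusting a drive helps; something along these lines (device name is an example, and the badblocks write test wipes the drive):

    # destructive read/write surface scan -- only on a drive with nothing on it
    badblocks -wsv /dev/sdc
    # kick off the drive's own extended self-test, then check the results later
    smartctl -t long /dev/sdc
    smartctl -a /dev/sdc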
Oh, and my experience with a WD Caviar Green (1TB): silent data corruption, no SMART errors, no SATA-level ECC errors. Just lots and lots of bad data coming back from it. Must have been an issue with the onboard cache, since everything else is (or should be) checksummed.
RAID didn't save me, since it doesn't actually checksum data. (Well, RAID6 on Linux will check it once a month now.) :(
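(The monthly pass is the md "check" scrub; it can also be kicked off by hand, assuming the array is /dev/md0:)

    # ask md to read and verify every stripe on the array
    echo check > /sys/block/md0/md/sync_action
    # any inconsistencies found show up here
    cat /sys/block/md0/md/mismatch_cnt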
If you want to run WD Caviar Green drives in a NAS or RAID storage, you need to make sure you disable some of the green features... like IntelliSleep or whatever they call it, by running WDIDLE3, and also configure time-limited error recovery (TLER) using WDTLER.
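On a Linux box the same knobs can usually be reached without the DOS tools, at least on drives that support them; treat this as a sketch rather than gospel, and the device name is an example:

    # set the error-recovery timeout to 7 seconds (TLER-equivalent) on drives that support SCT ERC
    smartctl -l scterc,70,70 /dev/sdd
    # disable the aggressive head-parking (idle3) timer with idle3-tools;
    # takes effect after the drive is power-cycled
    idle3ctl -d /dev/sdd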