This morning, after waking up to lots of thunder and lightning, I got a text message saying my raid5 array had failed. Only this time, 2 of the 3 drives were missing. Since both of those drives are actually exported via vblade from a different physical machine, I assumed that server had freaked out during a power surge. I quickly rebooted it to bring the vblade exports back, but then the trouble started.
At some point, the array was "started" but had two faulty drives. I tried --remove and --add to remove and re-add the "faulty" drives (roughly the commands sketched below). That brought the array back "online", but with all of the drives listed as spares. I removed the drives again and tried the trick I used last time:
mdadm --assemble -f /dev/md0 /dev/sda2 /dev/etherd/e4.1 /dev/etherd/e4.2
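For the record, the remove/re-add step mentioned above was roughly this, run once per "faulty" drive; this is a sketch with my device names, not the exact commands I typed:

mdadm /dev/md0 --remove /dev/etherd/e4.1
mdadm /dev/md0 --add /dev/etherd/e4.1

Since the array had already lost too many members to rebuild, the re-added drives could only sit there as spares.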
However, the forced assemble didn't work either. It showed the array with /dev/sda2 and /dev/etherd/e4.2 as spares, and e4.1 was nowhere to be seen. At this point I was more than a little worried that I had done something to trash the array. That's when a Google search led me to this handy command:
mdadm -E /dev/sda2
This prints out the RAID superblock information stored on the drive itself. Running it on each drive told me that the e4.2 drive had not actually been damaged, since its superblock was still readable. The UUIDs on all three drives still matched, too. However, the bottom section of the report (the per-device status table) differed between the drives.
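To compare all of the members at once, something like this does the trick (a quick sketch; the grep matches the UUID lines in both the 0.90 and 1.x metadata formats):

for d in /dev/sda2 /dev/etherd/e4.1 /dev/etherd/e4.2; do echo "== $d"; mdadm -E $d | grep -i uuid; done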
A few Google searches later, I came across this:
mdadm --create --assume-clean --level=5 --raid-devices=3 /dev/md0 /dev/sda2 /dev/etherd/e4.2 /dev/etherd/e4.1
The --assume-clean flag tells mdadm to skip the initial resync, so it doesn't start rewriting parity across the drives. What I didn't realize, though, is that --create still writes fresh superblocks, so the array ends up with a new UUID. That command brought the array back online, at least according to /proc/mdstat, but when I tried to mount it, it couldn't figure out the filesystem.
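In hindsight, when you're not sure a recreated array is right, it's safer to poke at it read-only instead of doing a normal mount. A sketch, assuming an ext3 filesystem on /dev/md0 and a scratch mount point at /mnt/test:

fsck.ext3 -n /dev/md0
mount -o ro /dev/md0 /mnt/test

The -n flag makes fsck open the filesystem read-only and answer "no" to every prompt, so nothing gets written to the array while you're still unsure whether you reassembled it correctly.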
That's when I realized that the order in which you specify the drives to the --create command actually matters. I re-ran the command like this:
mdadm --create --assume-clean --level=5 --raid-devices=3 /dev/md0 /dev/sda2 /dev/etherd/e4.1 /dev/etherd/e4.2
The array came back online, and I was able to mount it!
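It also turns out there was no need to guess at the order: the original slot of each drive is recorded in its superblock, and something along these lines would have shown it before the first --create overwrote the old superblocks (a sketch; the exact wording depends on the metadata version, with 0.90 showing a "this" line in the device table and 1.x showing a "Device Role" line):

for d in /dev/sda2 /dev/etherd/e4.1 /dev/etherd/e4.2; do echo "== $d"; mdadm -E $d | grep -iE 'this|device role'; done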
So while RAID 5 protects against a single hard drive failing, it does not protect against me running stupid commands on the array. I'm going to have to start backing up my raid arrays onto other drives...
Useful Links