“Any idea WTF is going on?” is what I read on my phone as I stumbled out of bed this morning. It was from one of my colleagues who, for some reason I can’t understand, seems to like getting into work at a ridiculous hour in the morning.
Still half asleep, I plodded through to my desk and sat down at my computer. I tried to check my email but nothing was responding. Then I saw the message “NFS server resfs.fs.cs not responding”… and woke up rather quickly. This meant either our network was shafted, or, more likely, the cluster had blown up again.
I discovered one of the cluster nodes was offline and marked as failed, and the service group that manages our filestore was also marked as failed. That was odd, but it had happened before. I dug a bit further and found a screenful of SCSI errors. This was bad – something must have gone wrong with the storage.
Next I checked the arrays. The first one I checked had numerous errors on it: failed disks, missing disks, and “drive not ready” messages. I can’t stress enough how important this data is – it holds files, email and shared areas for all the staff and researchers in our department, and I really didn’t want to explain to them that we’d lost it all (well, we do have backups…). I nervously moved on to check the second array – they were mirrored, so as long as one was OK we’d be fine – and I was delighted to find no error messages.
So, now I knew that the likely cause of the problems was an array failing. It turned out later on to be the controller in this array, which was a good thing because Sun managed to send the wrong disks anyway. The next steps were to get the array fixed and to get things back online. I asked my colleague, who was already in the office, to disconnect the fibres from the failed array (to keep it completely out of the loop whilst it was fixed) and get on to Sun to fix it. Whilst he did that I, still at home, not dressed, and without breakfast, got on with getting things back online.
This, in theory, should have been the easy part. We had a mirrored setup, so the plan was to just bring the volume back online with only half of the mirror. No problem, I thought. Except it wouldn’t come online. When the initial problem had occurred, the cluster software (VCS) had failed to unmount the disk from the node it was on. It had decided that it needed to do this to bring it online on another node (little did it know that it wouldn’t work on any other node either), so as a last resort it asked the machine to panic. This is something akin to asking it to commit suicide. It duly did its job, but in the process left the disks in an odd state.
When I tried to mount these disks on one of the other nodes I got errors from the volume manager telling me a split brain had occurred (this happens when a live cluster splits in two, but neither half can see the other). I knew that wasn’t the case, so I tried to force the mount. That failed with write errors. After a lot of head scratching I realised it was probably the I/O fencing stopping this node from accessing the disk. Whilst frustrating, it was nice to see the software behaving as it should – in a real split brain situation this is exactly what you want.
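If you ever suspect fencing is the thing blocking you, it’s worth confirming it’s actually running before forcing anything. This is from memory, so check the man page, but the basic status command is:

```shell
# Display the I/O fencing state of this node and the coordinator disks
vxfenadm -d
```

That at least tells you whether fencing is enabled and in what mode, before you go clearing reservations by hand.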
A while later I figured out how to clear the SCSI3 reservations on the disks (the -o clearreserve option to vxdg import). This was nearly enough. Another issue with the split brain was that the configuration data stored on the disks didn’t quite match (I’m not 100% sure why, but I believe the node that panicked hadn’t managed to consistently update the metadata). After dumping the configuration from each copy it was clear that they were identical, bar a revision number, so by using -o selectcp we were able to get the diskgroup imported.
vxdg -fC -o clearreserve -o selectcp=1128804183.107.qetesh \
Success! The diskgroup was online. From here it was just a case of waiting for fsck to confirm everything looked OK and then unleashing VCS to bring the service group back online.
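For anyone who ends up in a similar hole, the rough sequence looked something like this. The diskgroup, volume, service group and node names here are made-up placeholders (ours were different), and the config copy ID is whatever your own dump gives you:

```shell
# Placeholder names throughout: datadg, vol01, filestore_sg, node2.
# Force-import the diskgroup, clearing the stale cluster flag and the
# SCSI-3 reservations, and pick one config copy where the on-disk
# copies disagree:
vxdg -fC -o clearreserve -o selectcp=<config-copy-id> import datadg

# Start the volumes in the freshly imported group:
vxvol -g datadg startall

# fsck the filesystem before letting anything mount it:
fsck -F vxfs /dev/vx/rdsk/datadg/vol01

# Then hand control back to VCS to bring the service group online:
hagrp -online filestore_sg -sys node2
```

Obviously forcing an import and clearing reservations is a last resort – in a genuine split brain these are exactly the safety checks you’d be defeating.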
By this point Sun had sent out an engineer and parts to fix the other array (we get a good service from them, thankfully). That’s currently resyncing its disks, which will take a day or two. Once that’s done we’ll hook it back into the fibre fabric and bring things back online. It’ll take just as long again to resync the data, but all I have to do is sit and watch 🙂
Finally, after hours of investigation, I found out the cause of all the problems. We’ve just ordered a newer, bigger array. The old ones are just jealous.
(And a quick thanks to Pete for his help in debugging things this morning 🙂 )