“Any idea WTF is going on?” is what I read on my phone as I stumbled out of bed this morning. It was from one of my colleagues who, for some reason I can’t understand, seems to like getting in to work at a ridiculous hour in the morning.
Still half asleep, I plodded through to my desk and sat down at my computer. I tried to check my email but nothing was responding. Then I saw the message “NFS server resfs.fs.cs not responding”… and woke up rather quickly. This meant either our network was shafted or, more likely, the cluster had blown up again.
I discovered one of the cluster nodes was offline and marked as failed, and the service group that manages our filestore was also marked as failed. That was odd, but it had happened before. I dug a bit further and found a screenful of SCSI errors. This was bad – something must have gone wrong with the storage.
Next I checked the arrays. The first one I checked had numerous errors on it: failed disks, missing disks, and drive not ready messages. I can’t stress enough how important this data is – it holds files, email and shared areas for all the staff and researchers in our department, and I really didn’t want to explain to them that we’d lost it all (well, we do have backups…). I nervously moved on to check the second array – they were mirrored, so as long as one was OK we’d be fine – and I was delighted to find no error messages.
So, now I knew that the likely cause of the problems was an array failing. It turned out later on to be the controller in this array, which was a good thing because Sun managed to send the wrong disks anyway. The next steps were to get the array fixed and to get things back online. I asked my colleague, who was already in the office, to disconnect the fibres from the failed array (to keep it completely out of the loop whilst it was fixed) and get on to Sun to fix it. Whilst he did that I, still at home, not dressed, and without breakfast, got on with getting things back online.
This, in theory, should have been the easy part. We had a mirrored setup so the plan was to just bring the volume back online with only half of the mirror. No problem, I thought. Except it wouldn’t come online. When the initial problem had occurred the cluster software (VCS) had failed to unmount the disk from the node it was on. It had decided that it needed to do this to bring it online on another node (little did it know that it wouldn’t work on any other node either), so as a last resort it asked the machine to panic. This is something akin to asking it to commit suicide. It duly did its job, but in the process left the disks in an odd state.
When I tried to mount these disks on one of the other nodes I got errors from the volume manager telling me a split brain had occurred (this happens when a live cluster splits in two and neither half can see the other). I knew that wasn’t the case, so I tried to force the mount. That failed with write errors. After a lot of head scratching I realised it was probably the I/O fencing stopping this node from accessing the disk. Whilst frustrating, it was nice to see the software behaving as it should – in a real split brain situation this is exactly what you want.
A while later I figured out how to clear the SCSI3 reservations on the disks (the -o clearreserve option to vxdg import). This was nearly enough. Another issue with the split brain was that the copies of the configuration data stored on the disks didn’t quite match (I’m not 100% sure why, but I believe the node that panicked hadn’t managed to consistently update the metadata). After dumping the configuration copies it was clear that they were identical, bar a revision number, so by using -o selectcp to pick one of them we were able to get the diskgroup imported.
vxdg -fC -o clearreserve -o selectcp=1128804183.107.qetesh \
import ResFS
Success! The diskgroup was online. From here it was just a case of waiting for fsck to confirm everything looked OK and then unleashing VCS to bring the service group back online.
By this point Sun had sent out an engineer and parts to fix the other array (we get a good service from them, thankfully). That’s currently resyncing its disks, which will take a day or two. Once that’s done we’ll hook it back in to the fibre fabric and bring things back online. It’ll take just as long again to resync the data, but all I have to do is sit and watch.
After hours of investigation I finally found out the cause of all the problems. We’ve just ordered a newer, bigger array. The old ones are just jealous.
(And a quick thanks to Pete for his help in debugging things this morning.)
You were doing stuff without having had breakfast?!
I am horrified.
Still it was a fun day.