Thursday 22 February. That’s the day it all went wrong.
I was on my way home from a shopping trip at Sainsbury’s. We’d been on a Thursday instead of a Friday because we had to go to Cornwall on the Friday afternoon for a funeral (that’s bad thing number one). Just after leaving Sainsbury’s we came across an area that was unusually dark. I guessed this was a power cut, so I was beginning to worry about the state of things at work. Then we hit an area that was lit up, then another that was dark, then another that was lit up. Unfortunately we got to the University to find it in darkness. That’s bad thing number two.
I sent Ruth home with the shopping and headed in to see what I could do. Our core equipment has a couple of hours of UPS provision and the non-core stuff has virtually none, so there was no immediate rush – the machines were either fine or already off. This was when the third bad thing happened – the card locks had failed. Although I could get in to the Octagon lobby I was unable to get anywhere else in the building. After some wandering around the caretaker managed to find a way to the machine room bypassing the card locks, but this still left the challenge of entering the room itself.
I waited for someone else to show up, hopefully with a key for the machine room, but nobody did. In the end I gave up and went home to try and use the computer there to reach people. Success! Someone from operations was coming in with a key. I grabbed a quick bit of food and rushed out of the door to meet them. Suddenly I felt a searing pain in my finger. I glanced at it and wondered why it was purple and throbbing with pain. I’d stupidly managed to catch it in my large metal front door. There’s problem number four. After a few minutes of putting ice on it I headed back up to work.
The next part of the story is a bit dull. I shut everything down and went home to sleep. I’d figured I was better off coming in early the next day rather than staying late, particularly with the state of my finger. 7am next morning I headed back.
It took a good few hours to get everything going again. There were plenty of problems with the cluster which all turned out to be related to the last bad thing. The main disk array (a Sun StorEdge 3510) is a fully redundant unit – it has dual controllers, dual PSUs, dual fibre links, and a RAID 5 disk configuration. One of the power supplies is linked to our UPS and the other to the main machine room UPS. In a power failure the machine room UPS lasts only a few minutes so we drop down to only one PSU. So it would help if that PSU was functioning. It all appeared fine – lights were on, fans whirring – but the unit just wouldn’t power on with just that single PSU. Consequently when the power cut occurred that disk array went down.
That problem was compounded by the way we’d set up our coordinator disks. We originally only had this one array so we had three coordinator disks on the one array. Later when we added two more arrays we added a coordinator disk on each of those but fatally left three on the original array. When this array turned off we were left with only two of the five coordinator disks, less than a majority, which caused all the cluster nodes to panic – they assumed they were cut off from the main part of the cluster. So even the services that should have remained running were killed off.
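To make the arithmetic concrete, here’s a rough sketch of the majority check in Python. It is not how the cluster software actually implements it; the array names, the `has_majority` and `fatal_arrays` functions and the "one disk per array" layout are all made up for illustration, and the only thing taken from the story above is the 3 + 1 + 1 split.

```python
# Hypothetical layout: array name -> number of coordinator disks it holds.
# The 3 + 1 + 1 split matches the setup described above; the names are invented.
layout = {"original": 3, "second": 1, "third": 1}


def has_majority(layout, failed):
    """Return True if the disks on surviving arrays form a strict majority."""
    total = sum(layout.values())
    surviving = sum(n for array, n in layout.items() if array not in failed)
    return surviving > total / 2


def fatal_arrays(layout):
    """List the arrays whose loss alone would leave less than a majority."""
    return [array for array in layout if not has_majority(layout, {array})]


# Losing the original array leaves 2 of 5 disks, so the majority is gone.
print(has_majority(layout, {"original"}))

# Assumed alternative: one coordinator disk per array, so any single
# array can fail and 2 of 3 disks still remain.
balanced = {"original": 1, "second": 1, "third": 1}
print(fatal_arrays(layout), fatal_arrays(balanced))
```

With the lopsided layout the original array is a single point of failure for the whole cluster, which is exactly what bit us. Spreading the disks evenly would, under these assumptions, let any one array disappear without the nodes panicking.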
The disk array powering off also caused corruption on one of our filesystems. This was easily fixed with an fsck (thank goodness we had VxFS, otherwise all our ACLs would have vanished – last time I checked that was still a problem with UFS). It had also corrupted our MySQL databases, but after some rather long checks these were also fixed. One of the MySQL databases handled our email services, so we had a whole bunch of problems there too.
We’ve now had the PSU replaced by Sun, with unusually little fuss, so maybe it’s a known problem. But we can’t test it properly without potentially killing the array again, so we’re holding off doing it until we next power everything down. I’ve swapped the power cables over so another power failure shouldn’t land us in such a mess.
Isn’t it great when your highly available systems work? 😉
 
							