Bad things come in fives.

Thursday 22 February. That’s the day it all went wrong.

I was on my way home from a shopping trip at Sainsburys. We’d been on a Thursday instead of a Friday because we had to go to Cornwall on the Friday afternoon for a funeral (that’s bad thing number one). Just after leaving Sainsburys we came across an area that was unusually dark. I guessed this was a power cut, so I was beginning to worry about the state of things at work. Then we hit an area that was lit up, then another that was dark, then another that was lit up. Unfortunately we got to University to find it in darkness. That’s bad thing number two.

I sent Ruth home with the shopping and headed in to see what I could do. Our core equipment has a couple of hours UPS provision and the non-core stuff has virtually no UPS provision, so there was no immediate rush – the machines were either fine or already off. This was when the third bad thing happen – the card locks had failed. Although I could get in to the Octagon lobby I was unable to get anywhere else in the building. After some wandering around the caretaker managed to find a way to the machine room bypassing the card locks, but this still left the challenge of entering the room itself.

I waited for someone else to show up, hopefully with a key for the machine room, but nobody did. In the end I gave up and went home to try and use the computer there to reach people. Success! Someone from operations was coming in with a key. I grabbed a quick bit of food and rushed out of the door to meet them. Suddenly I felt a seering pain in my finger. I glanced at it and wondered why it was purple and throbbing with pain. I’d stupidly managed to catch it in my large metal front door. There’s problem number four. After a few minutes of putting ice on it I headed back up to work.

The next part of the story is a bit dull. I shut everything down and went home to sleep. I’d figured I was better off coming in early the next day rather than staying late, particularly with the state of my finger. 7am next morning I headed back.

It took a good few hours to get everything going again. There were plenty of problems with the cluster which all turned out to be related to the last bad thing. The main disk array (a Sun StorEdge 3510) is a fully redundant unit – it has dual controllers, dual PSUs, dual fibre links, and a RAID 5 disk configuration. One of the power supplies is linked to our UPS and the other to the main machine room UPS. In a power failure the machine room UPS lasts only a few minutes so we drop down to only one PSU. So it would help if that PSU was functioning. It all appeared fine – lights were on, fans whiring – but the unit just wouldn’t power on with just that single PSU. Consequently when the power cut occured that disk array went down.

That problem was compounded by the way we’d set up our coordinator disks. We originally only had this one array so we had three coordinator disks on the one array. Later when we added two more arrays we added a coordinator disk on each of those but fataly left three on the original array. When this array turned off we were left with less than a majority of coordinator disks which caused all the cluster nodes to panic – they assumed they were cut off from the main part of the cluster. So even the services that should have remained running were killed off.

The disk array powering off also caused corruption on one of our filesystems. This was easily fixed with a fsck (thank goodness we had VxFS, otherwise all our ACLs would have vanished – last time I checked that was still a problem with UFS). It had also corrupted our MySQL databases, but after some rather long checks these were also fixed. One of the MySQL databases handled our email services, so we had a whole bunch of problems there too.

We’ve now had the PSU replaced by Sun, with unusually little fuss, so maybe it’s a known problem. But we can’t test it properly without potentially killing the array again, so we’re holding off doing it until we next power everything down. I’ve swapped the power cables over so another power failure shouldn’t land us in such a mess.

Isn’t it great when your highly available systems work? ;-)

  • Share/Bookmark

Related posts:

  1. Erm, whoops? I’d finally finished migrating everything off the old myrtle disk arrays, so I was feeling quite pleased. I’d just unplugged the last array from myrtle and plugged it in to the test machine for wiping. Then I tried to log in to the machine room SunRay, but strangely it didn’t work. I checked the console logs for [...]...
  2. A T3 goes bang We have a fairly long standing hatred of the Sun T3 storage arrays, and last night they once again proved why we feel that way. At around 7pm last night I noticed a lot of SCSI errors on myrtle (our staff and research Solaris server) which I quickly tracked down to a problem with one of the attached [...]...
  3. What a weekend On Thursday I arrived back from Cornwall after a fairly lengthy drive, and to get back in to the swing of things I dived right in to the deep end at work. This weekend there was a complete power shutdown on the campus for some “essential electrical work”. This required us to shut down all our [...]...
  4. “Any idea WTF is going on?” “Any idea WTF is going on?” is what I read on my phone as I stumbled out of bed this morning. It was from one of my colleagues who, for some reason I can’t understand, seems to like getting in to work at a ridiculous hour in the morning. Still half asleep I plodded through to [...]...
  5. Increasing our storage provision During the summer we started getting tight on storage availability. It seems that usage on our home directory areas constantly increases – people never delete stuff (me included!). We were running most of our stuff through our Veritas Cluster from a pair of Sun 3511 arrays and a single 3510 array. Between them (taking mirroring [...]...

Leave a Reply