On Thursday I arrived back from Cornwall after a fairly lengthy drive, and to get back into the swing of things I dived right into the deep end at work.
This weekend there was a complete power shutdown on the campus for some “essential electrical work”. This required us to shut down all our machines, wait a few hours, and then start them all up again. Doesn’t sound too hard, does it?
That’s what I thought anyway. So, to spice things up a bit I figured I’d patch Solaris on all our servers, patch the OBP firmware on all the Sun kit, and update our Veritas cluster with a maintenance pack. My logic was that all these things require downtime, and the cluster upgrade in particular would be quite disruptive, so what better time to do it than when everything is already down? Doing it the usual way would have required downtime on Tuesday mornings for the next month.
On Friday I began patching all the machines I could safely reboot without impacting any of our users. This would have been made easier if our console servers were working, but a quick drive to work fixed that one. This was closely followed by another drive to work to turn the keylock on the servers so I could update the OBP firmwares. At the end of the day I was left with just a few core machines to patch.
Saturday morning started early, some time around 6.30am. I stumbled over to my desk at home to be presented with a dead X session – it looked like Xorg had crashed during the night. After about 15-20 minutes of faffing I had everything up and running again, and all the relevant tools opened. I started patching the remaining machines, and the last few OBP firmwares. After a quick shower I popped up to work for around 8am. The power was scheduled to go off at 9am, so I had an hour to make sure everything was shut down. No real problems there, and I finished with time to spare. Then I twiddled my thumbs for a further 30 minutes waiting for the power to actually go off.
It’s remarkably eerie to be in the machine room in the dark with most things off. The silence is only broken by the beeps from dying UPSs and the relatively quiet whirring from the core networking equipment (which has an impressively large UPS). Anyway, I digress…
Then followed the boring bit – waiting for the power to come back on. I decided to go into town, do some shopping, then head off to Sainsburys for the weekly shop. By the time I’d done all this, and watched a bit of TV, it was 2pm and time to head back to work.
The power was scheduled to come back on by 5pm at the latest, but handily it came on earlier. Sometime around 3pm the lights came back on and the air conditioning kicked in with a massive roar. We waited for a further 30 minutes to get the all clear from maintenance before starting to power things on, though. I used this time to move a machine, repatch some cables, and get the networking back online.
Earlier I mentioned I also wanted to patch the cluster. I decided to do this as the first thing after bringing the power back online. I’d already arranged for the cluster to not fully start up, so all I needed to do was bring the relevant machines back online and kick off the patch installer. Patching went fine, albeit taking a while; I found a Snickers bar was a good way to fill this time. Next I had to start the cluster up, which wasn’t so easy. After getting it running I couldn’t start any services – it kept returning something similar to the following:
Service group has not been fully probed on the system
It took a fair amount of head scratching, and a bit of googling, to realise that I needed to take a copy of the latest types.cf and put it in my VCS config directory. Did I miss that in the upgrade documentation, or was it just not there? Either way, after doing that the cluster started up without any further problems.
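For the record, the fix boiled down to something like this – a sketch assuming the stock VCS layout, with the shipped types.cf under /etc/VRTSvcs/conf and the live config under /etc/VRTSvcs/conf/config (the helper name is mine, not a Veritas tool):

```shell
#!/bin/sh
# Hypothetical helper: copy the shipped types.cf over the live one,
# keeping a backup of the old copy first. A sketch of the manual
# steps, not a Veritas-supplied command.
sync_types_cf() {
    shipped="$1/types.cf"   # e.g. /etc/VRTSvcs/conf/types.cf
    live="$2/types.cf"      # e.g. /etc/VRTSvcs/conf/config/types.cf

    [ -f "$shipped" ] || { echo "no shipped types.cf in $1" >&2; return 1; }
    [ -f "$live" ] && cp "$live" "$live.pre-patch"   # back up the old copy
    cp "$shipped" "$live"
}

# On the real cluster this ran with VCS stopped, roughly:
#   hastop -all -force
#   sync_types_cf /etc/VRTSvcs/conf /etc/VRTSvcs/conf/config
#   hastart
```

Nothing clever, but worth writing down for the next upgrade.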
Next I powered up all the remaining systems in order, which took a while. I did have a couple of problems though:
- One of the mirror service machines panicked on boot. Trying a different kernel fixed that, but the config really should have been right in the first place.
- Our web server has a failing system disk. It’s mirrored, so it’s not a big deal, but the disk keeps limping rather than failing – the result being a pretty slow machine.
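For what it’s worth, a limping disk like this usually shows up in the error counters long before it fails outright. A quick way to spot one is to scan the `iostat -En` output – this sketch assumes the field layout on our Solaris boxes, where the device name and the error counts share the first line of each stanza:

```shell
#!/bin/sh
# List disks whose hard error count is non-zero, reading `iostat -En`
# style output on stdin. Assumed field layout per device:
#   c0t0d0  Soft Errors: 0 Hard Errors: 5 Transport Errors: 0
# so the hard error count is field 7.
flag_bad_disks() {
    awk '/Hard Errors:/ && $7 > 0 { print $1 }'
}

# Typical use on the server itself:
#   iostat -En | flag_bad_disks
```

Had this been running from cron, the webserver disk would have been flagged well before the machine slowed to a crawl.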
At around 7pm I’d got everything going again, so I headed to the office to quickly check my email. When my email client refused to load I was too exhausted to care, so headed home for dinner (and Dr Who).
An hour or so later I went back to try and figure out what was going on. Handily a colleague had also seen the problem and queried whether
lockd was running. That made sense; I assume my mail client couldn’t lock its mailbox on the NFS server. A quick check revealed it wasn’t running on the cluster NFS server. I haven’t investigated why yet, but I hope it’s something I’ve done wrong rather than yet another bug.
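The check itself is trivial – a sketch assuming a pre-SMF Solaris where lockd shows up in the process table as /usr/lib/nfs/lockd (on Solaris 10 you’d ask SMF with `svcs nlockmgr` instead):

```shell
#!/bin/sh
# Report whether lockd appears in a `ps -ef` style listing read on
# stdin; written as a filter so it can be pointed at any host's output.
lockd_running() {
    if grep -q '/usr/lib/nfs/lockd'; then
        echo "lockd running"
    else
        echo "lockd NOT running"
    fi
}

# On the NFS server itself:
#   ps -ef | lockd_running
# If it's missing, statd and lockd can be started again by hand:
#   /usr/lib/nfs/statd; /usr/lib/nfs/lockd
```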
By the time that was sorted, and I’d read all my email, the only remaining thing to do was log the disk fault with Sun. I was surprised to have an engineer get back to me so late on a weekend, but I guess that’s the advantage of Gold support. I eventually gave up conversing by email at around midnight and went to bed.
I awoke later than usual on Sunday thinking everything was fine. I wandered over to my computer, which hadn’t crashed this time, and was rather annoyed at what I saw: a whole bunch of error messages about not being able to contact work servers. A load of things went through my head:
- Has myrtle crashed? Hard to tell; I can’t get at the console.
- Has the power gone off again? I can’t get at any of our machines, but I can get to a service one, so it seems not.
- Has the networking died? I can get to service equipment and can ping at least one of our routers.
So back up to work, again, to see what was going on. I concluded it was a problem with the service router that our router is connected to (it’s happened before), so I pulled the cable out. After a short pause the failover link from our second router to the second service router kicked in, and most stuff started ticking again. There are still routing problems, though, which mean, amongst other things, that we’re only getting some of our email.
Back at home I waded through a whole mass of emails generated by the network outage, and found a few from Sun. They’ve also concluded the disk in the webserver is dead, so hopefully we’ll get that replaced on Tuesday.
That leaves me with a few things to sort out:
- Investigate why lockd wasn’t running on the cluster.
- Find out what happened to the networking.
- Get the disk changed in the webserver.
But they can certainly wait until I’m back at work.