CSProjects Upgrade

Since launching CSProjects last year we’ve had nearly 80 projects set up on it. I’m pleased with the success it’s had so far, but I’m still surprised by how many people don’t know about it. Recently I’ve been trying to do more to publicise it, with some success.

Over the past couple of weeks I’ve spent some time sorting out the upgrade to Trac 0.11. It’s not significantly different from the 0.10.4 version that we’ve been running for the past year, but it did require some config and code changes. We’ve rolled that out today, and so far there haven’t been any problems.
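
For anyone doing a similar upgrade, most of the effort was in our own config and plugins; the Trac side itself is just the standard environment upgrade step, roughly like this (the environment path below is a placeholder, not our real one):

# Upgrade the environment's database schema, then its default wiki pages
trac-admin /var/trac/example-project upgrade
trac-admin /var/trac/example-project wiki upgrade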

Next on the agenda is upgrading Subversion to 1.5. Hopefully that’ll be fairly easy 🙂
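
If it does turn out to be easy, the repositories themselves should only need their format bumped once the new binaries are in place; something along these lines (the repository path is just an example):

# Bump an existing repository to the Subversion 1.5 format in place
svnadmin upgrade /var/svn/example-repo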

A solution? It’s all Sun’s fault.

I’ve made a couple of posts lately about the problems we’ve been having with one of our Sun 3511 disk arrays. Sun got back to me today with what they thought the problem was. Here’s the gory detail (slightly truncated to fit this blog post):

* SES

Ch  Id Chassis Vendor/Product ID        Rev  PLD
-------------------------------------------------
 2  28 092131  SUN StorEdge 3511F A     0430 1000
 3  28 092131  SUN StorEdge 3511F D     0420 1100*

* indicates SES or PLD firmware mismatch.

This appears in a few places, so I’ve only quoted the one Sun spotted. It boils down to the components in the array having inconsistent firmware revisions. This could very well have caused the crash we saw yesterday.

This is something I blame Sun for. Firstly, they shipped out a controller with mismatched firmware on it. I suppose that sort of thing can happen, but the field engineer really should have spotted the mistake when he was on site configuring the replacement controller.

Today Sun wanted to send out another engineer to get the firmware updated, and when I came back from shifting some stuff around I had a voicemail from dispatch. It’s good to see them being so proactive at fixing the issue, although I wonder if it’s because they realised it was their fault?

However, being as I am a sysadmin, I figured I could save everyone a lot of time and hassle if I did the upgrade myself. Sun gave me the link and a couple of hours later it was all sorted (and I even managed to shift some furniture around the building in the middle of it). We’ve agreed with Sun to wait until next week to ensure things are working as they should, then we’ll close the case.

In the meantime I have our largest volume resyncing, but have some familiar looking problems cropping up with the other. Somehow I fear this saga isn’t quite over yet…

Sigh. Stupid array.

After all the fun the other day I was hoping for some time to work on other stuff this week. By the end of the weekend the array had finished syncing and I’d remirrored all the volumes back on to it. It was all ticking over nicely, until this morning…

Unrecoverable Controller Error Encountered !
Resetting Controller !

After which the array disappeared. On arrival at work I power cycled the unit and it came back up without any problems (albeit needing resyncing, again). But this isn’t good enough – the unit is still faulty.

So I’ve logged the case with Sun, again. They’re being remarkably slow to respond today, despite me logging it as high priority. But, as you’d expect, they got back to me just inside their 4 hour SLA window.

So it continues…

“Any idea WTF is going on?”

“Any idea WTF is going on?” is what I read on my phone as I stumbled out of bed this morning. It was from one of my colleagues who, for some reason I can’t understand, seems to like getting in to work at a ridiculous hour in the morning.

Still half asleep I plodded through to my desk and sat down at my computer. I tried to check my email but nothing was responding. Then I saw the message “NFS server resfs.fs.cs not responding”… and woke up rather quickly. This meant either our network was shafted, or more likely, the cluster had blown up again.

I discovered one of the cluster nodes was offline and marked as failed, and the service group that manages our filestore was also marked as failed. That was odd, but it had happened before. I dug a bit further and found a screenful of SCSI errors. This was bad – something must have gone wrong with the storage.
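
For anyone not familiar with VCS, that check looks roughly like the commands below; the service group name here is a made-up example rather than our real configuration:

# Summarise the state of the cluster: nodes, service groups and resources
hastatus -sum

# Show which nodes the filestore service group is (or isn't) online on
hagrp -state resfs_sg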

Next I checked the arrays. The first one I checked had numerous errors on it: failed disks, missing disks, and drive-not-ready messages. I can’t stress enough how important this data is – it holds files, email and shared areas for all the staff and researchers in our department, and I really didn’t want to explain to them that we’d lost it all (well, we do have backups…). I nervously moved on to check the second array – they were mirrored, so as long as one was OK we’d be fine – and I was delighted to find no error messages.

So, now I knew that the likely cause of the problems was an array failing. It turned out later on to be the controller in this array, which was a good thing because Sun managed to send the wrong disks anyway. The next steps were to get the array fixed and to get things back online. I asked my colleague, who was already in the office, to disconnect the fibres from the failed array (to keep it completely out of the loop whilst it was fixed) and get on to Sun to fix it. Whilst he did that I, still at home, not dressed, and without breakfast, got on with getting things back online.

This, in theory, should have been the easy part. We had a mirrored setup, so the plan was just to bring the volume back online with only half of the mirror. No problem, I thought. Except it wouldn’t come online. When the initial problem had occurred, the cluster software (VCS) had failed to unmount the disk from the node it was on. It needed to do this to bring the disk online on another node (little did it know that it wouldn’t work on any other node either), so as a last resort it asked the machine to panic. This is something akin to asking it to commit suicide. It duly did its job, but in the process left the disks in an odd state.

When I tried to mount these disks on one of the other nodes I got errors from the volume manager telling me a split brain had occurred (this happens when a live cluster splits in two, but neither half can see the other). I knew that wasn’t the case, so I tried to force the mount. That failed with write errors. After a lot of head scratching I realised it was probably the I/O fencing stopping this node from accessing the disk. Whilst frustrating, it was nice to see the software behaving as it should – in a real split brain situation this is exactly what you want.

A while later I figured out how to clear the SCSI3 reservations on the disks (the -o clearreserve option to vxdg import). This was nearly enough. Another issue with the split brain was that the configuration data stored on the disks didn’t quite match (I’m not 100% sure why, but I believe the node that panicked hadn’t managed to consistently update the metadata). After dumping the configuration it was clear that the copies were identical, bar a revision number, so by using -o selectcp we were able to get the diskgroup imported.

vxdg -fC -o clearreserve -o selectcp=1128804183.107.qetesh \
    import ResFS

Success! The diskgroup was online. From here it was just a case of waiting for fsck to confirm everything looked OK and then unleashing VCS to bring the service group back online.
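
Roughly speaking, that last step looked like the commands below; the volume, service group and node names are placeholders rather than our real setup:

# Check the filesystem on the newly imported diskgroup's volume (VxFS)
fsck -F vxfs /dev/vx/rdsk/ResFS/resfs_vol

# Then let VCS bring the service group back up on a surviving node
hagrp -online resfs_sg -sys node2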

By this point Sun had sent out an engineer and parts to fix the other array (we get a good service from them, thankfully). That’s currently resyncing its disks, which will take a day or two. Once that’s done we’ll hook it back in to the fibre fabric and bring things back online. It’ll take just as long again to resync the data, but all I have to do is sit and watch 🙂

Finally, after hours of investigation, I found out the cause of all the problems. We’ve just ordered a newer, bigger array. The old ones are just jealous.

(And a quick thanks to Pete for his help in debugging things this morning 🙂 )

FreeBSD stuff

I’ve done a bit of work on my FreeBSD ports lately. Firstly, after building my new server, I got round to upgrading from SlimServer to SqueezeCenter. This also meant sorting out ports for all the plugins I use. This didn’t take too long, and you can find them all over here. So far I’m liking SqueezeCenter, and I’d highly recommend it (and a SqueezeBox, of course).
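
If you fancy making the same switch from the ports tree, it boils down to something like this (the port origin is an example from my tree, so adjust it to wherever the port actually lives for you):

# Remove the old SlimServer package and build SqueezeCenter from ports
pkg_delete -x slimserver
cd /usr/ports/audio/squeezecenter && make install clean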

I also maintain a port for a suite of software called KRoC. KRoC is written and maintained where I work, so apart from making it available to FreeBSD users I also have an interest in supporting the work done by our department. I’ve been waiting some time for a 1.5.x release of KRoC, but I finally got impatient. I automated the production of snapshots from their stable branch, and updated the port to build from that. I also run a FreeBSD 7 machine in their buildbot system to further test KRoC on my favourite operating system 🙂
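
The snapshot automation itself is nothing clever; it’s just a cron job along these lines, assuming an svn-hosted stable branch (the URL and names here are placeholders, not the real repository):

#!/bin/sh
# Roll a dated tarball of the stable branch for the port to fetch
DATE=$(date +%Y%m%d)
svn export https://example.org/kroc/branches/stable kroc-${DATE}
tar czf kroc-${DATE}.tar.gz kroc-${DATE}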

And in other FreeBSD news, I cast my vote in the FreeBSD Core elections. It’s hard to know who to vote for, but I gave their statements a good read and made a decision. Good luck to them all!
