Increasing our storage provision

During the summer we started getting tight on storage availability. It seems that usage on our home directory areas constantly increases – people never delete stuff (me included!). We were running most of our stuff through our Veritas Cluster from a pair of Sun 3511 arrays and a single 3510 array. Between them (taking mirroring into account) we had around 3TB of space.

Now, it’s a well-known fact with maintenance contracts that the cost goes up over time (parts get more scarce and more costly). So we did the sums on the cost we were paying for the old arrays and realised that over a sensible lifetime period it was cheaper to replace them. So we got a pair of Sun 2540 arrays with a 12TB capacity each.

Since our data is absolutely precious we mirror these arrays and use RAID 6. This gives us just under 10 TB of usable space, which is a fair amount more than we started with.
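The arithmetic works out roughly like this (a back-of-the-envelope sketch – the exact drive count of 12 × 1 TB per array is my assumption for illustration, not a spec sheet):

```python
# Back-of-the-envelope capacity check for the mirrored 2540 pair.
# Assumes 12 x 1 TB drives per array (an illustration, not the real layout).
drives_per_array = 12
drive_tb = 1

raw_per_array = drives_per_array * drive_tb      # 12 TB raw per array
raid6_per_array = raw_per_array - 2 * drive_tb   # RAID 6 costs two drives of parity

# Mirroring the two arrays duplicates the data rather than adding capacity,
# so usable space is what one RAID 6 array provides.
usable_tb = raid6_per_array
print(usable_tb)  # 10 -> "just under 10 TB" once filesystem overhead is taken
```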

The next stage was to bring this online. Because we use Veritas Volume Manager and the Veritas File System we were able to do this almost transparently. The new arrays were provisioned and added to the relevant diskgroups. The volumes were then mirrored on to them and then the filesystems expanded. Finally the old arrays were disconnected. All of this was done without any downtime or interruption to our users or services.
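For anyone curious, the sequence looks roughly like this with the standard VxVM tools. This is a sketch rather than a transcript of what we ran – the diskgroup, volume, and device names here are made up:

```shell
# 1. Initialise a new LUN and add it to the diskgroup
#    (device and diskgroup names are hypothetical):
vxdisksetup -i c3t0d0
vxdg -g ResFS adddisk new01=c3t0d0

# 2. Mirror the volume onto the new storage:
vxassist -g ResFS mirror homevol new01

# 3. Grow the volume and the VxFS filesystem together, online:
vxresize -g ResFS homevol +9t

# 4. Once the mirror has synced, detach and remove the plex
#    that lives on the old arrays:
vxplex -g ResFS -o rm dis homevol-01
```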

I said almost transparently though. It seems it’s not possible to change the VCS coordinator disks without taking the diskgroups offline and back online (this might be improved in VCS 5). So I rebooted the whole cluster last weekend and it was all finished.

The problem with all this clever technology? Nobody knows we’ve done it. After weeks of work we grew the filesystems just before they completely filled and without any noticeable downtime. We’d probably get more gratitude if we’d let it fill up first 😉


Dovecot is a neat piece of software

For many years now (since before I started working at the University) we’ve been using University of Washington’s IMAP and POP daemons. They worked well, and (through an old bit of unsupported code) also allowed our MH users to access their email.

As time went on people wanted to do more than UW’s software could offer. Things like nested folders, and server-side caching. That’s when we started running Courier IMAP in parallel. This worked, but required users to use a non-standard set of port numbers.

At the time we looked at Dovecot, but it was fairly new, and we were unsure about trusting it with all our users’ email. That was a few years ago, so I decided this week to take another look. This was mainly driven by demand for faster IMAP access to the Maildir folders served by Courier IMAP.

My first impressions were good. I read through much of the stuff on the Dovecot wiki, and I kept thinking and saying to my colleagues “wow, that’s really neat”. Dovecot came across as a well thought out and well structured program, with a wealth of useful tips and configuration ideas on their website. The level of customisation was good too, right down to letting you drop a shell script in-line in the configuration to tweak things to your exact needs.

After a few days of fiddling around I’ve managed to get a setup working that can replace both of our ageing Courier IMAP and UW IMAP installations. It should be a fairly seamless transition for our users, but I’m sure it won’t be that simple in practice. I’ve written a shell script that automatically detects at runtime where a user’s mail might be and sets the configuration accordingly. The script also allows the user to override the mail location and turn on debugging options.
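The core of such a script is simple enough. This is a hypothetical sketch in the spirit of the wrapper scripts Dovecot’s wiki describes, not our actual production script – the override-file name and spool path here are invented for illustration:

```shell
#!/bin/sh
# Hypothetical sketch: decide where a user's mail lives at runtime.
# The ~/.mail_location override file and /var/mail spool path are
# illustrative assumptions, not our real setup.
detect_mail_location() {
    home="$1"
    user="$2"
    if [ -f "$home/.mail_location" ]; then
        # explicit per-user override wins
        cat "$home/.mail_location"
    elif [ -d "$home/Maildir" ]; then
        # Maildir users (previously served by Courier IMAP)
        echo "maildir:$home/Maildir"
    else
        # fall back to the traditional mbox spool (UW IMAP users)
        echo "mbox:/var/mail/$user"
    fi
}
```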

And then there are the performance issues. One of my colleagues has been having issues with the speed of Courier IMAP, and so far he’s impressed with Dovecot. The main gain here was the ability to store indexes in a separate location. Our mail is stored on an NFS server which becomes a performance bottleneck when using Maildir. Dovecot works around this by storing indexes and caches on a local disk, making response times better.
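The split is configured through the INDEX parameter of mail_location, which is documented Dovecot syntax – though the paths below are examples rather than our real ones:

```
# dovecot.conf fragment (paths are illustrative)
# Mail stays on the NFS server; indexes and caches go to local disk:
mail_location = maildir:~/Maildir:INDEX=/var/dovecot/index/%u

# Recommended by the Dovecot docs when mail lives on NFS:
mmap_disable = yes
```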

Finally, there’s support. I hit a couple of issues getting things set up so I made use of the Dovecot mailing list. The response times in both cases were brilliant, and in both cases I got an answer to my problem straight away (maybe I asked common or stupid questions? 🙂 ).

So Dovecot comes highly recommended from me. Give it a try!

(And what about the MH users? Thankfully most have moved on to other things like Maildir & Thunderbird.)


“Disc quota exceeded”

Today we saw a strange problem on our Solaris hosts that NFS mount VxFS filestore from our Veritas cluster. The users were seeing “Disc quota exceeded” messages, whilst the quota command wasn’t showing they’d hit their limit. After some digging on the cluster node we found the following error message:

Sep 12 11:04:33 bes vxfs: [ID 702911 kern.warning]
WARNING: msgcnt 10 mesg 089: V-2-89:
quotas on /cluster/ResFS invalid;
disk usage for group id 2805 exceeds 2147483646 sectors

Ah-ha! Group quota! We hadn’t even set group quotas, but it appears the system tracks the usage anyway when you mount with -o quota. Some googling turned up a document explaining the limit.

So it turns out there’s a 1TB maximum limit when using quotas. Since we weren’t using group quotas the simple option was to disable them:
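The 1TB figure lines up with the sector count in the warning – it’s 2³¹ − 2 sectors, which at the traditional 512-byte sector size is just shy of 1 TiB:

```python
# Sanity check on the sector count from the VxFS warning message.
limit_sectors = 2147483646
assert limit_sectors == 2**31 - 2

sector_bytes = 512  # traditional disk sector size
limit_bytes = limit_sectors * sector_bytes
one_tib = 2**40
print(one_tib - limit_bytes)  # 1024 -> just 1 KiB short of exactly 1 TiB
```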

vxquotaoff -gv /cluster/ResFS

Then I edited the Mount resource and changed the quota mount option to usrquota.
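In VCS that’s a small change to the Mount resource’s MountOpt attribute – something along these lines, though the resource name here is made up:

```shell
# Sketch of the VCS configuration change (resource name is hypothetical):
haconf -makerw
hares -modify resfs_mnt MountOpt usrquota
haconf -dump -makero
```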

This only alleviates the problem for a while. Eventually someone will need to use 1TB of storage for themselves, but hopefully that’s a little way off yet. Maybe we’ll be using ZFS by then anyway 🙂


CSProjects Upgrade

Since launching CSProjects last year we’ve had nearly 80 projects set up on it. I’m pleased with the success it’s had so far, but I’m still surprised by people not knowing about it. Recently I’ve been trying to do more to publicise it, with some success.

Over the past couple of weeks I’ve spent some time sorting out the upgrade to Trac 0.11. It’s not significantly different from the 0.10.4 version that we’ve been running for the past year, but it did require some config and code changes. We’ve rolled that out today, and so far there haven’t been any problems.

Next on the agenda is upgrading Subversion to 1.5. Hopefully that’ll be fairly easy 🙂


A solution? It’s all Sun’s fault.

I’ve made a couple of posts lately about the problems we’ve been having with one of our Sun 3511 disk arrays. Sun got back to me today with what they thought the problem was. Here’s the gory detail (slightly truncated to fit this blog post):


Ch  Id Chassis Vendor/Product ID        Rev  PLD
 2  28 092131  SUN StorEdge 3511F A     0430 1000
 3  28 092131  SUN StorEdge 3511F D     0420 1100*

* indicates SES or PLD firmware mismatch.

This appears in a few places, so I’ve only quoted the one Sun spotted. It boils down to the components in the array having inconsistent firmware revisions. This could very well have caused the crash we saw yesterday.

This is something I blame Sun for. Firstly, they shipped out a controller with mismatched firmware on it. I suppose this sort of thing can happen, but the field engineer should really have spotted the mistake when he was onsite configuring the replacement controller.

Today Sun wanted to send out another engineer to get the firmware updated, and when I came back from shifting some stuff around I had a voicemail from dispatch. It’s good to see them being so proactive at fixing the issue, although I wonder if it’s because they realised it was their fault?

However, being as I am a sysadmin, I figured I could save everyone a lot of time and hassle if I did the upgrade myself. Sun gave me the link and a couple of hours later it was all sorted (and I even managed to shift some furniture around the building in the middle of it). We’ve agreed with Sun to wait until next week to ensure things are working as they should, then we’ll close the case.

In the meantime I have our largest volume resyncing, but have some familiar looking problems cropping up with the other. Somehow I fear this saga isn’t quite over yet…


Sigh. Stupid array.

After all the fun the other day I was hoping for some time to work on other stuff this week. By the end of the weekend the array had finished syncing and I’d remirrored all the volumes back onto it. It was all ticking over nicely, until this morning…

Unrecoverable Controller Error Encountered !
Resetting Controller !

After which the array disappeared. On arrival at work I power cycled the unit and it came back up without any problems (albeit needing resyncing, again). But this isn’t good enough – the unit is still faulty.

So I’ve logged the case with Sun, again. They’re being remarkably slow to respond today, despite my logging it as high priority. But, as you’d expect, they responded just inside their 4-hour SLA window.

So it continues…


“Any idea WTF is going on?”

“Any idea WTF is going on?” is what I read on my phone as I stumbled out of bed this morning. It was from one of my colleagues who, for some reason I can’t understand, seems to like getting in to work at a ridiculous hour in the morning.

Still half asleep I plodded through to my desk and sat down at my computer. I tried to check my email but nothing was responding. Then I saw the message “NFS server resfs.fs.cs not responding”… and woke up rather quickly. This meant either our network was shafted, or more likely, the cluster had blown up again.

I discovered one of the cluster nodes was offline and marked as failed, and the service group that manages our filestore was also marked as failed. That was odd, but it had happened before. I dug a bit further and found a screenful of SCSI errors. This was bad – something must have gone wrong with the storage.

Next I checked the arrays. The first one I checked had numerous errors on it; failed disks, missing disks, and drive not ready messages. I can’t stress enough how important this data is – it holds files, email and shared areas for all the staff and researchers in our department, and I really didn’t want to explain to them that we’d lost it all (well, we do have backups…). I nervously moved on to check the second array – they were mirrored, so as long as one was OK we’d be fine – and I was delighted to find no error messages.

So, now I knew that the likely cause of the problems was an array failing. It turned out later on to be the controller in this array, which was a good thing because Sun managed to send the wrong disks anyway. The next steps were to get the array fixed and to get things back online. I asked my colleague, who was already in the office, to disconnect the fibres from the failed array (to keep it completely out of the loop whilst it was fixed) and get on to Sun to fix it. Whilst he did that I, still at home, not dressed, and without breakfast, got on with getting things back online.

This, in theory, should have been the easy part. We had a mirrored setup so the plan was to just bring the volume back online with only half of the mirror. No problem, I thought. Except it wouldn’t come online. When the initial problem had occurred the cluster software (VCS) had failed to unmount the disk from the node it was on. It had decided that it needed to do this to bring it online on another node (little did it know that it wouldn’t work on any other node either), so as a last resort it asked the machine to panic. This is something akin to asking it to commit suicide. It duly did its job, but in the process left the disks in an odd state.

When I tried to mount these disks on one of the other nodes I got errors from the volume manager telling me a split brain had occurred (this happens when a live cluster splits in two, but neither half can see the other). I knew that wasn’t the case, so I tried to force the mount. That failed with write errors. After a lot of head scratching I realised it was probably the I/O fencing stopping this node from accessing the disk. Whilst frustrating, it was nice to see the software behaving as it should – in a real split brain situation this is exactly what you want.

A while later I figured out how to clear the SCSI3 reservations on the disks (-o clearreserve option to vxdg import). This was nearly enough. Another issue with the split brain was that the configuration data stored on the disks didn’t quite match (I’m not 100% sure why, but I believe the node that panicked hadn’t managed to consistently update the metadata). After dumping the configuration it was clear that they were identical, bar a revision number, so by using -o selectcp we were able to get the diskgroup imported.

vxdg -fC -o clearreserve -o selectcp=1128804183.107.qetesh \
    import ResFS

Success! The diskgroup was online. From here it was just a case of waiting for fsck to confirm everything looked OK and then unleashing VCS to bring the service group back online.

By this point Sun had sent out an engineer and parts to fix the other array (we get a good service from them, thankfully). That’s currently resyncing its disks, which will take a day or two. Once that’s done we’ll hook it back in to the fibre fabric and bring things back online. It’ll take just as long again to resync the data, but all I have to do is sit and watch 🙂

Finally, after hours of investigation I found out the cause of all the problems. We’ve just ordered a newer, bigger array. The old ones are just jealous.

(And a quick thanks to Pete for his help in debugging things this morning 🙂 )