Archive for the ‘Work’ Category

CSProjects Upgrade

Thursday, July 24th, 2008 in Work

Since launching CSProjects last year we’ve had nearly 80 projects set up on it. I’m pleased with the success it’s had so far, but I’m still surprised by people not knowing about it. Recently I’ve been trying to do more to publicise it, with some success.

Over the past couple of weeks I’ve spent some time sorting out the upgrade to Trac 0.11. It’s not significantly different from the 0.10.4 version that we’ve been running for the past year, but it did require some config and code changes. We’ve rolled that out today, and so far there haven’t been any problems.

Next on the agenda is upgrading Subversion to 1.5. Hopefully that’ll be fairly easy :-)

A solution? It’s all Sun’s fault.

Tuesday, July 22nd, 2008 in Work

I’ve made a couple of posts lately about the problems we’ve been having with one of our Sun 3511 disk arrays. Sun got back to me today with what they thought the problem was. Here’s the gory detail (slightly truncated to fit this blog post):

* SES

Ch  Id Chassis Vendor/Product ID        Rev  PLD
-------------------------------------------------
 2  28 092131  SUN StorEdge 3511F A     0430 1000
 3  28 092131  SUN StorEdge 3511F D     0420 1100*

* indicates SES or PLD firmware mismatch.

This appears in a few places, so I’ve only quoted the one Sun spotted. It boils down to the components in the array having inconsistent firmware revisions. This could very well have caused the crash we saw yesterday.

This is something I blame Sun for. Firstly, they shipped out a controller with mismatched firmware on it. I guess this sort of thing might happen, but the field engineer should really have spotted the mistake when he was onsite and getting the replacement controller configured.

Today Sun wanted to send out another engineer to get the firmware updated, and when I came back from shifting some stuff around I had a voicemail from dispatch. It’s good to see them being so proactive at fixing the issue, although I wonder if it’s because they realised it was their fault?

However, being as I am a sysadmin, I figured I could save everyone a lot of time and hassle if I did the upgrade myself. Sun gave me the link and a couple of hours later it was all sorted (and I even managed to shift some furniture around the building in the middle of it). We’ve agreed with Sun to wait until next week to ensure things are working as they should, then we’ll close the case.

In the meantime I have our largest volume resyncing, but have some familiar looking problems cropping up with the other. Somehow I fear this saga isn’t quite over yet…

Sigh. Stupid array.

Monday, July 21st, 2008 in Work

After all the fun the other day I was hoping for some time to work on other stuff this week. By the end of the weekend the array had finished syncing and I’d remirrored all the volumes back on to it. It was all ticking over nicely, until this morning…

Unrecoverable Controller Error Encountered !
Resetting Controller !

After which the array disappeared. On arrival at work I power cycled the unit and it came back up without any problems (albeit needing resyncing, again). But this isn’t good enough - the unit is still faulty.

So I’ve logged the case with Sun, again. They’re being remarkably slow to respond today, despite me logging it as high priority. But, as you’d expect, they just got it within their 4 hour SLA window.

So it continues…

“Any idea WTF is going on?”

Wednesday, July 16th, 2008 in Computing, Work

“Any idea WTF is going on?” is what I read on my phone as I stumbled out of bed this morning. It was from one of my colleagues who, for some reason I can’t understand, seems to like getting in to work at a ridiculous hour in the morning.

Still half asleep I plodded through to my desk and sat down at my computer. I tried to check my email but nothing was responding. Then I saw the message “NFS server resfs.fs.cs not responding”… and woke up rather quickly. This meant either our network was shafted, or more likely, the cluster had blown up again.

I discovered one of the cluster nodes was offline and marked as failed, and the service group that manages our filestore was also marked as failed. That was odd, but it had happened before. I dug a bit further and found a screenful of SCSI errors. This was bad - something must have gone wrong with the storage.

Next I checked the arrays. The first one I checked had numerous errors on it; failed disks, missing disks, and drive not ready messages. I can’t stress enough how important this data is - it holds files, email and shared areas for all the staff and researchers in our department, and I really didn’t want to explain to them that we’d lost it all (well, we do have backups…). I nervously moved on to check the second array - they were mirrored, so as long as one was OK we’d be fine - and I was delighted to find no error messages.

So, now I knew that the likely cause of the problems was an array failing. It turned out later on to be the controller in this array, which was a good thing because Sun managed to send the wrong disks anyway. The next steps were to get the array fixed and to get things back online. I asked my colleague, who was already in the office, to disconnect the fibres from the failed array (to keep it completely out of the loop whilst it was fixed) and get on to Sun to fix it. Whilst he did that I, still at home, not dressed, and without breakfast, got on with getting things back online.

This, in theory, should have been the easy part. We had a mirrored setup so the plan was to just bring the volume back online with only half of the mirror. No problem, I thought. Except when it wouldn’t come online. When the initial problem had occurred the cluster software (VCS) had failed to unmount the disk from the node it was on. It had decided that it needed to do this to bring it online on another node (little did it know that it wouldn’t work on any other node either), so as a last resort it asked the machine to panic. This is something akin to asking it to commit suicide. It duly did its job, but in the process left the disks in an odd state.

When I tried to mount these disks on one of the other nodes I got errors from the volume manager telling me a split brain had occurred (this happens when a live cluster splits in two, but neither half can see the other). I knew that wasn’t the case, so I tried to force the mount. That failed with write errors. After a lot of head scratching I realised it was probably the I/O fencing stopping this node from accessing the disk. Whilst frustrating, it was nice to see the software behaving as it should - in a real split brain situation this is exactly what you want.

A while later I figured out how to clear the SCSI3 reservations on the disks (-o clearreserve option to vxdg import). This was nearly enough. Another issue with the split brain was that the configuration data stored on the disks didn’t quite match (I’m not 100% sure why, but I believe the node that panicked hadn’t managed to consistently update the metadata). After dumping the configuration it was clear that they were identical, bar a revision number, so by using -o selectcp we were able to get the diskgroup imported.

vxdg -fC -o clearreserve -o selectcp=1128804183.107.qetesh \
    import ResFS

Success! The diskgroup was online. From here it was just a case of waiting for fsck to confirm everything looked OK and then unleashing VCS to bring the service group back online.

By this point Sun had sent out an engineer and parts to fix the other array (we get a good service from them, thankfully). That’s currently resyncing its disks, which will take a day or two. Once that’s done we’ll hook it back in to the fibre fabric and bring things back online. It’ll take just as long again to resync the data, but all I have to do is sit and watch :-)

After hours of investigation I finally found out the cause of all the problems. We’ve just ordered a newer, bigger array. The old ones are just jealous.

(And a quick thanks to Pete for his help in debugging things this morning :-) )

CSProjects is unleashed (at last)

Thursday, October 18th, 2007 in Computing, Work

I started working on CSProjects quite a few months ago.

Problems started early on. I began by bringing our software up-to-date. This included Apache, Python, Subversion, Trac and mod_python. It took some time, but I didn’t experience any problems… until I tried to run them. Seemingly at random, but quite frequently, the Apache children would get a Bus Error. I googled around and discovered this was a fairly common problem, but none of the solutions (mostly involving library versions, particularly expat) seemed to make any difference.

After a few weeks of recompiling, stripping things down to the bare bones, turning on debugging and staring endlessly in to the output of gdb, I struck upon a solution. And annoyingly it wasn’t in any of the things I’d been staring at, but instead it appeared in the form of mod_wsgi. This wonderful piece of code does a similar job to mod_python, so I dropped it in and hoped for the best. Nope, it still crashed. But what saved me was the documentation - the author wrote, and I quote:

Do note though that some versions of the Subversion Python bindings apparently have problems when being used from within secondary Python sub interpreters rather than the main Python interpreter. The result of this will be strange Python exceptions or the Apache child processes could even crash.

To avoid such problems, the Trac application should be forced to run within the main Python interpreter. This can be done using the WSGIApplicationGroup directive with the value ‘%{GLOBAL}’.

This was precisely my problem. So I did as suggested and to much relief everything worked. And mod_wsgi ain’t half good too… in my opinion it’s much better than mod_python.

At this point I had all the software working. So I took a month off. Literally.

When I returned I had to move all of our frickin’ servers. But after that I got back to CSProjects.

With the help of Adam Sampson I got down to the business of bringing these software packages together in to something we could offer our users. We did a lot of coding and a few weeks later the final CSProjects was published. Then we had to change it all and another week later CSProjects was published again. Today we launched it to our users and we already have a whole bunch of people using it. Which brings a satisfying end to a few months of work.

Oh, and the logo. I did that (mostly - I got a little bit of help). Sometimes the simple things work best…

CSProjects

Bad things come in fives.

Wednesday, March 21st, 2007 in Work

Thursday 22 February. That’s the day it all went wrong.

I was on my way home from a shopping trip at Sainsburys. We’d been on a Thursday instead of a Friday because we had to go to Cornwall on the Friday afternoon for a funeral (that’s bad thing number one). Just after leaving Sainsburys we came across an area that was unusually dark. I guessed this was a power cut, so I was beginning to worry about the state of things at work. Then we hit an area that was lit up, then another that was dark, then another that was lit up. Unfortunately we got to University to find it in darkness. That’s bad thing number two.

I sent Ruth home with the shopping and headed in to see what I could do. Our core equipment has a couple of hours UPS provision and the non-core stuff has virtually no UPS provision, so there was no immediate rush - the machines were either fine or already off. This was when the third bad thing happened - the card locks had failed. Although I could get in to the Octagon lobby I was unable to get anywhere else in the building. After some wandering around the caretaker managed to find a way to the machine room bypassing the card locks, but this still left the challenge of entering the room itself.

I waited for someone else to show up, hopefully with a key for the machine room, but nobody did. In the end I gave up and went home to try and use the computer there to reach people. Success! Someone from operations was coming in with a key. I grabbed a quick bit of food and rushed out of the door to meet them. Suddenly I felt a searing pain in my finger. I glanced at it and wondered why it was purple and throbbing with pain. I’d stupidly managed to catch it in my large metal front door. There’s problem number four. After a few minutes of putting ice on it I headed back up to work.

The next part of the story is a bit dull. I shut everything down and went home to sleep. I’d figured I was better off coming in early the next day rather than staying late, particularly with the state of my finger. 7am next morning I headed back.

It took a good few hours to get everything going again. There were plenty of problems with the cluster which all turned out to be related to the last bad thing. The main disk array (a Sun StorEdge 3510) is a fully redundant unit - it has dual controllers, dual PSUs, dual fibre links, and a RAID 5 disk configuration. One of the power supplies is linked to our UPS and the other to the main machine room UPS. In a power failure the machine room UPS lasts only a few minutes so we drop down to only one PSU. So it would help if that PSU was functioning. It all appeared fine - lights were on, fans whirring - but the unit just wouldn’t power on with just that single PSU. Consequently when the power cut occurred that disk array went down.

That problem was compounded by the way we’d set up our coordinator disks. We originally only had this one array so we had three coordinator disks on the one array. Later when we added two more arrays we added a coordinator disk on each of those but fatally left three on the original array. When this array turned off we were left with less than a majority of coordinator disks which caused all the cluster nodes to panic - they assumed they were cut off from the main part of the cluster. So even the services that should have remained running were killed off.
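The arithmetic behind that failure is easy to sketch in a few lines of Python. This is only a toy model of the majority rule described above, not how VCS actually implements fencing. The array names and the has_majority helper are made up for illustration, and the "balanced" layout is just one possible alternative, not necessarily what we ended up with.

```python
# Toy model of the coordinator disk majority rule described above.
# The array names and this helper are hypothetical, for illustration only.

def has_majority(layout, failed_array):
    """Return True if more than half of the coordinator disks are still visible."""
    total = sum(layout.values())
    visible = total - layout.get(failed_array, 0)
    return visible * 2 > total

# What we had: three disks on the original array, one on each newer array.
lopsided = {"original": 3, "new1": 1, "new2": 1}

# An assumed alternative: one coordinator disk per array.
balanced = {"original": 1, "new1": 1, "new2": 1}

# Losing the original array leaves 2 of 5 disks in the first layout,
# but 2 of 3 in the second.
print(has_majority(lopsided, "original"))
print(has_majority(balanced, "original"))
```

With the lopsided layout, losing the one array that holds three of the five disks is enough to lose the majority, whereas losing either of the newer arrays is not.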

The disk array powering off also caused corruption on one of our filesystems. This was easily fixed with a fsck (thank goodness we had VxFS, otherwise all our ACLs would have vanished - last time I checked that was still a problem with UFS). It had also corrupted our MySQL databases, but after some rather long checks these were also fixed. One of the MySQL databases handled our email services, so we had a whole bunch of problems there too.

We’ve now had the PSU replaced by Sun, with unusually little fuss, so maybe it’s a known problem. But we can’t test it properly without potentially killing the array again, so we’re holding off doing it until we next power everything down. I’ve swapped the power cables over so another power failure shouldn’t land us in such a mess.

Isn’t it great when your highly available systems work? ;-)

NFS Performance, concluded

Tuesday, February 20th, 2007 in Work

Back in the middle of last year I wrote about our plans to tackle our NFS performance issues by introducing a direct and dedicated network link to carry our NFS traffic between the clients and the servers. We’d done the tests so we just had to implement it.

First we waited for the financial year (1 August) to roll over so we had some money, then we went and purchased 2 gigabit switches and 4 quad port network cards for the client machines. We only really needed dual port cards, but the supplier gave a choice of single or quad. The server nodes, which are part of our cluster, already had sufficient ports (8 per node, 6 of which are now used!).

You’d think it’d be pretty simple from there? It wasn’t. We hit a couple of snags:

  1. How do we get the cables directly from one cabinet to another? All our cabling patches back to a central network cabinet, but we don’t have enough patch panel ports to send them all there and back.
  2. What colour cables should we use? We can’t use the normal colour!

After much deliberation other jobs came along and consumed my time for the next few months.

So, a couple of months ago we took the plunge and made an important decision. We got blue cables. This would avoid confusion with our yellow cables (normal network), purple cables (crossover network), red cables (serial network), green cables (serial rolled) or grey cables (to be burnt alive). We also noticed some handy holes in the tops of the cabinets and neatly threaded the link cables to the servers through there.

With the hard work done we set about putting the new cards in to the client machines. The first machine was a good test case and I spent a while sorting our configuration and automounter setup to deal with the new link nicely. The remaining machines I’ve done on Tuesday mornings this month, with the last one being done today.

This leaves us with the important question. Is it actually quicker? My raw tests prove it is - ping times are halved and times to transfer large files are also halved. Loading my email in mutt is much quicker, as is listing my overly populated home directory.

Sorted then? Not quite - we still have a few users with Exmh slowdown problems. We’re still investigating that. Maybe Exmh is doing something pathologically slow that doesn’t agree with NFS. Or maybe it’s just getting a bit slow in its twilight years. I’ll leave that to the boss (the only Exmh user in our group) to figure out ;-)

NFS Performance, continued

Thursday, July 13th, 2006 in Computing, Work

Back in May I wrote about the performance problems we were having with our new NFS based user filestore. It’s been a while since then, and the problems have continued. We have noticed that it appears to be load related - not just the network, but also the machine. This suggests that our theories about IPsec causing the slow down may be correct.

Our original plan was to try a private network which would remove the need for IPsec and also remove any latency added by routing the traffic between our subnets. This still seemed like a good plan, so I asked around and another department kindly lent us a brand new gigabit switch. We’ve connected this to one of our NFS clients and to the cluster node that’s currently running our filestore.

So far we’ve noticed some serious performance boosts. There’s only a few of us using it, so it could just be that it’s a lightly loaded connection - time will tell on that one. The bottom line is that it seems to be quicker than the IPsec connection ever was, so hopefully we’re on to a winner. We’ve also got a few staff testing it out, and their responses have been positive so far.

The next step after this testing period is to look at the costs of doing this properly with our own equipment. One of the key things we’ve been doing recently is increasing the redundancy of our systems, so it’d be fairly daft to do this with just one switch. We’d need at least two, with every cluster node connected to both, and every client that we want optimum performance on connected to both. Obviously there’ll be other clients that are less important and they can continue to use the existing infrastructure.

Of course, I’ve got absolutely no idea where we’ll put these switches, or how we’ll wire them in - things are pretty tight in our racks at the moment. Suppose there’s got to be a challenge somewhere :-)

My only worry with all this is what we’ll do if it doesn’t work. I don’t have any other ideas that’d make it go quicker - to be frank, you can’t really get any quicker than a directly connected switch. Let’s hope we don’t have to worry about it.

The end of an era, or two

Tuesday, June 6th, 2006 in Work

This week we’ve finally seen the end of some things I’ve been trying to sort out for some time now.

  • The old storage arrays (Sun T3s and A1000s) are finally gone. The T3 arrays in particular have caused us endless grief over the past few years, so I’m more than happy to see them go. It also marks the end of a year long project to centralise our filestore on our resilient cluster. No more losing access to our files when one machine goes down :-)
  • Our last Solaris 8 machine has been decommissioned. We’d stopped supporting it a while ago, but this finally puts the nail in the coffin. More importantly it means I can focus on moving towards Solaris 10, which I hadn’t done until now because I didn’t want to be running 3 different versions of Solaris at once!
  • We’ve finally removed the last non-rackmountable machine from the racks. Actually, it wasn’t even in our racks, so it means we’re now entirely self-contained within our own area. This is something I’ve been trying to do for many years.

So I’m now spending some time looking at Solaris 10 and trying to see how we can integrate the new “features” in to our existing systems. The main problem areas seem to be the service management stuff. I’ll undoubtedly post more about that in the future.

NFS+IPsec Performance

Friday, May 12th, 2006 in Computing, Work

We’ve recently moved to having our filestore NFS exported from a cluster. This provides almost complete resilience from hardware failures, and moves us away from depending on individual end-user systems with locally attached filestore.

Given the inherent insecurities with NFS we opted to use IPsec authentication (but not encryption) between the hosts involved. The NFS server only accepts connections from a list of hosts, and we know those hosts are who they say they are by relying on the IPsec authentication. We’ve also made it use privileged ports to ensure local users don’t try any spoofing :-)
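To make the three checks concrete, here is a rough Python sketch of the policy. It is purely illustrative: the host names are invented, the accept_request function is hypothetical, and a real NFS server enforces this through its export configuration and the IPsec policy rather than anything resembling this code.

```python
# Toy version of the access policy described above. The host names are
# made up, and a real NFS server enforces this through its export
# configuration, not through code like this.

ALLOWED_HOSTS = {"client-a.example", "client-b.example"}  # hypothetical names
PRIVILEGED_PORT_LIMIT = 1024  # on Unix, only root can bind ports below this

def accept_request(host, source_port, ipsec_authenticated):
    """Decide whether a request passes all three checks."""
    if not ipsec_authenticated:
        # Without IPsec authentication we can't trust the claimed host name.
        return False
    if host not in ALLOWED_HOSTS:
        return False
    # An unprivileged local user can't send from a port below 1024.
    return source_port < PRIVILEGED_PORT_LIMIT

print(accept_request("client-a.example", 800, True))
print(accept_request("client-a.example", 40000, True))
print(accept_request("client-a.example", 800, False))
```

Each check covers a gap the others leave open: the host list alone is useless if addresses can be spoofed, and the authenticated host list alone doesn’t stop an ordinary user on a trusted machine.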

The trade-off here appears to be latency. I’ve done some completely unscientific tests that involved shovelling UDP data at a fixed rate between two machines. These are the “jitter” figures they produced:

  • 0.10ms - direct
  • 0.30ms - via router
  • 0.70ms - via router with IPsec

Bear in mind that those figures might not bear any relation to the latencies involved with NFS packets, but it should give an idea of the relative delays added by routing and IPsec.
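For what it’s worth, the relative delays work out as below. This is just arithmetic on the three figures quoted above, written as a small Python snippet; the variable names are my own.

```python
# Jitter figures quoted above, in milliseconds.
direct = 0.10
via_router = 0.30
via_router_ipsec = 0.70

# Differences between consecutive setups, rounded to avoid floating point noise.
routing_added = round(via_router - direct, 2)
ipsec_added = round(via_router_ipsec - via_router, 2)

print(f"routing adds {routing_added:.2f}ms, IPsec adds {ipsec_added:.2f}ms")
print(f"routed IPsec is {via_router_ipsec / direct:.0f}x the direct figure")
```

On these numbers IPsec accounts for roughly twice as much added jitter as the routing does.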

We could, to some extent, reduce those figures by replacing hardware. Quicker routers would undoubtedly remove some of the routing latency, and quicker machines could perform the IPsec calculations faster. But this probably isn’t the cheapest solution.

The first test I want to try is adding a private network between the NFS server and NFS client, with no routing involved. Seeing as it’s private we can reasonably trust that people won’t be able to spoof packets on that network, which means we can remove the IPsec authentication. In theory, these differences could significantly reduce the latencies involved.

We’ll continue to monitor this for a while first, though. We need to keep an eye on loading on the NFS server, network usage, and so on. But, at the moment, it seems likely the problems are in the network part of the NFS communication process.