Archive for the ‘Work’ Category

CSProjects is unleashed (at last)

Thursday, October 18th, 2007 in Computing, Work

I started working on CSProjects quite a few months ago.

Problems started early on. I began by bringing our software up-to-date. This included Apache, Python, Subversion, Trac and mod_python. It took some time, but I didn’t experience any problems… until I tried to run them. Seemingly at random, but quite frequently, the Apache children would get a Bus Error. I googled around and discovered this was a fairly common problem, but none of the solutions (mostly involving library versions, particularly expat) seemed to make any difference.

After a few weeks of recompiling, stripping things down to the bare bones, turning on debugging and staring endlessly into the output of gdb, I struck upon a solution. And annoyingly it wasn’t in any of the things I’d been staring at, but instead it appeared in the form of mod_wsgi. This wonderful piece of code does a similar job to mod_python, so I dropped it in and hoped for the best. Nope, it still crashed. But what saved me was the documentation – the author wrote, and I quote:

Do note though that some versions of the Subversion Python bindings apparently have problems when being used from within secondary Python sub interpreters rather than the main Python interpreter. The result of this will be strange Python exceptions or the Apache child processes could even crash.

To avoid such problems, the Trac application should be forced to run within the main Python interpreter. This can be done using the WSGIApplicationGroup directive with the value ‘%{GLOBAL}’.

This was precisely my problem. So I did as suggested and, to my great relief, everything worked. And mod_wsgi ain’t half good too… in my opinion it’s much better than mod_python.
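
For anyone wanting to confirm which interpreter their application actually ends up in, here is a minimal sketch of a WSGI test application. It assumes mod_wsgi exposes the application group in the WSGI environment under the key mod_wsgi.application_group, with an empty string meaning the main interpreter (the ‘%{GLOBAL}’ setting quoted above). The helper name describe_interpreter is made up for illustration, and none of this is part of Trac or CSProjects.

```python
def describe_interpreter(environ):
    """Return a short note on which Python interpreter served the request."""
    # Assumption: mod_wsgi sets this key; it is absent under other WSGI servers.
    group = environ.get("mod_wsgi.application_group")
    if group is None:
        return "not running under mod_wsgi"
    if group == "":
        # An empty group name is assumed to mean the main interpreter.
        return "main interpreter (%{GLOBAL})"
    return "sub interpreter: " + group


def application(environ, start_response):
    """Plain-text WSGI application that reports the interpreter in use."""
    body = describe_interpreter(environ).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Pointing a spare URL at this script before and after adding the WSGIApplicationGroup directive should show the reported group change, assuming the environment key behaves as described.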

At this point I had all the software working. So I took a month off. Literally.

When I returned I had to move all of our frickin’ servers. But after that I got back to CSProjects.

With the help of Adam Sampson I got down to the business of bringing these software packages together into something we could offer our users. We did a lot of coding and a few weeks later the final CSProjects was published. Then we had to change it all and another week later CSProjects was published again. Today we launched it to our users and we already have a whole bunch of people using it. Which brings a satisfying end to a few months of work.

Oh, and the logo. I did that (mostly – I got a little bit of help). Sometimes the simple things work best…

CSProjects

Bad things come in fives.

Wednesday, March 21st, 2007 in Work

Thursday 22 February. That’s the day it all went wrong.

I was on my way home from a shopping trip at Sainsburys. We’d been on a Thursday instead of a Friday because we had to go to Cornwall on the Friday afternoon for a funeral (that’s bad thing number one). Just after leaving Sainsburys we came across an area that was unusually dark. I guessed this was a power cut, so I was beginning to worry about the state of things at work. Then we hit an area that was lit up, then another that was dark, then another that was lit up. Unfortunately we got to University to find it in darkness. That’s bad thing number two.

I sent Ruth home with the shopping and headed in to see what I could do. Our core equipment has a couple of hours of UPS provision and the non-core stuff has virtually no UPS provision, so there was no immediate rush – the machines were either fine or already off. This was when the third bad thing happened – the card locks had failed. Although I could get in to the Octagon lobby I was unable to get anywhere else in the building. After some wandering around, the caretaker managed to find a way to the machine room bypassing the card locks, but this still left the challenge of entering the room itself.

I waited for someone else to show up, hopefully with a key for the machine room, but nobody did. In the end I gave up and went home to try and use the computer there to reach people. Success! Someone from operations was coming in with a key. I grabbed a quick bit of food and rushed out of the door to meet them. Suddenly I felt a searing pain in my finger. I glanced at it and wondered why it was purple and throbbing with pain. I’d stupidly managed to catch it in my large metal front door. There’s problem number four. After a few minutes of putting ice on it I headed back up to work.

The next part of the story is a bit dull. I shut everything down and went home to sleep. I’d figured I was better off coming in early the next day rather than staying late, particularly with the state of my finger. 7am next morning I headed back.

It took a good few hours to get everything going again. There were plenty of problems with the cluster which all turned out to be related to the last bad thing. The main disk array (a Sun StorEdge 3510) is a fully redundant unit – it has dual controllers, dual PSUs, dual fibre links, and a RAID 5 disk configuration. One of the power supplies is linked to our UPS and the other to the main machine room UPS. In a power failure the machine room UPS lasts only a few minutes so we drop down to only one PSU. So it would help if that PSU was functioning. It all appeared fine – lights were on, fans whirring – but the unit just wouldn’t power on with just that single PSU. Consequently when the power cut occurred that disk array went down.

That problem was compounded by the way we’d set up our coordinator disks. We originally only had this one array so we had three coordinator disks on the one array. Later when we added two more arrays we added a coordinator disk on each of those but fatally left three on the original array. When this array turned off we were left with fewer than a majority of coordinator disks, which caused all the cluster nodes to panic – they assumed they were cut off from the main part of the cluster. So even the services that should have remained running were killed off.
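
The arithmetic behind that is easy to sketch. The following is only an illustration of the majority rule as described above, not how the cluster software actually implements it; the array names and the has_quorum function are invented for the example.

```python
def has_quorum(disks_per_array, failed_arrays):
    """Return True if a strict majority of coordinator disks is still reachable."""
    total = sum(disks_per_array.values())
    surviving = sum(
        count for array, count in disks_per_array.items()
        if array not in failed_arrays
    )
    # Strict majority: more than half of all coordinator disks must survive.
    return surviving * 2 > total


# What we had: three disks left on the original array, one on each newer array.
our_layout = {"original": 3, "second": 1, "third": 1}
# One coordinator disk per array (hypothetical alternative layout).
balanced_layout = {"original": 1, "second": 1, "third": 1}

print(has_quorum(our_layout, {"original"}))       # False: 2 of 5 survive
print(has_quorum(balanced_layout, {"original"}))  # True: 2 of 3 survive
```

With one coordinator disk per array, losing any single array still leaves two out of three, so the nodes would not have panicked on that count alone.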

The disk array powering off also caused corruption on one of our filesystems. This was easily fixed with a fsck (thank goodness we had VxFS, otherwise all our ACLs would have vanished – last time I checked that was still a problem with UFS). It had also corrupted our MySQL databases, but after some rather long checks these were also fixed. One of the MySQL databases handled our email services, so we had a whole bunch of problems there too.

We’ve now had the PSU replaced by Sun, with unusually little fuss, so maybe it’s a known problem. But we can’t test it properly without potentially killing the array again, so we’re holding off doing it until we next power everything down. I’ve swapped the power cables over so another power failure shouldn’t land us in such a mess.

Isn’t it great when your highly available systems work? ;-)

NFS Performance, concluded

Tuesday, February 20th, 2007 in Work

Back in the middle of last year I wrote about our plans to tackle our NFS performance issues by introducing a direct and dedicated network link to carry our NFS traffic between the clients and the servers. We’d done the tests so we just had to implement it.

First we waited for the financial year (1 August) to roll over so we had some money, then we went and purchased 2 gigabit switches and 4 quad port network cards for the client machines. We only really needed dual port cards, but the supplier gave a choice of single or quad. The server nodes, which are part of our cluster, already had sufficient ports (8 per node, 6 of which are now used!).

You’d think it’d be pretty simple from there? It wasn’t. We hit a couple of snags:

  1. How do we get the cables directly from one cabinet to another? All our cabling patches back to a central network cabinet, but we don’t have enough patch panel ports to send them all there and back.
  2. What colour cables should we use? We can’t use the normal colour!

After much deliberation other jobs came along and consumed my time for the next few months.

So, a couple of months ago we took the plunge and made an important decision. We got blue cables. This would avoid confusion with our yellow cables (normal network), purple cables (crossover network), red cables (serial network), green cables (serial rolled) or grey cables (to be burnt alive). We also noticed some handy holes in the tops of the cabinets and neatly threaded the link cables to the servers through there.

With the hard work done we set about putting the new cards into the client machines. The first machine was a good test case and I spent a while sorting our configuration and automounter setup to deal with the new link nicely. The remaining machines I’ve done on Tuesday mornings this month, with the last one being done today.

This leaves us with the important question. Is it actually quicker? My raw tests prove it is – ping times are halved and times to transfer large files are also halved. Loading my email in mutt is much quicker, as is listing my overly populated home directory.
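
For the directory listing comparison, something along these lines would do as a rough timer. It is a sketch rather than the test I actually ran; the function name time_listing and the repeat count are arbitrary choices, and repeated listings may be answered from the client’s cache rather than the NFS server, so the first run is often the most telling.

```python
import os
import time


def time_listing(path, repeats=5):
    """List a directory several times; return (entry count, fastest time in seconds)."""
    best = None
    count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        count = len(os.listdir(path))
        elapsed = time.perf_counter() - start
        # Keep the fastest run to reduce noise from other activity.
        if best is None or elapsed < best:
            best = elapsed
    return count, best


if __name__ == "__main__":
    entries, seconds = time_listing(os.path.expanduser("~"))
    print(f"{entries} entries listed in {seconds * 1000:.2f} ms (best of 5)")
```

Running it on a client using the old route and again on one using the new link gives a like-for-like number, provided the two machines are otherwise similarly loaded.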

Sorted then? Not quite – we still have a few users with Exmh slowdown problems. We’re still investigating that. Maybe Exmh is doing something pathologically slow that doesn’t agree with NFS. Or maybe it’s just getting a bit slow in its twilight years. I’ll leave that to the boss (the only Exmh user in our group) to figure out ;-)

NFS Performance, continued

Thursday, July 13th, 2006 in Computing, Work

Back in May I wrote about the performance problems we were having with our new NFS based user filestore. It’s been a while since then, and the problems have continued. We have noticed that it appears to be load related – not just the network, but also the machine. This suggests that our theories about IPsec causing the slowdown may be correct.

Our original plan was to try a private network which would remove the need for IPsec and also remove any latency added by routing the traffic between our subnets. This still seemed like a good plan, so I asked around and another department kindly lent us a brand new gigabit switch. We’ve connected this to one of our NFS clients and to the cluster node that’s currently running our filestore.

So far we’ve noticed some serious performance boosts. There are only a few of us using it, so it could just be that it’s a lightly loaded connection – time will tell on that one. The bottom line is that it seems to be quicker than the IPsec connection ever was, so hopefully we’re on to a winner. We’ve also got a few staff testing it out, and their responses have been positive so far.

The next step after this testing period is to look at the costs of doing this properly with our own equipment. One of the key things we’ve been doing recently is increasing the redundancy of our systems, so it’d be fairly daft to do this with just one switch. We’d need at least two, with every cluster node connected to both, and every client that we want optimum performance on connected to both. Obviously there’ll be other clients that are less important and they can continue to use the existing infrastructure.

Of course, I’ve got absolutely no idea where we’ll put these switches, or how we’ll wire them in – things are pretty tight in our racks at the moment. Suppose there’s got to be a challenge somewhere :-)

My only worry with all this is what we’ll do if it doesn’t work. I don’t have any other ideas that’d make it go quicker – to be frank, you can’t really get any quicker than a directly connected switch. Let’s hope we don’t have to worry about it.

The end of an era, or two

Tuesday, June 6th, 2006 in Work

This week we’ve finally seen the end of some things I’ve been trying to sort out for some time now.

  • The old storage arrays (Sun T3s and A1000s) are finally gone. The T3 arrays in particular have caused us endless grief over the past few years, so I’m more than happy to see them go. It also marks the end of a year long project to centralise our filestore on our resilient cluster. No more losing access to our files when one machine goes down :-)
  • Our last Solaris 8 machine has been decommissioned. We’d stopped supporting it a while ago, but this finally puts the nail in the coffin. More importantly it means I can focus on moving towards Solaris 10, which I hadn’t done until now because I didn’t want to be running 3 different versions of Solaris at once!
  • We’ve finally removed the last non-rackmountable machine from the racks. Actually, it wasn’t even in our racks, so it means we’re now entirely self-contained within our own area. This is something I’ve been trying to do for many years.

So I’m now spending some time looking at Solaris 10 and trying to see how we can integrate the new “features” into our existing systems. The main problem areas seem to be the service management stuff. I’ll undoubtedly post more about that in the future.

NFS+IPsec Performance

Friday, May 12th, 2006 in Computing, Work

We’ve recently moved to having our filestore NFS exported from a cluster. This provides almost complete resilience from hardware failures, and moves us away from depending on individual end-user systems with locally attached filestore.

Given the inherent insecurities with NFS we opted to use IPsec authentication (but not encryption) between the hosts involved. The NFS server only accepts connections from a list of hosts, and we know those hosts are who they say they are by relying on the IPsec authentication. We’ve also made it use privileged ports to ensure local users don’t try any spoofing :-)
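
Boiled down, the server-side decision looks something like the sketch below. This is purely an illustration of the logic, not the NFS server’s code: the host names and the accept_request function are hypothetical, and it takes for granted that IPsec has already authenticated the source host.

```python
# Ports below 1024 can normally only be bound by root on Unix systems.
PRIVILEGED_PORT_LIMIT = 1024


def accept_request(source_host, source_port, allowed_hosts):
    """Accept only requests from a listed host using a privileged source port."""
    from_known_host = source_host in allowed_hosts
    from_privileged_port = 0 < source_port < PRIVILEGED_PORT_LIMIT
    return from_known_host and from_privileged_port


# Hypothetical client list for illustration only.
ALLOWED = {"client1.example.org", "client2.example.org"}

print(accept_request("client1.example.org", 800, ALLOWED))    # True
print(accept_request("client1.example.org", 40000, ALLOWED))  # False: unprivileged port
print(accept_request("rogue.example.org", 800, ALLOWED))      # False: host not listed
```

The host check is only as good as the authentication behind it, which is where IPsec comes in; the port check just stops an ordinary user on a trusted client from crafting their own requests.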

The trade-off here appears to be latency. I’ve done some completely unscientific tests that involved shovelling UDP data at a fixed rate between two machines. These are the “jitter” figures they produced:

  • 0.10ms – direct
  • 0.30ms – via router
  • 0.70ms – via router with IPsec

Bear in mind that those figures might not bear any relation to the latencies involved with NFS packets, but it should give an idea of the relative delays added by routing and IPsec.
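
To put rough numbers on those relative delays, here is a small sketch that works out how much each extra hop adds. The figures are the three jitter values above, converted to whole microseconds to avoid floating point rounding; the function name added_by_each_step is just for illustration.

```python
# Jitter figures from the tests above, in microseconds, in order of added complexity.
JITTER_US = {
    "direct": 100,
    "via router": 300,
    "via router with IPsec": 700,
}


def added_by_each_step(figures):
    """Return the extra jitter each configuration adds over the one before it."""
    names = list(figures)
    return {
        names[i]: figures[names[i]] - figures[names[i - 1]]
        for i in range(1, len(names))
    }


for name, extra in added_by_each_step(JITTER_US).items():
    print(f"{name}: +{extra / 1000:.2f}ms")
```

On these numbers routing adds about 0.20ms and IPsec a further 0.40ms, so IPsec accounts for roughly twice as much of the added jitter as the router does.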

We could, to some extent, reduce those figures by replacing hardware. Quicker routers would undoubtedly remove some of the routing latency, and quicker machines could perform the IPsec calculations faster. But this probably isn’t the cheapest solution.

The first test I want to try is adding a private network between the NFS server and NFS client, with no routing involved. Seeing as it’s private we can reasonably trust that people won’t be able to spoof packets on that network, so we can remove the IPsec authentication. In theory, these differences could significantly reduce the latencies involved.

We’ll continue to monitor this for a while first, though. We need to keep an eye on loading on the NFS server, network usage, and so on. But, at the moment, it seems likely the problems are in the network part of the NFS communication process.

Erm, whoops?

Friday, May 12th, 2006 in Work

I’d finally finished migrating everything off the old myrtle disk arrays, so I was feeling quite pleased. I’d just unplugged the last array from myrtle and plugged it into the test machine for wiping. Then I tried to log in to the machine room SunRay, but strangely it didn’t work.

I checked the console logs for myrtle and was surprised to see it counting “12%… 13%… 14%”. I glanced up and saw my colleagues attempting to come in to the machine room and tell me something, but for some reason they were unable to open the door. Scrolling back over the console logs I saw what it was up to:

panic[cpu2]/thread=2a100105d40: md: Panic due to lack of DiskSuite state database replicas. Fewer than 50% of the total were available, so panic to ensure data integrity.

That made immediate sense to me, and I gave myself a bit of a kick. The RAID system we use for internal disks, DiskSuite (actually Volume Manager now, but it seems they haven’t updated this error message), has state databases stored on every disk. On myrtle we had 6 – two on the internal disks, and one on each of the four disk arrays. You need at least 50% for things to work.

A week or so ago I removed the first pair of arrays without any problems. At that point we had 4 out of 6 databases. Today I removed the last 2 giving us only 2 remaining, which is less than 50%, and the machine dutifully panicked.
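
The sums are simple enough to sketch. This only models the 50% rule as stated in the panic message, not DiskSuite’s real behaviour in every case; the replicas_ok function and the step labels are made up for the example.

```python
def replicas_ok(available, total):
    """Return True while at least 50% of the state database replicas are available."""
    return available * 2 >= total


TOTAL = 6  # two on the internal disks plus one on each of the four arrays

steps = [
    ("all arrays attached", 6),
    ("first pair removed", 4),
    ("last pair removed", 2),
]

for step, available in steps:
    state = "running" if replicas_ok(available, TOTAL) else "panic"
    print(f"{step}: {available}/{TOTAL} replicas -> {state}")
```

The key point is that the total stays at 6 until the replicas are explicitly removed from the configuration, so unplugging an array only shrinks the available count.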

Fixing it was made tricky by the fact that it could no longer mount the root filesystem because the RAID wouldn’t start. Thankfully the arrays were still to hand, so I just plugged them back in. After booting I removed the databases from the arrays, and added an additional one on each of the internal disks – this gives us 4 in total, 2 on each disk, which is what we normally do.

I also used the handy opportunity to mount the new filestore directly on /home and /proj, rather than using symlinks.

I’ll end this post with a bit of a rant. I can understand why the system won’t boot with less than 50% of the state databases – it has no way of knowing if they represent the correct state of things. But, what I don’t understand is why it needs to panic the system when it has less than 50%. It knows the remaining ones are valid because they’re currently in use. In fact panicking just makes it harder for the sysadmin to deal with the problem. Or am I missing something?

A new web site

Thursday, May 11th, 2006 in Work

The Computer Science department got its new website (more of a re-skin, actually) yesterday, so I decided it was about time to update my staff page.

I’ve brought it in-line with the new look of the main website, although it doesn’t completely follow the standard templates. It’s also XHTML 1.1 valid, which makes the pedantic side of me feel much better.

There’s not much else to say – go take a look (unless you’re already there), and let me know what you think.

Strange kerberos problems

Tuesday, May 9th, 2006 in Work

A few days ago one of our users reported that they couldn’t change their password. The error coming out of the passwd command was confusing in itself – it said ‘bad old password’, or similar, which turns out to be a bug in our wrapper script.

After some investigation we discovered that neither kadmin nor kpasswd worked:

tdb [~] % kadmin -p tdb/admin
Enter Password:
kadmin: Operation failed for unspecified reason while
initializing kadmin interface
tdb [~] % kpasswd
kpasswd: Changing password for tdb.
Old password:
kpasswd: Cannot establish a session with the Kerberos
administrative server for realm CS.UKC.AC.UK.
Operation failed for unspecified reason.

The completely unhelpful bit there is the “failed for unspecified reason” error message. How are you meant to even begin debugging that? After a couple of hours digging I logged the call with Sun.

It turns out that there is a known bug:

Document ID:6410919
Title:Patch 112908-24 will cause the kadmin -p kws/admin to exit with a error message

The solution presented was to remove patch 112908-24. This time I’m willing to do that, but from past experience I’d like to see them actually fix the problem rather than just back it out. Or, at the very least, remove the patch from cluster patches. Otherwise in 6 months time I’m left staring at the same problem.

What I’ve found most interesting in all this is that it took the best part of a month for anyone to notice passwords couldn’t be changed :-)

The end of the T3 saga

Friday, May 5th, 2006 in Work

So after copying everyone off the limping T3 arrays I arranged for a Sun engineer to return to site to fix it properly. Sun Dispatch had a bit of a moan because I’d had the parts for too long, but they realised it’d make most sense to keep the parts on site rather than collect them and then send them back to me :-)

After replacing a loop card in the secondary unit the problems mostly went away. I guess this makes sense since it was the second unit we were having trouble connecting to. The engineer also replaced the primary controller which fixed the remaining minor problems. Finally we had a working array.

I’ve now wiped the disks, updated the firmware (not really sure why I did that!), and cleared the NVRAM. They’re all ready to pass on to someone else now.

So, just the failing cluster node and broken KDC to deal with… :-(
