Archive for the ‘Work’ Category

NFS Performance, continued

Thursday, July 13th, 2006 in Computing, Work

Back in May I wrote about the performance problems we were having with our new NFS-based user filestore. It’s been a while since then, and the problems have continued. We have noticed that they appear to be load related - not just load on the network, but also on the machine. This suggests that our theories about IPsec causing the slowdown may be correct.

Our original plan was to try a private network which would remove the need for IPsec and also remove any latency added by routing the traffic between our subnets. This still seemed like a good plan, so I asked around and another department kindly lent us a brand new gigabit switch. We’ve connected this to one of our NFS clients and to the cluster node that’s currently running our filestore.

So far we’ve noticed some serious performance boosts. There’s only a few of us using it, so it could just be that it’s a lightly loaded connection - time will tell on that one. The bottom line is that it seems to be quicker than the IPsec connection ever was, so hopefully we’re on to a winner. We’ve also got a few staff testing it out, and their responses have been positive so far.

The next step after this testing period is to look at the costs of doing this properly with our own equipment. One of the key things we’ve been doing recently is increasing the redundancy of our systems, so it’d be fairly daft to do this with just one switch. We’d need at least two, with every cluster node connected to both, and every client that we want optimum performance on connected to both. Obviously there’ll be other clients that are less important and they can continue to use the existing infrastructure.
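As a rough illustration of the "connected to both" requirement, here is a small Python sketch that flags any host not cabled to every switch. The host and switch names are hypothetical, since nothing has actually been bought or wired yet.

```python
def hosts_missing_a_switch(links, switches):
    """Return hosts that are not connected to every switch.

    `links` maps a host name to the set of switches it is cabled to.
    """
    return sorted(host for host, cabled in links.items() if not switches <= cabled)


# Hypothetical names, purely for illustration.
switches = {"sw-a", "sw-b"}
links = {
    "node1": {"sw-a", "sw-b"},
    "node2": {"sw-a", "sw-b"},
    "client1": {"sw-a"},  # only one uplink, so it depends entirely on sw-a
}

print(hosts_missing_a_switch(links, switches))
```

Any host this reports would lose its fast path to the filestore if the one switch it relies on failed.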

Of course, I’ve got absolutely no idea where we’ll put these switches, or how we’ll wire them in - things are pretty tight in our racks at the moment. Suppose there’s got to be a challenge somewhere :-)

My only worry with all this is what we’ll do if it doesn’t work. I don’t have any other ideas that’d make it go quicker - to be frank, you can’t really get any quicker than a directly connected switch. Let’s hope we don’t have to worry about it.

The end of an era, or two

Tuesday, June 6th, 2006 in Work

This week we’ve finally seen the end of some things I’ve been trying to sort out for some time now.

  • The old storage arrays (Sun T3s and A1000s) are finally gone. The T3 arrays in particular have caused us endless grief over the past few years, so I’m more than happy to see them go. It also marks the end of a year long project to centralise our filestore on our resilient cluster. No more losing access to our files when one machine goes down :-)
  • Our last Solaris 8 machine has been decommissioned. We’d stopped supporting it a while ago, but this finally puts the nail in the coffin. More importantly it means I can focus on moving towards Solaris 10, which I hadn’t done until now because I didn’t want to be running 3 different versions of Solaris at once!
  • We’ve finally removed the last non-rackmountable machine from the racks. Actually, it wasn’t even in our racks, so it means we’re now entirely self-contained within our own area. This is something I’ve been trying to do for many years.

So I’m now spending some time looking at Solaris 10 and trying to see how we can integrate the new “features” in to our existing systems. The main problem areas seem to be the service management stuff. I’ll undoubtedly post more about that in the future.

NFS+IPsec Performance

Friday, May 12th, 2006 in Computing, Work

We’ve recently moved to having our filestore NFS exported from a cluster. This provides almost complete resilience from hardware failures, and moves us away from depending on individual end-user systems with locally attached filestore.

Given the inherent insecurities with NFS we opted to use IPsec authentication (but not encryption) between the hosts involved. The NFS server only accepts connections from a list of hosts, and we know those hosts are who they say they are by relying on the IPsec authentication. We’ve also made it use privileged ports to ensure local users don’t try any spoofing :-)
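The access policy described above can be sketched in a few lines of Python. This is only an illustration of the rule (known host, privileged source port), not how the NFS server implements it. The addresses are made up, and the sketch simply assumes the source address can be trusted, which is the part IPsec authentication provides in practice.

```python
import ipaddress

# Hypothetical allow-list; the real host list is not given here.
ALLOWED_HOSTS = {
    ipaddress.ip_address("192.0.2.10"),
    ipaddress.ip_address("192.0.2.11"),
}

PRIVILEGED_PORT_LIMIT = 1024  # ports below this need root to bind on Unix


def accept_request(src_ip, src_port, allowed=ALLOWED_HOSTS):
    """Accept only requests from listed hosts that come from a privileged port."""
    from_known_host = ipaddress.ip_address(src_ip) in allowed
    from_privileged_port = 0 < src_port < PRIVILEGED_PORT_LIMIT
    return from_known_host and from_privileged_port
```

For example, `accept_request("192.0.2.10", 1001)` is accepted, while the same host sending from port 40000 (an ordinary user process) or an unlisted host on any port is refused.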

The trade-off here appears to be latency. I’ve done some completely unscientific tests that involved shovelling UDP data at a fixed rate between two machines. These are the “jitter” figures they produced:

  • 0.10ms - direct
  • 0.30ms - via router
  • 0.70ms - via router with IPsec

Bear in mind that those figures might not bear any relation to the latencies involved with NFS packets, but it should give an idea of the relative delays added by routing and IPsec.
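To make the relative costs explicit, here is a quick back-of-the-envelope sketch in Python. It uses only the three figures above, converted to microseconds, and says nothing about how the measurements themselves were taken.

```python
# Jitter figures from the list above, in microseconds (integers avoid float rounding).
jitter_us = {
    "direct": 100,
    "via router": 300,
    "via router with IPsec": 700,
}


def added_delay(figures, base, path):
    """Return how much extra jitter `path` shows compared with `base`."""
    return figures[path] - figures[base]


routing_cost = added_delay(jitter_us, "direct", "via router")
ipsec_cost = added_delay(jitter_us, "via router", "via router with IPsec")
total_ratio = jitter_us["via router with IPsec"] / jitter_us["direct"]

print(f"routing adds {routing_cost} us")
print(f"IPsec adds a further {ipsec_cost} us")
print(f"routed IPsec path shows {total_ratio:.0f}x the direct jitter")
```

On these numbers IPsec accounts for roughly twice as much added jitter as the routing does, which is why it is the first thing worth removing.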

We could, to some extent, reduce those figures by replacing hardware. Quicker routers would undoubtedly remove some of the routing latency, and quicker machines could perform the IPsec calculations faster. But this probably isn’t the cheapest solution.

The first test I want to try is adding a private network between the NFS server and NFS client, with no routing involved. Seeing as it’s private we can reasonably trust that people won’t be able to spoof packets on that network, so we can remove the IPsec authentication. In theory, these differences could significantly reduce the latencies involved.

We’ll continue to monitor this for a while first, though. We need to keep an eye on loading on the NFS server, network usage, and so on. But, at the moment, it seems likely the problems are in the network part of the NFS communication process.

Erm, whoops?

Friday, May 12th, 2006 in Work

I’d finally finished migrating everything off the old myrtle disk arrays, so I was feeling quite pleased. I’d just unplugged the last array from myrtle and plugged it in to the test machine for wiping. Then I tried to log in to the machine room SunRay, but strangely it didn’t work.

I checked the console logs for myrtle and was surprised to see it counting “12%… 13%… 14%”. I glanced up and saw my colleagues attempting to come in to the machine room and tell me something, but for some reason they were unable to open the door. Scrolling back over the console logs I saw what it was up to:

panic[cpu2]/thread=2a100105d40: md: Panic due to lack of DiskSuite state database replicas. Fewer than 50% of the total were available, so panic to ensure data integrity.

That made immediate sense to me, and I gave myself a bit of a kick. The RAID system we use for internal disks, DiskSuite (actually Volume Manager now, but it seems they haven’t updated this error message), has state databases stored on every disk. On myrtle we had 6 - two on the internal disks, and one on each of the four disk arrays. You need at least 50% for things to work.

A week or so ago I removed the first pair of arrays without any problems. At that point we had 4 out of 6 databases. Today I removed the last 2 giving us only 2 remaining, which is less than 50%, and the machine dutifully panicked.
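The arithmetic is simple enough to sketch in Python. This follows the wording of the panic message ("Fewer than 50% of the total were available") and the counts described above. It is an illustration only, not the actual logic in the md driver.

```python
def replicas_ok(available, total):
    """True while at least half of the state database replicas are reachable."""
    return available * 2 >= total


TOTAL = 6  # 2 on the internal disks + 1 on each of the 4 arrays

# Replicas still reachable after each step of the migration.
steps = [
    ("all arrays attached", 6),
    ("first pair of arrays removed", 4),
    ("last pair of arrays removed", 2),
]

for label, available in steps:
    state = "ok" if replicas_ok(available, TOTAL) else "PANIC"
    print(f"{label}: {available}/{TOTAL} replicas -> {state}")
```

Under this model the total matters as much as the available count: 2 reachable out of a configured 6 fails the check, whereas 2 out of 2 would pass.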

Fixing it was made tricky by the fact that it could no longer mount the root filesystem because the RAID wouldn’t start. Thankfully the arrays were still to hand, so I just plugged them back in. After booting I removed the databases from the arrays, and added an additional one on each of the internal disks - this gives us 4 in total, 2 on each disk, which is what we normally do.

I also used the handy opportunity to mount the new filestore directly on /home and /proj, rather than using symlinks.

I’ll end this post with a bit of a rant. I can understand why the system won’t boot with less than 50% of the state databases - it has no way of knowing if they represent the correct state of things. But, what I don’t understand is why it needs to panic the system when it has less than 50%. It knows the remaining ones are valid because they’re currently in use. In fact panicking just makes it harder for the sysadmin to deal with the problem. Or am I missing something?

A new web site

Thursday, May 11th, 2006 in Work

The Computer Science department got its new website (more of a re-skin, actually) yesterday, so I decided it was about time to update my staff page.

I’ve brought it in-line with the new look of the main website, although it doesn’t completely follow the standard templates. It’s also XHTML 1.1 valid, which makes the pedantic side of me feel much better.

There’s not much else to say - go take a look (unless you’re already there), and let me know what you think.

Strange kerberos problems

Tuesday, May 9th, 2006 in Work

A few days ago one of our users reported that they couldn’t change their password. The error coming out of the passwd command was confusing in itself - it said ‘bad old password’, or similar, which turns out to be a bug in our wrapper script.

After some investigation we discovered that neither kadmin nor kpasswd worked:

tdb [~] % kadmin -p tdb/admin
Enter Password:
kadmin: Operation failed for unspecified reason while
initializing kadmin interface
tdb [~] % kpasswd
kpasswd: Changing password for tdb.
Old password:
kpasswd: Cannot establish a session with the Kerberos
administrative server for realm CS.UKC.AC.UK.
Operation failed for unspecified reason.

The completely unhelpful bit there is the “failed for unspecified reason” error message. How are you meant to even begin debugging that? After a couple of hours digging I logged the call with Sun.

It turns out that there is a known bug:

Document ID:6410919
Title:Patch 112908-24 will cause the kadmin -p kws/admin to exit with a error message

The solution presented was to remove patch 112908-24. This time I’m willing to do that, but from past experience I’d like to see them actually fix the problem rather than just back it out. Or, at the very least, remove the patch from cluster patches. Otherwise in 6 months’ time I’m left staring at the same problem.

What I’ve found most interesting in all this is that it took the best part of a month for anyone to notice passwords couldn’t be changed :-)

The end of the T3 saga

Friday, May 5th, 2006 in Work

So after copying everyone off the limping T3 arrays I arranged for a Sun engineer to return to site to fix it properly. Sun Dispatch had a bit of a moan because I’d had the parts for too long, but they realised it’d make most sense to keep the parts on site rather than collect them and then send them back to me :-)

After replacing a loop card in the secondary unit the problems mostly went away. I guess this makes sense since it was the second unit we were having trouble connecting to. The engineer also replaced the primary controller which fixed the remaining minor problems. Finally we had a working array.

I’ve now wiped the disks, updated the firmware (not really sure why I did that!), and cleared the NVRAM. They’re all ready to pass on to someone else now.

So, just the failing cluster node and broken KDC to deal with… :-(

The T3 lives?

Thursday, April 27th, 2006 in Work

After yesterday’s saga I was looking forward to an easier day today, but I didn’t get it.

At the end of my last post I was trying to disable the primary controller in the array. It took a while, but it didn’t help. However, after some more discussion with Paul at Sun we noticed a lot of errors for the primary loop. I disabled that and the errors instantly stopped. Success!

So I then had a couple of options:

  1. Let the on-site engineer come today and try replacing parts, which would cause more downtime for myrtle.
  2. Leave the array as it is until after the bank holiday weekend and hope it keeps working.
  3. Start the planned migration to new filestore immediately, and fix the hardware later.

I decided that the third option was best, and cancelled the engineer that was arranged for today.

Today didn’t start so well though. At 7:15 Sun Dispatch phoned to confirm the ETA for the engineer. I explained it should have been cancelled, which was fine except that the parts had already been shipped. At 7:20 I got a call from a couple of DHL drivers who were sitting in the Maths car park at work. I was at home, only just awake. Fortunately they agreed to bring the parts to my house instead.

Arriving in the office this morning I noticed a failure on one of our cluster nodes. It looks like hardware, so that should be easy enough to fix. Thankfully there are three other nodes that fairly seamlessly took over the workload of this node. This won’t affect the migration of filestore to the cluster.

I’m now in the process of copying people off the old arrays. It’s going quite slowly - maybe the array isn’t running at full capacity, I’m not sure. Once this is done we can take the arrays off myrtle and let Sun fix them.

A T3 goes bang

Wednesday, April 26th, 2006 in Work

We have a fairly long standing hatred of the Sun T3 storage arrays, and last night they once again proved why we feel that way.

At around 7pm last night I noticed a lot of SCSI errors on myrtle (our staff and research Solaris server) which I quickly tracked down to a problem with one of the attached T3 arrays. I was rather surprised to see what I found in the T3 logs:

W: u1d6 SCSI Disk Error Occurred (path = 0x0)
W: Sense Key = 0xb, Asc = 0x47, Ascq = 0x0
W: Sense Data Description = SCSI Parity Error
W: Valid Information = 0x2049a82
...
N: u1ctr ISP2200[0] Received LIP(f7,e8) async event

And pages and pages of the above and other fairly obscure looking messages. It seemed every single disk had a failure on it, which was quite unlikely. I tried to power cycle the array but it refused to shut down.

Thankfully this machine has a gold level support contract with Sun, so I phoned up their “UK Mission Critical Solution Centre” for some assistance. We didn’t really achieve too much other than sending logs back and forth, and prodding a few things. Eventually, seemingly by itself, the array decided that it would disable one of the disks and then everything seemed to go quiet. It was gone 10pm by this point, so I was quite relieved by the spontaneous fix.

It had tried to rebuild on to the hot spare, but that had failed too. So we were left with a slightly creaky, but working, RAID 5 array with no redundancy at all. I mounted the file system up and scheduled a full backup overnight, and surprisingly by the morning it was still working. We still had disk errors though, but only for one disk which was now disabled:

N: u1d6 sid 111 stype 2024 disk error 3

Later today a Sun engineer arrived to replace both of the disks that had shown errors (one of which was the hot spare). With both replaced, the rebuilds started with a lot of error messages. We decided it was best to power everything down and kick the rebuilds off again.

The array went round in a loop a few times: sync to spare, sync back, sync to spare, sync back. Eventually it stopped, and I reconnected it to the host system, which of course didn’t detect it. Time for another reboot :-)

And, much to my annoyance, that didn’t work. It seems the LUNs are fine when unmounted, but as soon as the OS gets at them we get problems. Back on the phone with Sun and they’ve agreed to send new parts for just about everything, but that’ll mean another 12 hours or so without home directories on myrtle (for half the users).

I’m trying one last thing, though - disabling the primary controller. It probably won’t work, but it’s worth a try.

Did I mention I hate T3s?

What a weekend

Sunday, April 16th, 2006 in Work

On Thursday I arrived back from Cornwall after a fairly lengthy drive, and to get back in to the swing of things I dived right in to the deep end at work.

This weekend there was a complete power shutdown on the campus for some “essential electrical work”. This required us to shut down all our machines, wait a few hours, and then start them all up again. Doesn’t sound too hard, does it?

That’s what I thought anyway. So, to spice things up a bit I figured I’d patch Solaris on all our servers, patch the OBP firmware on all the Sun kit, and update our Veritas cluster with a maintenance pack. My logic behind doing this is that all these things require downtime, and the cluster in particular would be quite disruptive. So what better time to do it than when everything is already down? Doing it the usual way would require downtime on Tuesday mornings for the next month.

On Friday I began patching all the machines I could safely reboot without impacting any of our users. This would have been made easier if our console servers were working, but a quick drive to work fixed that one. This was closely followed by another drive to work to turn the keylock on the servers so I could update the OBP firmwares. At the end of the day I was left with just a few core machines to patch.

Saturday morning started early, some time around 6.30am. I stumbled over to my desk at home to be presented with a dead X session - it looks like Xorg crashed during the night. After about 15-20 minutes of faffing I had everything back running again, and all the relevant tools opened up. I started patching the remaining machines, and the last few OBP firmwares. After a quick shower I popped up to work for around 8am. The power was scheduled to go off at 9am, so I had an hour to make sure everything was shut down. No real problems there, and finished with time to spare. Then I twiddled my thumbs for a further 30 minutes waiting for the power to actually go off.

It’s remarkably eerie to be in the machine room in the dark with most things off. The silence is only broken by the beeps from dying UPSs and the relatively quiet whirring from the core networking equipment (which has an impressively large UPS). Anyway, I digress…

Then follows the boring bit - waiting for the power to come back on. I decided to go in to town, do some shopping, then head off to Sainsburys for the weekly shop. By the time I’d done all this, and watched a bit of TV, it was 2pm and time to head back to work.

The power was scheduled to come back on by 5pm at the latest, but handily it came on earlier. Sometime around 3pm the lights came back on and the air conditioning kicked off with a massive roar. We waited for a further 30 minutes to get the all clear from maintenance before starting to power things on though. I used this time to move a machine, repatch some cables, and get the networking back online.

Earlier I mentioned I also wanted to patch the cluster. I decided to do this as the first thing after bringing the power back online. I’d already arranged for the cluster to not fully start up, so all I needed to do was bring the relevant machines back online and kick off the patch installer. Patching went fine, albeit taking a while; I found a Snickers bar was a good way to fill this time. Next I had to start the cluster up, which wasn’t so easy. After getting it running I couldn’t start any services - it kept returning something similar to the following:

Service group has not been fully probed on the system

It took a fair amount of head scratching, and a bit of googling, to realise that I needed to take a copy of the latest types.cf and put it in my VCS config directory. Did I miss that in the upgrade documentation, or was it just not there? Either way, after doing that the cluster started up without any further problems.

Next I powered up all the remaining systems in order, which took a while. I did have a couple of problems though:

  1. One of the mirror service machines panicked on boot. Trying a different kernel fixed that, but the config really should have been right in the first place.
  2. Our web server has a failing system disk. It’s mirrored, so it’s not a big deal, but the disk keeps limping rather than failing - the result being a pretty slow machine.

At around 7pm I’d got everything going again, so I headed to the office to quickly check my email. When my email client refused to load I was too exhausted to care, so headed home for dinner (and Dr Who).

An hour or so later I went back to try and figure out what was going on. Handily a colleague had also seen the problem and queried whether lockd was running. That made sense; I assume my mail client couldn’t lock its mailbox on the NFS server. A quick check revealed it wasn’t running on the cluster NFS server. I haven’t investigated why, yet, but I hope it’s something I’ve done wrong rather than yet another bug.
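A quick way to check for this sort of problem from a client is to try taking a lock on a file that lives on the NFS mount. The Python sketch below does that with a standard POSIX lock. It is not what the mail client does internally, and the function name is made up for illustration. Depending on the client, a missing lockd may make the call hang rather than fail cleanly, so treat it as a rough probe only.

```python
import fcntl
import os


def can_lock(path):
    """Try to take an exclusive, non-blocking POSIX lock on `path`.

    Returns True if the lock was granted, False if the OS refused it.
    The file is created if it does not already exist.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        return False
    else:
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)
```

Calling `can_lock()` with a scratch file in a home directory on the NFS filestore, and comparing the result with a file on local disk, should show whether locking over NFS is the part that is broken.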

By the time that was sorted, and I’d read all my email, the only remaining thing to do was log the disk fault with Sun. I was surprised to have an engineer get back to me so late on a weekend, but I guess that’s the advantage of Gold support. I eventually gave up conversing by email at around midnight and went to bed.

I awoke later than usual on Sunday thinking everything was fine. I wandered over to my computer, which hadn’t crashed this time, and was rather annoyed at what I saw; a whole bunch of error messages about not being able to contact work servers. A load of things went through my head:

  • Has myrtle crashed? hard to tell, can’t get at the console.
  • Has the power gone off again? can’t get at any of our machines, but I can get to a service one, so seems not.
  • Has the networking died? can get to service equipment and can ping at least one of our routers.

So back up to work, again, to see what’s going on. I concluded it was a problem with the service router that our router is connected to (it’s happened before), so I pulled the cable out. After a short pause the failover link from our second router to the second service router kicked in, and most stuff started ticking again. There are still routing problems, though, which means amongst other things that we’re only getting some emails.

Back at home I waded through a whole mass of emails generated by the network outage, and found a few from Sun. They’ve also concluded the disk in the webserver is dead, so hopefully we’ll get that replaced on Tuesday.

That leaves me with a few things to sort out:

  1. Investigate why the cluster didn’t start lockd.
  2. Find out what happened to the networking.
  3. Get the disk changed in the webserver.

But they can certainly wait until I’m back at work.