Archive for the ‘Work’ Category

The T3 lives?

Thursday, April 27th, 2006 in Work

After yesterday’s saga I was looking forward to an easier day today, but I didn’t get it.

At the end of my last post I was trying to disable the primary controller in the array. It took a while, but it didn’t help. However, after some more discussion with Paul at Sun we noticed a lot of errors for the primary loop. I disabled that and the errors instantly stopped. Success!

So I then had a couple of options:

  1. Let the on-site engineer come today and try replacing parts, which would cause more downtime for myrtle.
  2. Leave the array as it is until after the bank holiday weekend and hope it keeps working.
  3. Start the planned migration to new filestore immediately, and fix the hardware later.

I decided that the third option was best, and cancelled the engineer that was arranged for today.

Today didn’t start so well though. At 7:15 Sun Dispatch phoned to confirm the ETA for the engineer. I explained it should have been cancelled, which was fine except that the parts had already been shipped. At 7:20 I got a call from a couple of DHL drivers who were sitting in the Maths car park at work. I was at home, only just awake. Fortunately they agreed to bring the parts to my house instead.

Arriving in the office this morning I noticed a failure on one of our cluster nodes. It looks like hardware, so that should be easy enough to fix. Thankfully there are three other nodes that fairly seamlessly took over the workload of this node. This won’t affect the migration of filestore to the cluster.

I’m now in the process of copying people off the old arrays. It’s going quite slowly – maybe the array isn’t running at full capacity, I’m not sure. Once this is done we can take the arrays off myrtle and let Sun fix them.

A T3 goes bang

Wednesday, April 26th, 2006 in Work

We have a fairly long standing hatred of the Sun T3 storage arrays, and last night they once again proved why we feel that way.

At around 7pm last night I noticed a lot of SCSI errors on myrtle (our staff and research Solaris server) which I quickly tracked down to a problem with one of the attached T3 arrays. I was rather surprised to see what I found in the T3 logs:

W: u1d6 SCSI Disk Error Occurred (path = 0x0)
W: Sense Key = 0xb, Asc = 0x47, Ascq = 0x0
W: Sense Data Description = SCSI Parity Error
W: Valid Information = 0x2049a82
...
N: u1ctr ISP2200[0] Received LIP(f7,e8) async event

And pages and pages of the above and other fairly obscure looking messages. It seemed every single disk had a failure on it, which was quite unlikely. I tried to power cycle the array but it refused to shut down.
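For anyone wanting to sift through pages of these, here is a minimal sketch of decoding the sense lines. The sense key names and the 0x47/0x00 code are the standard SCSI ones; the function name, the regular expression, and the idea of scripting this at all are illustrative assumptions, not something taken from the T3 tooling.

```python
import re

# Standard SCSI sense key names.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0x8: "BLANK CHECK",
    0x9: "VENDOR SPECIFIC",
    0xA: "COPY ABORTED",
    0xB: "ABORTED COMMAND",
    0xD: "VOLUME OVERFLOW",
    0xE: "MISCOMPARE",
}

# Only the additional sense code pair seen in the log above is listed;
# a real table would be much longer.
ASC_ASCQ = {
    (0x47, 0x00): "SCSI PARITY ERROR",
}

# Matches the "Sense Key = ..., Asc = ..., Ascq = ..." line format shown above.
SENSE_LINE = re.compile(
    r"Sense Key = (0x[0-9a-fA-F]+), Asc = (0x[0-9a-fA-F]+), Ascq = (0x[0-9a-fA-F]+)"
)


def decode_sense(line):
    """Return (sense key name, ASC/ASCQ description), or None for other lines."""
    match = SENSE_LINE.search(line)
    if match is None:
        return None
    key, asc, ascq = (int(value, 16) for value in match.groups())
    return (
        SENSE_KEYS.get(key, "UNKNOWN"),
        ASC_ASCQ.get((asc, ascq), "not in table"),
    )


print(decode_sense("W: Sense Key = 0xb, Asc = 0x47, Ascq = 0x0"))
```

Sense key 0xb is an aborted command, which fits a parity problem on the bus or loop rather than genuinely bad media on every disk.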

Thankfully this machine has a gold level support contract with Sun, so I phoned up their “UK Mission Critical Solution Centre” for some assistance. We didn’t really achieve too much other than sending logs back and forth, and prodding a few things. Eventually, seemingly by itself, the array decided that it would disable one of the disks and then everything seemed to go quiet. It was gone 10pm by this point, so I was quite relieved by the spontaneous fix.

It had tried to rebuild on to the hot spare, but that had failed too. So we were left with a slightly creaky, but working, RAID 5 array with no redundancy at all. I mounted the file system up and scheduled a full backup overnight, and surprisingly by the morning it was still working. We still had disk errors though, but only for one disk which was now disabled:

N: u1d6 sid 111 stype 2024 disk error 3

Later today a Sun engineer arrived to replace both of the disks that had shown errors (one of which was the hot spare). With both replaced, the rebuilds started with a lot of error messages. We decided it was best to power everything down and kick the rebuilds off again.

The array went round in a loop a few times: sync to spare, sync back, sync to spare, sync back. Eventually it stopped, and I reconnected it to the host system, which of course didn’t detect it. Time for another reboot :-)

And, much to my annoyance, that didn’t work. It seems the LUNs are fine when unmounted, but as soon as the OS gets at them we get problems. Back on the phone with Sun and they’ve agreed to send new parts for just about everything, but that’ll mean another 12 hours or so without home directories on myrtle (for half the users).

I’m trying one last thing, though – disabling the primary controller. It probably won’t work, but it’s worth a try.

Did I mention I hate T3s?

What a weekend

Sunday, April 16th, 2006 in Work

On Thursday I arrived back from Cornwall after a fairly lengthy drive, and to get back in to the swing of things I dived right in to the deep end at work.

This weekend there was a complete power shutdown on the campus for some “essential electrical work”. This required us to shut down all our machines, wait a few hours, and then start them all up again. Doesn’t sound too hard, does it?

That’s what I thought anyway. So, to spice things up a bit I figured I’d patch Solaris on all our servers, patch the OBP firmware on all the Sun kit, and update our Veritas cluster with a maintenance pack. My logic behind doing this is that all these things require downtime, and the cluster in particular would be quite disruptive. So what better time to do it than when everything is already down? Doing it the usual way would require downtime on Tuesday mornings for the next month.

On Friday I began patching all the machines I could safely reboot without impacting any of our users. This would have been made easier if our console servers were working, but a quick drive to work fixed that one. This was closely followed by another drive to work to turn the keylock on the servers so I could update the OBP firmwares. At the end of the day I was left with just a few core machines to patch.

Saturday morning started early, some time around 6.30am. I stumbled over to my desk at home to be presented with a dead X session – it looks like Xorg crashed during the night. After about 15-20 minutes of faffing I had everything back running again, and all the relevant tools opened up. I started patching the remaining machines, and the last few OBP firmwares. After a quick shower I popped up to work for around 8am. The power was scheduled to go off at 9am, so I had an hour to make sure everything was shut down. No real problems there, and finished with time to spare. Then I twiddled my thumbs for a further 30 minutes waiting for the power to actually go off.

It’s remarkably eerie to be in the machine room in the dark with most things off. The silence is only broken by the beeps from dying UPSs and the relatively quiet whirring from the core networking equipment (which has an impressively large UPS). Anyway, I digress…

Then follows the boring bit – waiting for the power to come back on. I decided to go in to town, do some shopping, then head off to Sainsburys for the weekly shop. By the time I’d done all this, and watched a bit of TV, it was 2pm and time to head back to work.

The power was scheduled to come back on by 5pm at the latest, but handily it came on earlier. Sometime around 3pm the lights came back on and the air conditioning kicked off with a massive roar. We waited for a further 30 minutes to get the all clear from maintenance before starting to power things on though. I used this time to move a machine, repatch some cables, and get the networking back online.

Earlier I mentioned I also wanted to patch the cluster. I decided to do this as the first thing after bringing the power back online. I’d already arranged for the cluster to not fully start up, so all I needed to do was bring the relevant machines back online and kick off the patch installer. Patching went fine, albeit taking a while; I found a Snickers bar was a good way to fill this time. Next I had to start the cluster up, which wasn’t so easy. After getting it running I couldn’t start any services – it kept returning something similar to the following:

Service group has not been fully probed on the system

It took a fair amount of head scratching, and a bit of googling, to realise that I needed to take a copy of the latest types.cf and put it in my VCS config directory. Did I miss that in the upgrade documentation, or was it just not there? Either way, after doing that the cluster started up without any further problems.

Next I powered up all the remaining systems in order, which took a while. I did have a couple of problems though:

  1. One of the mirror service machines panicked on boot. Trying a different kernel fixed that, but the config really should have been right in the first place.
  2. Our web server has a failing system disk. It’s mirrored, so it’s not a big deal, but the disk keeps limping rather than failing – the result being a pretty slow machine.

At around 7pm I’d got everything going again, so I headed to the office to quickly check my email. When my email client refused to load I was too exhausted to care, so headed home for dinner (and Dr Who).

An hour or so later I went back to try and figure out what was going on. Handily a colleague had also seen the problem and queried whether lockd was running. That made sense; I assume my mail client couldn’t lock its mailbox on the NFS server. A quick check revealed it wasn’t running on the cluster NFS server. I haven’t investigated why, yet, but I hope it’s something I’ve done wrong rather than yet another bug.

By the time that was sorted, and I’d read all my email, the only remaining thing to do was log the disk fault with Sun. I was surprised to have an engineer get back to me so late on a weekend, but I guess that’s the advantage of Gold support. I eventually gave up conversing by email at around midnight and went to bed.

I awoke later than usual on Sunday thinking everything was fine. I wandered over to my computer, which hadn’t crashed this time, and was rather annoyed at what I saw; a whole bunch of error messages about not being able to contact work servers. A load of things went through my head:

  • Has myrtle crashed? hard to tell, can’t get at the console.
  • Has the power gone off again? can’t get at any of our machines, but I can get to a service one, so seems not.
  • Has the networking died? can get to service equipment and can ping at least one of our routers.

So back up to work, again, to see what’s going on. I concluded it was a problem with the service router that our router is connected to (it’s happened before), so I pulled the cable out. After a short pause the failover link from our second router to the second service router kicked in, and most stuff started ticking again. There are still routing problems, though, which means amongst other things that we’re only getting some emails.

Back at home I waded through a whole mass of emails generated by the network outage, and found a few from Sun. They’ve also concluded the disk in the webserver is dead, so hopefully we’ll get that replaced on Tuesday.

That leaves me with a few things to sort out:

  1. Investigate why the cluster didn’t start lockd.
  2. Find out what happened to the networking.
  3. Get the disk changed in the webserver.

But they can certainly wait until I’m back at work.

Escaping for a while

Sunday, April 2nd, 2006 in General, Work

In an attempt to allow my mind to rest from work-related matters I’m heading off to Cornwall for a couple of weeks.

I’ll be back for the Easter weekend when I’ll be shutting down all our systems at work for a campus-wide power shutdown.

Upgrading Debian

Tuesday, March 28th, 2006 in Computing, Work

If you’ve been following my blog you’ll know that I’ve been working on a new filestore project at work for a while now. After getting things working nicely on our Solaris machines, and finally moving my home directory over, I decided to tackle our Debian server. It quickly became apparent that I’d need to upgrade the machine, which was running Woody with a 2.4 kernel, to get to a decent IPsec and autofs setup.

Now, I’m not a Linux user, let alone a Debian one. So this was a new experience for me. After a quick nose around online, and with a few helpful pointers, I found some useful instructions on how to upgrade. It boils down to a fairly simple process:

  1. Make sure the system is running the latest Woody updates.
  2. Modify apt sources.list file to change woody to sarge.
  3. Run apt-get update.
  4. Install/update aptitude.
  5. Run aptitude -f --with-recommends dist-upgrade to do the full upgrade.
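Step 2 is the only one that involves editing a file by hand, so here is a minimal sketch of it as a text substitution. The function name and the mirror URLs in the example are made up for illustration; the sketch only touches active deb and deb-src lines and leaves comments alone.

```python
def switch_release(sources_text, old="woody", new="sarge"):
    """Rewrite the release name on active deb/deb-src lines only."""
    rewritten = []
    for line in sources_text.splitlines():
        fields = line.split()
        # Active entries look like: deb <uri> <suite> <components...>
        if len(fields) >= 3 and fields[0] in ("deb", "deb-src"):
            # The suite can carry a suffix, e.g. woody/updates for security.
            fields[2] = fields[2].replace(old, new)
            line = " ".join(fields)
        rewritten.append(line)
    return "\n".join(rewritten) + "\n"


# Placeholder mirrors, not the ones on the real server.
example = """\
# main archive
deb http://mirror.example.org/debian woody main contrib
deb http://security.example.org/ woody/updates main
"""
print(switch_release(example), end="")
```

The rewritten text would then be saved back as sources.list before running apt-get update in step 3.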

Then it’s just a case of fixing up any conflicting files and changes, and you’re done. I had to remove our backup software (lgtoclnt) and re-add it though, because it messed with the X packages.

I decided at this point to make sure Sarge worked before looking at the kernel. So I rebooted the system. I waited. And I waited some more. The console showed that it had gone through the BIOS and RAID POST, but nothing else. A brief trip back to the machine room showed a scary looking “LI” message, which I knew meant lilo wasn’t working.

At this point I consulted some friends who explained what I needed to do. A short while later, and with a freshly burnt boot CD, I had the system back up and running. To reinstall lilo I’d booted the CD up to the point where it loaded the aacraid drivers, switched to another terminal, mounted my root partition, chrooted, and run lilo.

By this point I’m starting to grumble about Linux/Debian being stupid. But, I move on. I discover that I’m also going to need to upgrade to 2.6 if I’m going to get IPsec support. After a short while of looking at rebuilding kernels, and boggling at the myriad of build options available, I decide to apt-get install kernel-image-2.6. That can’t be too hard, can it? A few moments later I’m left staring at an Oops message referring to a “kernel NULL pointer dereference” which appears to have come from the install running dd.

Nasty. Anyway, to cut a long story short I tweaked the postinst script to stop it running dd, and that allowed me to get the kernel installed. Surprisingly it worked first time, but I did have to fix the modules list afterwards to silence some error messages.

A few hours later, and after discovering the difference between autofs4 and the Solaris automounter, I now have a working system. But I’m left wondering why I’d really want to be using Debian at all.

Now what? It’s too scary to use…

Saturday, March 25th, 2006 in Work

It’s been months in the making, but it’s finally done. We have our new filestore ready to go. There’s still plenty to do, like rolling it out for the teaching machines and web filestore, but at least we’ve got the main part done.

So why has it taken so long? I spent a long time researching and testing the technologies involved. For example, choosing the file system was tricky. UFS doesn’t work well on large (>2TB) file systems, and VxFS doesn’t work with NFS and Quotas. I managed to solve that one by fixing the quota issue with VxFS. There was also the issue of how we back up this quantity of filestore, and working out how we’d make it available from the cluster to the user machines. In the end we opted for a single filesystem split in to chunks on the server side for backups and used the automounter to make these divisions transparent to the end users.

The other time consuming factor was the software development stage. We have automated systems for creating users on machines, so I needed to integrate this with the new filestore. This required writing code to facilitate the creation of directories, setting up of quotas, and automount map building.
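As a rough idea of what the automount map building part can look like, here is a minimal sketch that writes one indirect map entry per user. It is not the real code: the server name, the export path, the chunk names and the usernames are all placeholders, and the only thing taken as given is the usual "key location" layout of an indirect automounter map.

```python
def build_home_map(users, server="resfs.example", export_root="/export/home"):
    """Return indirect automount map text, one 'key location' line per user.

    users maps a username to the (hypothetical) server-side chunk that
    holds that user's home directory.
    """
    lines = []
    for username, chunk in sorted(users.items()):
        # Key is the directory name under the mount point; the location is
        # server:/path of the directory that gets mounted there.
        lines.append(f"{username}\t{server}:{export_root}/{chunk}/{username}")
    return "\n".join(lines) + "\n"


# Made-up users and chunk names, purely for illustration.
users = {"alice": "chunk01", "bob": "chunk02"}
print(build_home_map(users), end="")
```

Because the chunk only appears on the right-hand side of each entry, users see a flat set of home directories regardless of how the server side is split for backups.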

Anyway, I’ve written about this before. So now it’s done, what do we do next? The logical step is to test it on myself and/or the rest of the systems group. Personally I’m in favour of testing it on everyone else first, but that doesn’t seem fair :-)

The question is, am I brave enough to actually use it?

Why I absolutely hate spam

Tuesday, March 21st, 2006 in Computing, Work

If there’s one thing that drives me completely insane in the modern world of computing it’s spam. It consumes my time, day after day, and devours the resources of our mail systems. In my own mailbox I get a few hundred spam messages a day, most of which I’ll never even see, let alone read. Thankfully most of these are filtered, but there’s still at least 20+ which I have to manually deal with every morning.

At work the mail systems for the Computer Science department are processing around 20,000 incoming email messages every day. A remarkable 61% of these are spam, which is quite an increase from 49% a year ago. We run two mail hubs to process the incoming email which means we’ve effectively had to buy and run one server just for processing the spam email. I don’t even want to start on the amount of time spent dealing with spam messages that make it through to our helpdesk systems.
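The arithmetic behind that claim is quick to check. This is only a back-of-envelope sketch using the figures above, and it assumes a spam message costs about as much to process as a legitimate one, which is a simplification.

```python
daily_messages = 20_000   # incoming messages per day (approximate)
spam_fraction = 0.61      # share of incoming mail that is spam
mail_hubs = 2             # hubs sharing the incoming load

spam_per_day = round(daily_messages * spam_fraction)
legit_per_day = daily_messages - spam_per_day

# Assumption: processing cost per message is the same for spam and non-spam,
# so the share of hub capacity spent on spam equals the spam fraction.
hubs_for_spam = mail_hubs * spam_fraction

print(f"spam per day: {spam_per_day}")
print(f"legitimate per day: {legit_per_day}")
print(f"hubs' worth of work spent on spam: {hubs_for_spam:.2f}")
```

That works out at roughly 12,200 spam messages a day and a little over one hub’s worth of processing.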

Ever noticed how spam email comes from rather an eclectic selection of email addresses? Has one of those addresses ever been yours? If there’s one type of email even more annoying than spam it’s bounces generated as a result of spam, sometimes thousands of them. You’ve suddenly become an unwilling victim of spam. Your address abused, and maybe even your name tarnished. What gives spammers the right to do this? At least SPF and similar technologies go some way to preventing this.

And as if spam email wasn’t enough we now see it creeping in to many other Internet based systems. How long until there’s a spam comment on this weblog? Or a stack of spam referrer entries in my apache logs (and consequently my statistics)? Or until I receive the next random message on one of my messenger services?

Whilst I’m ranting, another thing I can’t stand are those pages of junk links that appear when you try and google for something, particularly if it’s a fairly common term. Thankfully Google is trying to deal with that, but it’ll be a never-ending battle.

It seems in the non-Internet world we can easily regulate junk messages. We used to get a fair amount of sales telephone calls and general junk mail through the front door. Within weeks of registering with the Mail Preference Service and the Telephone Preference Service these have completely stopped. I’m not naive enough to believe this could be done with the Internet, but it helps put things in to perspective.

One of these days I’m going to get sick of the battle and just say “screw ‘em all” and unplug my ADSL modem. After all, people keep telling me I should try reading more books.

Impending doom (for our filesystems, anyway)

Friday, March 17th, 2006 in Work

Over the past year or so the space usage on our research and web filesystems has pretty much doubled to the point where we’re dangerously close to running out of space. There’s currently about 1TiB of filestore available of which less than 10% remains unused.

Teaching filestore, however, has barely grown at all during the last year. I attribute this primarily to quota control, but also to the regular turnover of undergraduate students.

Fortunately we saw this problem arising quite a while ago, so we’ve had time to purchase new storage and infrastructure that should alleviate this problem and make it easier for us to expand the storage availability in the future.

Our new system consists of a pair of Sun StorEdge 3511 arrays attached by fibre channel to our existing Veritas cluster. We’ll use VxFS for the filesystems, which could lead to some interesting new technologies like filesystem checkpointing; we could have a mount point of /yesterday to allow users to retrieve their files as they were at some point during the previous day, thereby reducing the need for us to do tape restores. VxFS also works quite happily with large filesystems, unlike Solaris UFS. The only problem we’ve found is that VxFS doesn’t support hard linking directories, but that’s not something we commonly, if ever, want to do. We also initially had problems integrating VxFS with the Solaris quota system over NFS, but we soon fixed that the “fun” way :-)

Currently the research and teaching servers have locally attached filestore, which means if we have a hardware failure in one of the main servers we’re unable to get at user filestore from any other systems (without moving cables). The new solution provides NFS mounts of the filestore directly to each of the servers, which will allow files to be accessed via secondary machines should one of the main servers die. This is all part of our long term plan to increase the resilience of our systems.

One other interesting point to note is the use of the Solaris automounter to individually mount user home directories. Soon there’ll be mounts a bit like this all over the place:

resfs.cs:/home/cur/tdb 1.5T 54G 1.4T 4% /home/cur/tdb

Which will make things much more interesting!
