Archive for 2006

Erm, whoops?

Friday, May 12th, 2006 in Work

I’d finally finished migrating everything off the old myrtle disk arrays, so I was feeling quite pleased. I’d just unplugged the last array from myrtle and plugged it in to the test machine for wiping. Then I tried to log in to the machine room SunRay, but strangely it didn’t work.

I checked the console logs for myrtle and was surprised to see it counting “12%… 13%… 14%”. I glanced up and saw my colleagues attempting to come in to the machine room and tell me something, but for some reason they were unable to open the door. Scrolling back over the console logs I saw what it was up to:

panic[cpu2]/thread=2a100105d40: md: Panic due to lack of DiskSuite state database replicas. Fewer than 50% of the total were available, so panic to ensure data integrity.

That made immediate sense to me, and I gave myself a bit of a kick. The RAID system we use for internal disks, DiskSuite (actually Volume Manager now, but it seems they haven’t updated this error message), has state databases stored on every disk. On myrtle we had 6 - two on the internal disks, and one on each of the four disk arrays. You need at least 50% for things to work.

A week or so ago I removed the first pair of arrays without any problems. At that point we had 4 out of 6 databases. Today I removed the last 2, giving us only 2 remaining, which is less than 50%, and the machine dutifully panicked itself.
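To make the arithmetic concrete, here is a minimal Python sketch of the threshold as the panic message words it ("fewer than 50% of the total"). The function name and the timeline labels are made up for illustration; this is not how DiskSuite itself is implemented, just the same sum written out.

```python
def replicas_below_quorum(available: int, total: int) -> bool:
    """Return True when fewer than 50% of the configured replicas are reachable.

    Mirrors the wording of the panic message only; the name is hypothetical.
    """
    return 2 * available < total


TOTAL_REPLICAS = 6  # two on the internal disks, one on each of the four arrays

# (description, replicas still reachable) for each stage of the migration
timeline = [
    ("all four arrays attached", 6),
    ("first pair of arrays removed", 4),
    ("last pair of arrays removed", 2),
]

for label, available in timeline:
    state = "panic" if replicas_below_quorum(available, TOTAL_REPLICAS) else "ok"
    print(f"{label}: {available}/{TOTAL_REPLICAS} replicas -> {state}")
```

Only the last step drops below half, which matches what happened: the first removal was harmless and the second one took the machine down.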

Fixing it was made tricky by the fact that it could no longer mount the root filesystem because the RAID wouldn’t start. Thankfully the arrays were still to hand, so I just plugged them back in. After booting I removed the databases from the arrays, and added an additional one on each of the internal disks - this gives us 4 in total, 2 on each disk, which is what we normally do.
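As a rough check on the new layout, the sketch below asks what happens if either internal disk dies, using the same "fewer than 50%" rule from the panic message. The disk names and the helper function are hypothetical, and it only models the running-system threshold, not what is needed to boot.

```python
def survives_loss_of(layout: dict[str, int], lost: str) -> bool:
    """Return True if losing one device still leaves at least 50% of replicas.

    `layout` maps a (hypothetical) device name to the number of state
    database replicas stored on it.
    """
    total = sum(layout.values())
    remaining = total - layout[lost]
    return 2 * remaining >= total


# The layout after the fix: 4 replicas in total, 2 on each internal disk.
layout = {"internal_disk_0": 2, "internal_disk_1": 2}

for disk in layout:
    verdict = "stays up" if survives_loss_of(layout, disk) else "would panic"
    print(f"lose {disk}: {verdict}")
```

Losing one disk leaves 2 of 4 replicas, which is exactly 50% and so not below the threshold quoted in the panic message.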

I also used the handy opportunity to mount the new filestore directly on /home and /proj, rather than using symlinks.

I’ll end this post with a bit of a rant. I can understand why the system won’t boot with less than 50% of the state databases - it has no way of knowing if they represent the correct state of things. But what I don’t understand is why it needs to panic the system when it has less than 50%. It knows the remaining ones are valid because they’re currently in use. In fact panicking just makes it harder for the sysadmin to deal with the problem. Or am I missing something?

A new web site

Thursday, May 11th, 2006 in Work

The Computer Science department got its new website (more of a re-skin, actually) yesterday, so I decided it was about time to update my staff page.

I’ve brought it in line with the new look of the main website, although it doesn’t completely follow the standard templates. It’s also valid XHTML 1.1, which makes the pedantic side of me feel much better.

There’s not much else to say - go take a look (unless you’re already there), and let me know what you think.

Strange Kerberos problems

Tuesday, May 9th, 2006 in Work

A few days ago one of our users reported that they couldn’t change their password. The error coming out of the passwd command was confusing in itself - it said ‘bad old password’, or similar, which turns out to be a bug in our wrapper script.

After some investigation we discovered that neither kadmin nor kpasswd worked:

tdb [~] % kadmin -p tdb/admin
Enter Password:
kadmin: Operation failed for unspecified reason while
initializing kadmin interface
tdb [~] % kpasswd
kpasswd: Changing password for tdb.
Old password:
kpasswd: Cannot establish a session with the Kerberos
administrative server for realm CS.UKC.AC.UK.
Operation failed for unspecified reason.

The completely unhelpful bit there is the “failed for unspecified reason” error message. How are you meant to even begin debugging that? After a couple of hours digging I logged the call with Sun.

It turns out that there is a known bug:

Document ID:6410919
Title:Patch 112908-24 will cause the kadmin -p kws/admin to exit with a error message

The solution presented was to remove patch 112908-24. This time I’m willing to do that, but from past experience I’d like to see them actually fix the problem rather than just back it out. Or, at the very least, remove the patch from the cluster patches. Otherwise in 6 months’ time I’m left staring at the same problem.

What I’ve found most interesting in all this is that it took the best part of a month for anyone to notice passwords couldn’t be changed :-)

Car Washing

Sunday, May 7th, 2006 in General

I’ve been using the Flash Car Wash to wash my car for the past year. It’s a fantastic device, and has really good results. Not only does it leave my car shining without blotchy water marks, but it removes the need to use a bucket at all.

There’s only one flaw with it though - it requires a hosepipe. Here in Kent we’ve had a hosepipe ban since last summer (which I was unaware of until recently), so using the hose is tricky. I could have pleaded ignorance, but they sent a letter not long ago. I also live on a main road, so it’s hard to do it sneakily :-)

So yesterday I’m sitting inside wondering how I’m going to rinse the car off before and after washing it with a bucket of soapy water. It was a horrible day too, absolutely pouring with rain. Then it occurred to me that the rain was probably the answer - it wouldn’t contain lime, would it? I’d also only just had a shower, so my hair was already wet.

Out I went to an already soaking wet car and started cleaning it. Sure enough, the rain washed the soap off nicely, albeit making me rather wet in the process.

Today I went out to check the results and it was pretty impressive - no horrible marks to be seen anywhere… apart from the bits of grass left from me strimming the front lawn :-)

Ever increasing petrol prices

Friday, May 5th, 2006 in General

I’ve done a bit of driving back and forth across the country recently, and I’ve seen quite a few places go over the £1 per litre mark. Admittedly they were the more expensive places anyway such as motorway service areas.

Fortunately I’m not a heavy user of petrol, but it still frustrates me that something like 70% of the cost is government tax. Frankly it’s an outrageous amount to charge.

So I was quite pleased to see this article written by the Fools over at The Motley Fool. It’s well worth a read.

The most interesting bit for me was the bit about saving money buying petrol - the rest of the article is mostly common sense. I headed over to petrolprices.com and discovered that the best price for unleaded in my area is 94.9p. After registering it turns out the Esso garage I generally use is charging this price. That’s a good start!

The next interesting thing was the Pipeline Card, which appears to be a loyalty card that intends to get its members discounts on fuel prices. It’ll be very interesting to see if that takes off.

Save money on fuel with a Free Pipeline Card

There’s a couple of useful tips there, but I guess the best solution is to vote for whoever (if anyone!) says they’ll reduce fuel tax in the next election :-)

The end of the T3 saga

Friday, May 5th, 2006 in Work

So after copying everyone off the limping T3 arrays I arranged for a Sun engineer to return to site to fix it properly. Sun Dispatch had a bit of a moan because I’d had the parts for too long, but they realised it’d make most sense to keep the parts on site rather than collect them and then send them back to me :-)

After replacing a loop card in the secondary unit the problems mostly went away. I guess this makes sense since it was the second unit we were having trouble connecting to. The engineer also replaced the primary controller which fixed the remaining minor problems. Finally we had a working array.

I’ve now wiped the disks, updated the firmware (not really sure why I did that!), and cleared the NVRAM. They’re all ready to pass on to someone else now.

So, just the failing cluster node and broken KDC to deal with… :-(

Pocket GPS World charge for speed camera data!

Thursday, April 27th, 2006 in General

I’ve been using the speed camera database at Pocket GPS World with my satnav system for a while now. With the necessary tools installed on my PC it would automatically pull the latest data and install it on my PDA. Fantastic system, and no hassle at all to keep updated.

When doing an update today I thought to check the Pocket GPS World website. To my amazement they’re now charging for the database. It’s not that expensive in reality, but I don’t like services that start off free to get everyone hooked and then start charging.

So I’ve forked out a massive 2 quid to get the latest database, but whether I’ll keep doing that I don’t know.

The T3 lives?

Thursday, April 27th, 2006 in Work

After yesterday’s saga I was looking forward to an easier day today, but I didn’t get it.

At the end of my last post I was trying to disable the primary controller in the array. It took a while, but it didn’t help. However, after some more discussion with Paul at Sun we noticed a lot of errors for the primary loop. I disabled that and the errors instantly stopped. Success!

So I then had a couple of options:

  1. Let the on-site engineer come today and try replacing parts, which would cause more downtime for myrtle.
  2. Leave the array as it is until after the bank holiday weekend and hope it keeps working.
  3. Start the planned migration to new filestore immediately, and fix the hardware later.

I decided that the third option was best, and cancelled the engineer that was arranged for today.

Today didn’t start so well though. At 7:15 Sun Dispatch phoned to confirm the ETA for the engineer. I explained it should have been cancelled, which was fine except that the parts had already been shipped. At 7:20 I get a call from a couple of DHL drivers who are sitting in the Maths car park at work. I was at home, only just awake. Fortunately they agreed to bring the parts to my house instead.

Arriving in the office this morning I noticed a failure on one of our cluster nodes. It looks like hardware, so that should be easy enough to fix. Thankfully there are three other nodes that fairly seamlessly took over the workload of this node. This won’t affect the migration of filestore to the cluster.

I’m now in the process of copying people off the old arrays. It’s going quite slowly - maybe the array isn’t running at full capacity, I’m not sure. Once this is done we can take the arrays off myrtle and let Sun fix them.

A T3 goes bang

Wednesday, April 26th, 2006 in Work

We have a fairly long-standing hatred of the Sun T3 storage arrays, and last night they once again proved why we feel that way.

At around 7pm last night I noticed a lot of SCSI errors on myrtle (our staff and research Solaris server) which I quickly tracked down to a problem with one of the attached T3 arrays. I was rather surprised to see what I found in the T3 logs:

W: u1d6 SCSI Disk Error Occurred (path = 0x0)
W: Sense Key = 0xb, Asc = 0x47, Ascq = 0x0
W: Sense Data Description = SCSI Parity Error
W: Valid Information = 0x2049a82
...
N: u1ctr ISP2200[0] Received LIP(f7,e8) async event

And pages and pages of the above and other fairly obscure looking messages. It seemed every single disk had a failure on it, which was quite unlikely. I tried to power cycle the array but it refused to shut down.

Thankfully this machine has a gold level support contract with Sun, so I phoned up their “UK Mission Critical Solution Centre” for some assistance. We didn’t really achieve too much other than sending logs back and forth, and prodding a few things. Eventually, seemingly by itself, the array decided that it would disable one of the disks and then everything seemed to go quiet. It was gone 10pm by this point, so I was quite relieved by the spontaneous fix.

It had tried to rebuild on to the hot spare, but that had failed too. So we were left with a slightly creaky, but working, RAID 5 array with no redundancy at all. I mounted the file system up and scheduled a full backup overnight, and surprisingly by the morning it was still working. We still had disk errors though, but only for one disk which was now disabled:

N: u1d6 sid 111 stype 2024 disk error 3

Later today a Sun engineer arrived to replace both of the disks that had shown errors (one of which was the hot spare). With both replaced, the rebuilds started with a lot of error messages. We decided it was best to power everything down and kick the rebuilds off again.

The array went round in a loop a few times: sync to spare, sync back, sync to spare, sync back. Eventually it stopped, and I reconnected it to the host system, which of course didn’t detect it. Time for another reboot :-)

And, much to my annoyance, that didn’t work. It seems the LUNs are fine when unmounted, but as soon as the OS gets at them we get problems. Back on the phone with Sun and they’ve agreed to send new parts for just about everything, but that’ll mean another 12 hours or so without home directories on myrtle (for half the users).

I’m trying one last thing, though - disabling the primary controller. It probably won’t work, but it’s worth a try.

Did I mention I hate T3s?

Car service

Wednesday, April 26th, 2006 in General

Yesterday my car went in for its first yearly service at Invicta Motors in Canterbury. All in all not a bad experience; they gave me a discount for being a valued customer, cleaned the car for me (inside and out!), and had it waiting by the door when I got there. They even fixed the mudflap clip I managed to knock off :-)

My only complaint really is about communication. I tried to book the service online (via the Ford website - I hadn’t found the Summit one then), but it seems that vanished into the ether - I ended up booking again over the phone. They also said they’d phone me when the car was ready, but they didn’t.

Anyway, that’s all made better by the free pen they gave me ;-)