I’d finally finished migrating everything off the old myrtle disk arrays, so I was feeling quite pleased. I’d just unplugged the last array from myrtle and plugged it in to the test machine for wiping. Then I tried to log in to the machine room SunRay, but strangely it didn’t work.
I checked the console logs for myrtle and was surprised to see it counting “12%… 13%… 14%”. I glanced up and saw my colleagues attempting to come in to the machine room and tell me something, but for some reason were unable to open the door. Scrolling back over the console logs I saw what it was up to:
panic[cpu2]/thread=2a100105d40: md: Panic due to lack of DiskSuite state database replicas. Fewer than 50% of the total were available, so panic to ensure data integrity.
That made immediate sense to me, and I gave myself a bit of a kick. The RAID system we use for internal disks, DiskSuite (actually Volume Manager now, but it seems they haven’t updated this error message), has state databases stored on every disk. On myrtle we had 6 – two on the internal disks, and one on each of the four disk arrays. You need at least 50% for things to work.
A week or so ago I removed the first pair of arrays without any problems. At that point we had 4 out of 6 databases. Today I removed the last 2 giving us only 2 remaining, which is less than 50%, and the machine dutifully paniced itself.
Fixing it was made tricky by the fact that it could no longer mount the root filesystem because the RAID wouldn’t start. Thankfully the arrays were still to hand, so I just plugged them back in. After booting I removed the databases from the arrays, and added an additional one on each of the internal disks – this gives us 4 in total, 2 on each disk, which is what we normally do.
I also used the handy opportunity to mount the new filestore directly on /home and /proj, rather than using symlinks.
I’ll end this post with a bit of a rant. I can understand why the system won’t boot with less than 50% of the state databases – it has no way of knowing if they represent the correct state of things. But, what I don’t understand is why it needs to panic the system when it has less than 50%. It knows the remaining ones are valid because they’re currently in use. In fact panicing just makes it harder for the sysadmin to deal with the problem. Or am I missing something?
Related posts:
- A T3 goes bang We have a fairly long standing hatred of the Sun T3 storage arrays, and last night they once again proved why we feel that way. At around 7pm last night I noticed a lot of SCSI errors on myrtle (our staff and research Solaris server) which I quickly tracked down to a problem with one of the attached [...]...
- Bad things come in fives. Thursday 22 February. That’s the day it all went wrong. I was on my way home from a shopping trip at Sainsburys. We’d been on a Thursday instead of a Friday because we had to go to Cornwall on the Friday afternoon for a funeral (that’s bad thing number one). Just after leaving Sainsburys we came across [...]...
- I don’t have a good history with FreeBSD RAID… I’ve never got on well with software RAID systems on FreeBSD. I’ve tried gvinum (previously I used vinum), gmirror, and ataraid, all with varying degrees of success. The latest machine I built is using gmirror, and so far I’m happy. However, over the past few days I’ve been having problems with a system I built a [...]...
- “Any idea WTF is going on?” “Any idea WTF is going on?” is what I read on my phone as I stumbled out of bed this morning. It was from one of my colleagues who, for some reason I can’t understand, seems to like getting in to work at a ridiculous hour in the morning. Still half asleep I plodded through to [...]...
- The end of an era, or two This week we’ve finally seen the end of some things I’ve been trying to sort out for some time now. The old storage arrays (Sun T3s and A1000s) are finally gone. The T3 arrays in particular have caused us endless grief over the past few years, so I’m more than happy to see them go. It [...]...