A T3 goes bang

We have a fairly long-standing hatred of the Sun T3 storage arrays, and last night they once again proved why we feel that way.

At around 7pm last night I noticed a lot of SCSI errors on myrtle (our staff and research Solaris server), which I quickly tracked down to a problem with one of the attached T3 arrays. I was rather surprised by what I found in the T3 logs:

W: u1d6 SCSI Disk Error Occurred (path = 0x0)
W: Sense Key = 0xb, Asc = 0x47, Ascq = 0x0
W: Sense Data Description = SCSI Parity Error
W: Valid Information = 0x2049a82
...
N: u1ctr ISP2200[0] Received LIP(f7,e8) async event

There were pages and pages of the above, along with other fairly obscure-looking messages. It seemed every single disk had reported a failure, which was quite unlikely. I tried to power cycle the array but it refused to shut down.
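With that many pages of log output, a quick tally per disk is one way to check whether every disk really was reporting errors. The following is only a rough sketch: it assumes the log lines look exactly like the excerpt quoted above (a `W:` prefix followed by a disk identifier such as `u1d6`), and the function name `count_disk_errors` is made up for illustration. Real T3 syslog output may carry timestamps or other prefixes that would need stripping first.

```python
import re
from collections import Counter

# Assumption: disk identifiers look like "u1d6" (unit 1, disk 6), as in the
# excerpt above. Controller names such as "u1ctr" deliberately do not match.
DISK_ID = re.compile(r"\b(u\d+d\d+)\b")

def count_disk_errors(lines):
    """Count warning-level (W:) lines that name a disk, per disk identifier."""
    counts = Counter()
    for line in lines:
        # Assumption: warnings start with "W:" and notices with "N:".
        if not line.startswith("W:"):
            continue
        match = DISK_ID.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# The lines quoted in the post, used here as sample input.
sample = [
    "W: u1d6 SCSI Disk Error Occurred (path = 0x0)",
    "W: Sense Key = 0xb, Asc = 0x47, Ascq = 0x0",
    "W: Sense Data Description = SCSI Parity Error",
    "W: Valid Information = 0x2049a82",
    "N: u1ctr ISP2200[0] Received LIP(f7,e8) async event",
]

for disk, total in sorted(count_disk_errors(sample).items()):
    print(disk, total)
```

If the tally shows the same error on every disk at once, that points more towards something shared (a controller, loop or backplane) than towards a shelf full of simultaneously dying drives.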

Thankfully this machine has a gold level support contract with Sun, so I phoned up their “UK Mission Critical Solution Centre” for some assistance. We didn’t really achieve too much other than sending logs back and forth, and prodding a few things. Eventually, seemingly by itself, the array decided that it would disable one of the disks and then everything seemed to go quiet. It was gone 10pm by this point, so I was quite relieved by the spontaneous fix.

It had tried to rebuild on to the hot spare, but that had failed too. So we were left with a slightly creaky, but working, RAID 5 array with no redundancy at all. I mounted the file system up and scheduled a full backup overnight, and surprisingly by the morning it was still working. We still had disk errors, but only for one disk, which was now disabled:

N: u1d6 sid 111 stype 2024 disk error 3

Later today a Sun engineer arrived to replace both of the disks that had shown errors (one of which was the hot spare). With both replaced, the rebuilds started, accompanied by a lot of error messages. We decided it was best to power everything down and kick the rebuilds off again.

The array went round in a loop a few times: sync to spare, sync back, sync to spare, sync back. Eventually it stopped, and I reconnected it to the host system, which of course didn’t detect it. Time for another reboot :-)

And, much to my annoyance, that didn’t work. It seems the LUNs are fine when unmounted, but as soon as the OS gets at them we get problems. Back on the phone with Sun and they’ve agreed to send new parts for just about everything, but that’ll mean another 12 hours or so without home directories on myrtle (for half the users).

I’m trying one last thing, though – disabling the primary controller. It probably won’t work, but it’s worth a try.

Did I mention I hate T3s?
