A solution? It’s all Sun’s fault.

I’ve made a couple of posts lately about the problems we’ve been having with one of our Sun 3511 disk arrays. Sun got back to me today with what they thought the problem was. Here’s the gory detail (slightly truncated to fit this blog post):


Ch  Id Chassis Vendor/Product ID        Rev  PLD
 2  28 092131  SUN StorEdge 3511F A     0430 1000
 3  28 092131  SUN StorEdge 3511F D     0420 1100*

* indicates SES or PLD firmware mismatch.

This appears in a few places, so I’ve only quoted the one Sun spotted. It boils down to the components in the array having inconsistent firmware revisions. This could very well have caused the crash we saw yesterday.
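For anyone who wants to check for this themselves, here is a rough Python sketch that scans output shaped like the snippet above and picks out the rows marked with an asterisk. It is only a sketch: the column layout is assumed from the truncated excerpt quoted in this post, and the names `parse_rows` and `flagged_channels` are made up for illustration, not part of any Sun tool.

```python
import re

# Sample copied from the excerpt quoted in this post (which was truncated).
SAMPLE = """\
Ch  Id Chassis Vendor/Product ID        Rev  PLD
 2  28 092131  SUN StorEdge 3511F A     0430 1000
 3  28 092131  SUN StorEdge 3511F D     0420 1100*
"""

# Assumed layout: channel, id, chassis, product string, Rev, PLD, optional "*".
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\S+)\s+(.+?)\s+(\S+)\s+(\S+?)(\*?)\s*$"
)


def parse_rows(text):
    """Return one dict per data row; header and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue
        ch, dev_id, chassis, product, rev, pld, star = m.groups()
        rows.append({
            "ch": int(ch),
            "id": int(dev_id),
            "chassis": chassis,
            "product": product,
            "rev": rev,
            "pld": pld,
            "mismatch": star == "*",  # the tool's own mismatch marker
        })
    return rows


def flagged_channels(rows):
    """Channels whose row carries the mismatch asterisk."""
    return [r["ch"] for r in rows if r["mismatch"]]


if __name__ == "__main__":
    rows = parse_rows(SAMPLE)
    print("flagged channels:", flagged_channels(rows))
    print("distinct Rev values:", sorted({r["rev"] for r in rows}))
```

The sketch relies on the asterisk the array prints rather than comparing revision numbers itself, since I don't know which combinations of revisions Sun considers compatible.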

This is something I blame Sun for. For a start, they shipped out a controller with mismatched firmware on it. I guess this sort of thing can happen, but the field engineer should really have spotted the mistake when he was onsite configuring the replacement controller.

Today Sun wanted to send out another engineer to get the firmware updated, and when I came back from shifting some stuff around I had a voicemail from dispatch. It’s good to see them being so proactive at fixing the issue, although I wonder if it’s because they realised it was their fault?

However, being a sysadmin, I figured I could save everyone a lot of time and hassle by doing the upgrade myself. Sun gave me the link and a couple of hours later it was all sorted (and I even managed to shift some furniture around the building in the middle of it). We’ve agreed with Sun to wait until next week to make sure things are working as they should, then we’ll close the case.

In the meantime I have our largest volume resyncing, but some familiar-looking problems are cropping up with the other. Somehow I fear this saga isn’t quite over yet…

