“I’ll build a new server; it’s got to be easier than patching up the old one…”

A few weeks back I started having problems with my file server at home. This machine is fairly important to us; it holds all our photos, music and other files. For years I’ve been bodging it together with various old parts scavenged from other machines and some new parts when needed. But, once again, it’d started to break. Disks were dropping out of the RAID unexpectedly, and the replacements were refusing to rebuild. Unsure of where the problem was, I uttered the fateful words “I’ll build a new server; it’s got to be easier than patching up the old one…”. My colleagues were sceptical, but I ploughed on anyway. Maybe I should have listened to them?

It took the best part of a week to work out what I wanted. There were so many decisions to make: which RAID card, disks, motherboard, CPU, RAM, case, etc. I researched each one as much as I could, but there’s a bottomless pit of information on the Internet. Eventually I settled on a 3ware 9690SA RAID card with 4 Seagate ST31000340NS disks. The other bits were fairly decent to make sure the machine would have a good life, but not excessive.

The reason for choosing a hardware RAID solution over software RAID was simple – reliability. Now, I’m not knocking software RAID in principle (look at ZFS, for example), but the implementations for RAID 3-5 on FreeBSD aren’t great (yes, it has ZFS, but I’m not in the mood for trailblazing this time round). I wanted to stick with FreeBSD so I opted for the well-known reliability that 3ware cards provide. And the 5-year warranty on Seagate disks made them an attractive choice.

The purchasing process wasn’t as simple as it could have been. I ordered from dabs.com, span.com (they specialise in storage stuff) and overclockers.co.uk. I’ve used all three companies before, so I wasn’t too concerned about problems. The bulk of it was ordered from Dabs – it looks like they’re back to being competitive on prices. The problems started almost immediately; Dabs held my order over an issue with my address. It’s happened once before and that put me off Dabs for some time, but we use them all the time at work, so I had hopes they’d be better now. It took a working day to resolve that issue… and then the next day I got an email to say my credit card company had declined the order. I got on the phone to them and through to their security department; it seems buying lots of stuff online is unusual… not for me it isn’t. Anyway, that was resolved and then I had more waiting while Dabs tried the transaction again. Eventually I got impatient, tried their online chat, and the matter was resolved in minutes. Meanwhile the parts ordered from the other two suppliers were sitting on my desk.

Eventually it all arrived and I took it home. Ruth wasn’t overly impressed when I cleared off the dining room table and covered it in computer parts, but I assured her it wasn’t for long. That was a couple of weeks ago – it’s all still there.

I spent a weekend putting things together and testing it all out. I routed every cable neatly and tied them carefully to the case to ensure nothing moved about. Airflow was good and the additional fans in the case were doing a great job of keeping things cool (not sure about their blue LEDs though…). All was looking good and I was enjoying the process.

Then I tried to use the RAID card. The first problems hit when I turned on the motherboard’s RAID, which I’d intended to use to mirror the system disks, whilst the 9690SA was plugged in. I’d gone for an Asus P5E3 and expected both RAID systems to work happily together, but sadly I was wrong. I experienced unusual problems such as the machine hanging on the Intel Matrix Storage (the onboard RAID) screen and disks randomly disappearing from both arrays. In the end I gave up and turned off the onboard RAID; I figured the FreeBSD RAID 1 (gmirror) is pretty solid, so I’d use that.
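
For the curious, gmirror really is only a handful of commands. Here’s a rough sketch of the sort of thing involved – the ad4/ad6 device names are just placeholders for the two system disks, and on a disk that already carries data you’d label one disk and then insert the other, rather than labelling both in one go:

# Load the mirror module and create the mirror from the first system disk,
# then add the second disk to it (device names are examples only).
gmirror load
gmirror label -v -b round-robin gm0 /dev/ad4
gmirror insert gm0 /dev/ad6

# Make sure the module loads at boot, then point /etc/fstab at
# /dev/mirror/gm0s1a (and friends) instead of /dev/ad4s1a.
echo 'geom_mirror_load="YES"' >> /boot/loader.conf

# Watch the rebuild and check the mirror is complete.
gmirror status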

Thinking I’d got over the worst of the problems I moved on to setting up the 9690SA. Things looked good for a while; the interface was clear and everything was easy to set up. It wasn’t until I started trying to put data on it that I noticed problems. Here’s a snippet from the error log (largely for the benefit of Google):

E=0200 T=08:26:00 : Cable CRC error
SATA Device. port = 0x0
task file written out : cd dh ch cl sn sc ft
                      : 00 70 00 00 00 1200 00
  task file read back : st dh ch cl sn sc er
                      : 00 00 00 00 00 8441 00
E=0200 T=08:26:00 P=0h: Soft reset drive
E=0200 T=08:26:00 P=0h: exitCode = 1013
Port retry not allowed
E=0200 T=08:26:00 P=0h: Prepare for command retry
exitCode = 1013

At first I wasn’t sure what to make of this. Maybe it was the cable or connection, but on all four drives? It was a special 4-in-1 (SFF8087) cable, but it still seemed odd. I logged the case with 3ware’s technical support and got back a response suggesting I try another cable. Well, duh, I could have figured that myself. I was hoping they might be able to point out any other less obvious potential causes.
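
Incidentally, if you want to see whether errors like these are reaching the drives themselves, smartmontools can talk to drives behind a 3ware card. Something along these lines should show each drive’s own UDMA CRC error counter – note that the /dev/twa0 device node and the port numbers are assumptions based on my FreeBSD box with four drives on the card:

# Check the SMART UDMA CRC error counter on each drive behind the card.
# /dev/twa0 is the 3ware 9000-series control device on FreeBSD; ports 0-3
# are the four drives (adjust to suit).
for port in 0 1 2 3; do
  echo "=== port $port ==="
  smartctl -a -d 3ware,$port /dev/twa0 | grep -i UDMA_CRC
done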

So, I purchased another cable. It took a couple of days to arrive and did absolutely nothing to resolve the problem. Sigh. At the same time as this was going on I had another problem – it’s only with hindsight that I know to separate the two:

E=0204 T=18:34:36     : Port timeout (ext)
SATA Device. port = 0x2
task file written out : cd dh ch cl sn sc ft
                      : 00 04 00 00 00 00 00
Send AEN (code, time): 0x9, 06/10/2008 18:34:36
Drive timeout detected
(EC:0x09, SK=0x04, ASC=0x00, ASCQ=0x00, SEV=01, Type=0x71)
phy=6
  task file read back : st dh ch cl sn sc er
                      : 00 00 00 00 00 00 00
E=0204 T=18:34:36 P=2h: Soft reset drive
E=0204 T=18:34:36 P=2 : Inserting Set UDMA command
E=0204 T=18:34:36 P=2h: Check power cycles, initial=40, current=40
E=0204 T=18:34:36 P=2h: exitCode = 1013
Port retry not allowed
E=0204 T=18:34:36 P=2h: Prepare for command retry
exitCode = 1013
E=0204 T=18:34:36 U=0 : Retrying command

These errors happened less frequently, but eventually caused I/O to hang and the controller to reset. Again I logged this with 3ware’s technical support and got back a bunch of not-so-helpful responses. They suggested moving the card in the machine, testing the disks, checking the power supply, and so on. All valid points, but what annoyed me was that they would only ask me to check one thing at a time… and they could only reply to me once a day. Plus I’d already done everything they suggested. It took a week to go through this nonsense.
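
If you’re trying to do the same detective work, the tw_cli utility that comes with the card’s management tools is handy for pulling the controller’s view of things out of the OS rather than rebooting into its BIOS each time. Roughly – and I’m going from memory on the exact sub-commands, so check the tw_cli manual:

# Overall controller, unit and port status.
tw_cli /c0 show

# The AEN messages the controller has logged (the same sort of events
# that show up in the error logs above).
tw_cli /c0 show alarms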

In the meantime I spent a lot of time experimenting, fiddling, and web searching. Eventually I found the following two pages, although it took me a while to realise their significance:

https://www.3ware.com/3warekb/article.aspx?id=15385
https://www.3ware.com/3warekb/article.aspx?id=15171

The first of the articles explicitly mentions my controller card and drives, so it seemed to be the right thing to do. But I had the SN04 firmware on my drives and they wanted me to apply AN05. I asked both 3ware and Seagate to clarify the differences, but neither gave satisfactory answers. Seagate managed to give me the SN05 firmware to try, but it didn’t help. In slight desperation, and without anyone giving me much help, I decided to take a punt on the AN05 firmware.

IT WORKED!

There was a lot of tension for the next few hours whilst I continued testing, but eventually I was satisfied that the AN05 firmware solved the problem. Later attempts to clarify with Seagate why SN05, which they gave me, didn’t work and AN05, which 3ware pointed me at, did work, got nowhere. Seagate support actually admitted that they basically don’t know.
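
For anyone checking their own drives: you can see what firmware they’re reporting without pulling them off the controller, again via smartmontools (with the same caveat as before that the device node and port number are assumptions for my FreeBSD setup):

# Show the model, serial number and firmware version of the drive on
# port 0, as seen through the 3ware controller.
smartctl -i -d 3ware,0 /dev/twa0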

So on to the next issue. The second article suggested limiting the speed of the drives to work around the drive timeout issue. It’s definitely a workaround, but it was worth a shot. I’d already removed the jumpers from the drives that limited them to 1.5 Gb/s, and they were a nightmare to get at – I’ve never seen such small and fiddly jumpers on a disk… it was completely unnecessary given the available space. This time I decided to do the limiting in the 9690SA’s software.

ONCE AGAIN, IT WORKED!

So at this point I’m happy. Things are looking good. That last fix is definitely a workaround, and I’ve told 3ware they need to fix it. It’s a bug, and bugs need fixing. I’m now using the array to store my data on, it’s nice and quick (a 512MB write cache helps!), and I have plenty of space. And Ruth might get the dining room table back soon… assuming I can work out how to lift this massive machine (did I mention the case was quite big?).
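
If you want to put a rough number on “nice and quick”, a simple sequential dd run is enough for a sanity check. The path and size below are just examples – point it at whatever filesystem lives on the array, and make the file big enough that the write cache can’t hide the real speed:

# Rough sequential write test: 8GB of zeros onto the array.
dd if=/dev/zero of=/data/ddtest bs=1m count=8192

# And read it back again.
dd if=/data/ddtest of=/dev/null bs=1m

# Tidy up.
rm /data/ddtest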

But I’d like to finish this post with a rant. It turned out that the solutions to my problems were both in the 3ware knowledge base. Now maybe I should have searched harder initially, but it took me some time to find these articles. But more to the point, 3ware support should definitely have known about these issues and should have directed me straight to them. I wasted a week of my time messing around with them, and I’m not happy about it. The card is great (apart from the aforementioned bug), but the support sucks. It will seriously make me think twice about going with 3ware again.

I hope this post will fill in the whole story for those I’ve been ranting at recently, and maybe it’ll help someone else on the Internet out if/when they hit the same problem. That’s assuming they can read this lengthy post in less time than it takes to figure out the solution themselves ;-).

Good night.

16 Comments

  1. With the machine shut down I moved the drives to another machine to update the firmware. This way the RAID controller doesn’t notice them disappear. I suppose you could pull the RAID controller out instead, and then connect each drive to the motherboard for the updates.

    No nice solution I’m afraid!

  2. Sinan

    Hi,

    Kinda old topic but still interesting.

    How did you update the firmware of the drives? As the drives are behind the RAID controller the firmware updater will not recognize the drives, right?

    I am running Linux.

    Your help is appreciated. Thanks!

  3. Ricardo

    Hi Tim Bishop, thanks for the answer.

    Well, I just see the difference on benchmarks, especially in random jobs (HD Tune). The only major difference that I see in real life is a Crysis installation (maybe it’s the mixed IO where 3ware claims the best performance in intelligent mode).
    With Basic read: 02:10m installation time
    With Intelligent: 01:56m

    Although, in everyday Windows usage, sometimes the lower latency of Intelligent mode is noticeable

  4. Ricardo

    Good night to everybody.
    I’m using a 3ware 9690SA-8i too, with 2x Corsair F120 SSDs in RAID 0. It’s running fine but not at full performance, because this controller is not made for SSDs. It also doesn’t run very well with a single OCZ Vertex Turbo, but OK, it doesn’t matter, I’m not worried about this.
    Just for the Corsair F120s, Tim Bishop, what settings would you recommend for a Windows user on the 3ware 9690? I’m using BBU + StorSave profile in balanced mode (that makes a big difference to smoothness and performance), but I can’t decide between Basic read cache and Intelligent… it’s very difficult to see the difference between the two…
    Thanks for everything,
    Regards from Ricardo, Portugal

  5. Hi Terje,

    I don’t get any timeouts doing that dd. Have you limited the drive speeds to 1.5Gbps? That fixed the timeouts for me.

    Also, have you changed the cables?

    Finally, are you running the latest firmware for the 9690SA? I’m not, but it may or may not help.

    You could try 3ware’s support people. They’re good at getting back to you, but not always so good at being helpful…

    Tim.

  6. Terje Marthinussen

    Hi,

    I bought a 9690SA the other day. While it is “working”, if I put it under some stress I get timeouts no matter what settings, firmware or drivers I try.

    The timeouts only seem to happen on reads (at least it does not log any on writes, which does not really prove anything) and, in 3ware’s defense, they seem to handle timeouts pretty well. They do not drop drives easily (like, for instance, Areca does).

    I have now tried 3 different disk models, and all fail: 2 WDC and one Samsung. I have also tried 2 G.SKILL Titan SSDs, but the 9690SA does not detect them at all. I plan to try some old Seagates just to test.

    The 2 WDCs both run without any fuss on my ARC-1220, and the Samsung runs fine in RAID 1 on the ARC-1220, but it does not like RAID 5 for some reason (the Samsungs will drop from the RAID under a bit of load).

    All of these drives produce timeouts on the 9690SA no matter the RAID type, connection speed, queueing policy or other settings, but RAID 1 seems to generate fewer timeouts.

    Can you run a simple test on your setup and see what happens?

    The test I do is just:
    dd if=/dev/sda1 of=/dev/null bs=1024k count=32000; dd if=/dev/sda1 of=/dev/null bs=1024k count=32000; dd if=/dev/sda1 of=/dev/null bs=1024k count=32000

    and it is almost certain to generate at least a couple of timeouts within running this twice.

    Yes, that is a simple dd that reads 32GB from the raw device. It is repeated 3 times.

    You could probably just read 100GB instead in one dd.

    I am starting to believe that there is something really bogus with this controller; it is just that it takes a fair bit of I/O to trigger it, and when it happens, it handles the timeouts so gracefully that most people never notice unless they actually read the logs.

    This controller clearly has very good timeout handling, but I would rather be without the errors in the first place.

  7. Clemens Rossell

    Thanks so much for blogging about this issue! I found it through a Google search, after going mad trying to figure out why my RAID 5 array keeps going into a degraded state. I’m also using the 3Ware 9690SA-4I card; however, I’m using Seagate 7200.11 drives (very similar to the ES.2 model). They came with SD15 firmware. I’ve contacted 3Ware and they said I need to replace my cables and reseat my card/drives (which obviously is not the issue). I really hope someone at 3Ware is investigating this issue; other Google searches show lots of folks have this problem, and Seagate tells me the SD15 firmware is the latest for the drive model I own. So annoying. Nobody seems to be looking into this issue from either company.

  8. mattias

    I have similar issues with a ST31000340NS drive with SN04 firmware (CRC errors and timeouts). I’ve found a download link for the AN05 version over at Adaptec’s knowledge base, but have been unable to find version SN05.

    Maybe you could mail me the link too?

  9. I’ve had contact from someone else who’s having similar issues with eight 7200.11 drives connected to the same card. Dropping the speed to 1.5Gb/s reduced his timeout problems (under load), but didn’t completely resolve them. He also gets CRC errors but so far hasn’t found a firmware that resolves the issue.

    It looks like this problem might resurface one day 🙁

    Come on 3ware and/or Seagate, pull your fingers out!

  10. kam

    Hello,

    I just found your blog on Google.

    I purchased two ST3500320NS drives recently and they came with SN04 firmware.

    Someone told me I need to update to SN05 from SN04 first, then update from SN05 to AN05.

    However, I only have the AN05 download link; I do not have the SN05 download link.

    Could you please share the Seagate SN05 firmware link with me?

    Or send it to my email?

    Thanks

  11. pao

    If I used the dining table mrspao would kill me.

    Interestingly, when you were speccing the parts and said you were going for the 3Ware card, your reasoning to me was that they actively did FreeBSD drivers and that their support was good.

    Still, the good thing is that it is sorted and working; now you need to get a crane to shift it.
