
Drives dropping out and will not stay connected


eman31

Question

Hi all. Longtime reader, first time poster :-)

 

My server developed a problem last week: it is constantly dropping drives, and I cannot get the pool measured to completion to figure out what is causing it. I've always had the occasional dropped drive, but it was always momentary and corrected itself immediately. This time numerous drives are falling out of the pool.

 

After some research, it seems the most common causes are a failing controller, cables, the power supply, or damaged drives. I'm not getting any SMART errors in Scanner, but the scans also keep getting interrupted by the drops. I don't seem to have lost any data, so I'm thinking it's not a drive issue. I've checked all my cables to make sure that nothing came loose and have ordered some replacements to rule them out. Now I'm looking into the controller and power supply. I've checked my event logs and the only errors I'm seeing are the drives becoming unavailable.

 

I have a question about changing the HBA controller. Right now I have two Supermicro AOC-SASLP-MV8 cards feeding 16 HGST NAS drives (4 to 6TB each), plus another 4 drives on the motherboard. From what I have been reading, dropped drives are common with these cards, so even though I've been running them for a few years without issue, I'm leaning towards them being the main culprit.

 

If I were to swap these out for something along the lines of an LSI 9201-16i, can I just pull the old cards and plug the current drives into the new card without running the risk of losing anything? I only have two PCIe x16 slots on the board, so something is going to have to come out.

 

Any help would be appreciated!

 

A little info on my system

Norco 4220

Corsair HX850 850W PSU

ASRock Z86 Extreme 6

Intel I7 4770 Haswell

16GB DDR3 RAM

2 - Supermicro AOC-SASLP-MV8 controller cards

Samsung 840 500GB SSD for OS

Windows Server 2012 R2 Essentials

20 - 4 to 6TB HGST NAS hard drives for storage: 102TB total with 17.4TB free, about a 50/50 split between duplicated and raw folders

Drive Pool version 2.11.561

Drive Scanner 2.5.2.3103 Beta

 

   

  • 0

Well, StableBit Scanner may help (right-click on the column header and select "by controller").

 

Otherwise, you can use the Device Manager, select "View" and "Devices by connection".  You'll have to "find" the drives, but that should give you a good idea.

 

 

Though I think this was resolved in a ticket, I'm not 100% sure.


  • 0

Well, StableBit Scanner's burst and ping tests may be a good way to stress test the controller and cables. 

 

You're using a Norco case here, so chances are that you have a bad backplane (the part that the hard drives plug into).

 

Norco is known for issues caused by substandard parts. I'd be inclined to blame the backplanes first and foremost, before the cards you have.

 

 

 
That said, assuming you're not using the Supermicro cards for RAID, you should be able to swap out the cards without any issues.
The card you've linked (the LSI one) is an HBA, meaning it's only a controller card with no RAID functionality, so it passes the disks through to the OS directly.
 
And it's a fantastic card, BTW. Additionally, you'd only ever need that one HBA, as you could add SAS expander cards and support up to 80 drives with 4 expanders (check out the Intel SAS expanders).
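For anyone wondering where the 80 comes from, here is a rough sketch of the arithmetic. The assumptions are mine, not from the thread: each of the HBA's connectors carries 4 SAS lanes, and each expander is a 24-port model that spends one 4-lane connector on its uplink to the HBA. Other expanders or wiring would give a different total.

```python
# Assumptions (not stated in the thread): 4 lanes per HBA connector,
# 24-port expanders, one 4-lane uplink from each expander to the HBA.
HBA_PORTS = 16            # LSI 9201-16i: 16 internal ports
LANES_PER_CONNECTOR = 4   # one internal connector = 4 lanes
EXPANDER_PORTS = 24       # assumed 24-port expander


def max_drives(hba_ports=HBA_PORTS,
               expander_ports=EXPANDER_PORTS,
               uplink_lanes=LANES_PER_CONNECTOR):
    """Return (number of expanders, total drive ports behind them)."""
    expanders = hba_ports // uplink_lanes            # one expander per HBA connector
    drives_per_expander = expander_ports - uplink_lanes
    return expanders, expanders * drives_per_expander


expanders, drives = max_drives()
print(f"{expanders} expanders, up to {drives} drives")
```

Under those assumptions each expander leaves 20 ports for drives, which is how 4 expanders reach 80.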

  • 0

Thanks for the reply and the info on the LSI card. I'm using the Supermicro cards as pass-through with no RAID, and I thought I might just be able to switch them.

 

I tried a burst test but the longest I've been able to keep it running so far is 6 minutes and that's not giving me enough data to go on. 

 

I didn't consider the backplane on the Norco. I've been using it since I built my original WHS server back in 2010, so it and the Supermicro cards are pushing past 6 years of use. I'll have to figure out a way to troubleshoot that. I'm usually pretty good at figuring out issues like this, but this one has me a little stumped.


  • 0

So it's definitely erroring out on the burst test? 

 

If so, it should give a specific error, though that may not be too helpful.

 

However, the Event Viewer (run "eventvwr.msc") may indicate what exactly happened at that point in time.

 

 

As for the Norco, I had the same case.  I had some really, really weird issues.  Like certain rails wouldn't work on certain ports.  Or certain ones wouldn't work with the reverse breakout cable I was using, but others had no issue. 

 

Moving things around may be a good way to test things out, but it's a PITA.  

 

 

Also, another thing to check is the on-disk tests. Download SeaTools or WD Data Lifeguard tools and run the drives' built-in self-tests. If a drive fails, then it's actually a disk issue. But if the tests complete properly, then it's a communication issue (controller, cable, backplane).
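The reasoning above boils down to a two-question check. A minimal sketch of it follows; the function name and its two inputs are made up for illustration and are not part of any tool mentioned here.

```python
def likely_fault(self_test_passed: bool, drops_during_transfer: bool) -> str:
    """Rough triage: separate a disk fault from a communication-path fault."""
    if not self_test_passed:
        # The drive's own on-disk test failed, so the drive itself is suspect.
        return "disk"
    if drops_during_transfer:
        # The drive tests clean but still drops out, so look at what
        # sits between the drive and the OS.
        return "communication path (controller, cable, backplane)"
    return "no fault reproduced"


print(likely_fault(self_test_passed=True, drops_during_transfer=True))
```

It does not tell you which part of the path is bad. That still takes swapping cables, ports, or cards one at a time.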


  • 0

I'm trying a burst test again now and it's been running for about 13 hours. In that time, 8 of the 20 disks have dropped out of the pool while testing but I haven't gotten any error messages when they do.

 

Of the 12 that are left, the four on my motherboard are still connected and not showing any errors so far. I need to check whether the other 8 that are still connected are on one controller or spread out between the two. I've been moving things around and lost track of which disks are on which Supermicro controller.
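Once the disk-to-controller mapping is written down again, a quick tally shows whether the drops cluster on one card. This is only a sketch: the layout and the list of dropped disks below are made-up placeholders, not the real assignments from this server.

```python
from collections import Counter


def drops_by_controller(disk_to_controller, dropped):
    """Return {controller: (dropped_count, total_count)}."""
    totals = Counter(disk_to_controller.values())
    drops = Counter(disk_to_controller[d] for d in dropped)
    return {c: (drops[c], totals[c]) for c in totals}


# Hypothetical layout: 4 disks onboard, 8 on each add-in card.
layout = {f"disk{i:02d}": "onboard" for i in range(4)}
layout.update({f"disk{i:02d}": "card_A" for i in range(4, 12)})
layout.update({f"disk{i:02d}": "card_B" for i in range(12, 20)})

# Hypothetical list of the 8 disks that dropped during the burst test.
dropped = ["disk04", "disk05", "disk07", "disk09",
           "disk10", "disk11", "disk13", "disk18"]

for controller, (lost, total) in drops_by_controller(layout, dropped).items():
    print(f"{controller}: {lost} of {total} dropped")
```

If nearly all the drops land on one card or one backplane row, that narrows things down. An even spread points more towards something shared, like power or a batch of cables.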

 

I've tried the event viewer but it's not giving me any good data. It does give an error of event 15 on disk when one drops but that is just telling me that a drive is disconnected.

 

I haven't had much time to dig into the case yet. Being an older one, I'm not sure how the backplanes are attached and if they can be individually changed. 


  • 0

I was finally able to do some serious troubleshooting this weekend, and after changing out all the cables from the backplane to the controllers and motherboard, I'm cautiously optimistic the issue is resolved. I did lose a couple of very small files, but nothing major. I don't know if there was a power surge recently that fried something or if the cables all just started going bad. I went ahead and put a new, beefier UPS on the server to be on the safe side.

 

I appreciate all the tips and guidance!
