aje112

Christopher (Drashna) · April 24, 2015

I'm going to jump the gun and post my conclusion here: my issues were most likely caused by my add-in controller, even after installing its driver. I haven't yet run the burst tests, but here is a list of events that point to it:

When I first built the pool and relocated some pooled drives onto the new controller, those exact drives began to exhibit file access issues. I didn't mention this when I originally posted because I didn't notice issues immediately after doing so.
Removing those same drives and reading them with my USB HDD dock exhibited no access issues (nor corruption).
When formatting a DIFFERENT set of drives on the suspect controller, the formats would freeze. This was attempted three times.
After flashing the firmware, the formats appear to be progressing normally and are further along. (To be clear, updating the DRIVER seemed to make no difference.)
Currently, my pool is rebuilt from drives using the onboard controller only. There are no I/O issues.

I could have done more to isolate the controller as the point of failure, but at this point, noticing how everything works perfectly fine until it's connected to the Rocket 640L is good enough for me. I guess by shopping for a cheaper card, I paid the price. So far, the fix appears to be to flash the stock firmware. For anyone looking, the Rocket 640L (not to be confused with the RocketRAID 640L) uses the Marvell 88SE9230 controller.

I've started a separate pool for the controller to verify controller stability before I let that controller contribute to my main pool. Also, burst tests will be underway soon.

My thanks to Christopher for his patience and guidance. To show my appreciation, I've tried to be thorough in this thread in hopes that anyone who runs into a similar problem with the same card finds this thread.

- AJ

April 22, 2015

If that's the case, then definitely check the connections and try swapping out the cables.

Also, check your power supply. What's the wattage rating on it, ... and what hardware are you using on it?

I actually replaced all the cables. While I was pulling the drives, a wave of OCD hit and I decided to finally get the drives' cables on some consistent color scheme.

The PSU is a 450W Seasonic that powers the i3, 8 drives, the stock HSF, and 6 fans. Pulls less than 55 watts under load. I read somewhere to allocate about 10W per drive so I'm thinking I've got plenty of leeway.
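The 10 W per drive rule of thumb can be worked through quickly. This is a rough sketch using only the figures in this post. The CPU, fans, and motherboard are left out because no wattage figures for them are given here, so the real headroom is smaller than the number printed.

```python
PSU_WATTS = 450        # Seasonic rating quoted above
DRIVE_COUNT = 8
WATTS_PER_DRIVE = 10   # rule of thumb quoted above, not a measured value

# Budget reserved for the drives alone, and what is left for everything else.
drive_budget = DRIVE_COUNT * WATTS_PER_DRIVE
remaining = PSU_WATTS - drive_budget

print(f"drives: {drive_budget} W, left for everything else: {remaining} W")
```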

As for the firmware, it's always a good idea to check when you're having issues. Just in case. Same with the controller card in fact.

Will do, especially for the HighPoint card. I reckon I'll go ahead and install the drivers for it too.

Once I get the pool up and running, I'll follow up with some comments on the burst test results and rebuilt pool stability before the end of the week. I really appreciate the time and direction you've given me here. I really want the server to be as stable as possible and you've gone well beyond DrivePool to help me think about some things to achieve just that. Thanks!

April 21, 2015

As for the UI crashing, could you check the "Application" section of the Event Viewer (run "eventvwr.msc") and see what caused the crash? This may help identify the issue.

Saw nothing in the application log specific to the UI. It was the first time the UI ever crashed on me so I'm not terribly concerned about it.

While you're there, check the "System" section for disk, controller, or related errors. It may be throwing up errors here.

Tons of warnings:

storahci: Reset to device, \Device\RaidPort1, was issued.
disk: An error was detected on device \Device\Harddisk2\DR2 during a paging operation.
Ntfs (Microsoft-Windows-Ntfs): The system failed to flush data to the transaction log. Corruption may occur in VolumeId: G:, DeviceName: \Device\HarddiskVolume6. (A device which does not exist was specified.)

I have no idea how the drives are mapped so I don't know which disks "Harddisk2" or "RaidPort1" refer to. However, regarding the Ntfs error, the drive I assigned to G was definitely one of my trouble drives.
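On the mapping question: the number in a path like \Device\Harddisk2\DR2 normally matches the disk number shown in Disk Management (Disk 2 in this case), which is one way to tie a warning back to a physical drive. Below is a minimal sketch that pulls those identifiers out of the warnings above, assuming the messages were copied out of Event Viewer as plain text. The function name and the patterns are made up for illustration and are not part of any Windows or StableBit tool.

```python
import re

# Warnings as copied from the System log above.
EVENTS = [
    r"storahci: Reset to device, \Device\RaidPort1, was issued.",
    r"disk: An error was detected on device \Device\Harddisk2\DR2 during a paging operation.",
    r"Ntfs (Microsoft-Windows-Ntfs): The system failed to flush data to the transaction log. "
    r"Corruption may occur in VolumeId: G:, DeviceName: \Device\HarddiskVolume6. "
    r"(A device which does not exist was specified.)",
]

# Hypothetical patterns, one per identifier type seen in the messages.
PATTERNS = {
    "disk_number": re.compile(r"\\Device\\Harddisk(\d+)\\DR\d+"),
    "raid_port": re.compile(r"\\Device\\RaidPort(\d+)"),
    "volume_number": re.compile(r"\\Device\\HarddiskVolume(\d+)"),
    "drive_letter": re.compile(r"VolumeId: ([A-Z]):"),
}


def extract_identifiers(message):
    """Return {identifier type: value} for every pattern found in one message."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(message)
        if match:
            found[name] = match.group(1)
    return found


if __name__ == "__main__":
    for event in EVENTS:
        source = event.split(":", 1)[0]
        print(source, extract_identifiers(event))
```

Grouping the results by disk number or port makes it easier to see whether the warnings cluster on drives attached to one controller.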

However... at this point, I would recommend running a (deep) memory test.

I ran a couple passes with Memtest86 and they came back with no errors.

StableBit Scanner has a "burst test" option (if you right click on the drive), that is very useful for this. It's recommended to run for at least 24 hours, though.

I'm going to go ahead and run the tests on the drives connected to the HighPoint controller.

If they were/are connected to the HighPoint card, try connecting them to an onboard controller and see if they exhibit the same issues. If the issue doesn't manifest on the different controller, then you have your culprit.

Wish I'd done this from the get-go. As posted earlier, the suspect drives gave me no incidents when I connected them to my USB dock to recover the files. I suppose strictly speaking, the files needed no "recovery"! I'm staring pretty hard at the controller now, which is why I'm running the burst test on the batch of drives I have connected to it before I explore other options.

In this case, it may be a defect in the firmware (maybe a race condition triggered somehow).

I doubt firmware is the culprit because this happened on both a 5 TB Toshiba (MD04ACA500) and a 5 TB WD Red (WD50EFRX). A quick Google search doesn't seem to suggest firmware issues with either drive, but if your experience says otherwise I'm all ears. Also, my laziness is kicking in at this point (after running a memory test, full formats, replacing cables, recovering and rebuilding the pool!); I would like to avoid updating firmware given the higher likelihood that the controller is the point of failure.

Also, from a cable management standpoint, a RocketRAID 2721 (or 2720SGL) is a better purchase. They use SAS cables, which are a bit more expensive, but make cable management a lot simpler.

Thanks for this suggestion. If the burst test comes back with damning evidence against the HighPoint, this will be the first thing I'll do.

Again, thanks for the responses.

April 21, 2015

However, if the drive is having issues with removal, and it's causing the system to hang.... this generally indicates an issue with the disk itself.

I originally thought this as well, until I found a different drive that exhibited the same problem. Specifically: when a file was accessed via the server or a networked computer, it would cause the entire pool to freeze. It was natural for me to implicate the pool in this case, until you mentioned the below...

However, it's possible that a virus scanner could be interfering (it shouldn't based on how the software works, but weirder things have happened...) or it could be an issue with the controller in use.

Perhaps - if you have any ideas on testing the controllers, I'm definitely game for that. My SuperMicro board has 6 ports and a PCIe card provides another 4. It's worth noting that the two problem drives (that I found - could have been more) were both connected to the PCIe controller.

If both disks are/were connected to the same controller, it could be an issue with the controller (and this could definitely cause the system to become unstable).

Which they were. This is a definite possibility, given that once the drives were connected to my computer via a USB HDD dock, they worked just fine. Looking back, I wish I hadn't already formatted the 2 drives and had instead connected them to my server's mobo SATA ports. This will be my first troubleshooting step if I have another round of this.

What I still can't wrap my mind around is why specific files were triggering these incidents. I could reproduce (and even avoid) the freezing without subtracting the drive from the pool. I could even continue to read and write to it with limitations. That said, I understand that this specificity doesn't rule out a controller issue.
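One way to narrow down which specific files trigger a stall is to read each file sequentially and time every chunk. The sketch below is a generic illustration with made-up function names and an arbitrary 5-second threshold. It is not a StableBit tool. Two caveats: a read that hangs outright will hang this script too (the last path printed is then the suspect), and files already in the OS cache may read quickly regardless of the disk or controller.

```python
import time


def time_chunked_read(path, chunk_size=1024 * 1024):
    """Read a file sequentially; return (bytes_read, slowest_chunk_seconds)."""
    total = 0
    slowest = 0.0
    with open(path, "rb") as handle:
        while True:
            start = time.perf_counter()
            chunk = handle.read(chunk_size)
            elapsed = time.perf_counter() - start
            if not chunk:
                break
            total += len(chunk)
            slowest = max(slowest, elapsed)
    return total, slowest


def find_slow_files(paths, threshold_seconds=5.0):
    """Return (path, reason) pairs for files that error out or read slowly."""
    suspects = []
    for path in paths:
        # Print before reading so a hard hang still shows which file caused it.
        print(f"reading {path}", flush=True)
        try:
            _, slowest = time_chunked_read(path)
        except OSError as error:
            suspects.append((path, f"read error: {error}"))
            continue
        if slowest > threshold_seconds:
            suspects.append((path, f"slowest chunk took {slowest:.1f}s"))
    return suspects
```

Running the same file list once with the drive on the add-in controller and once on an onboard port would show whether the slow files follow the drive or the controller.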

As for the UI crashing, could you check the "Application" section of the Event Viewer (run "eventvwr.msc") and see what caused the crash? This may help identify the issue.

While you're there, check the "System" section for disk, controller, or related errors. It may be throwing up errors here.

As soon as I'm done with pulling the files off, I'll be sure to take a peek.

As for the disks and moving files manually triggering the issues ... that definitely sounds bad.

What controller are you using?

SUPERMICRO MBD-X10SLM+-F-O and Rocket 640L. I'm pretty new to controllers and this Rocket is the first I've ever installed. It booted fine upon installation with no incident. I didn't install the drivers it came with because I didn't notice any problems with disk recognition and reading/writing initially. I'll do so once I'm up and running again.

Thanks for your feedback!

April 21, 2015

(edited for formatting)

The forced drive removal option stuck at 0.5% (a 5 TB drive with maybe 20% utilization) for about 10 minutes, then the GUI crashed. I believe I'll have to skip to removing the drive the old-fashioned way.

-

UPDATE:
I was able to start pulling files off of the drive after removing it from the server. The plan is to do this to all of my pooled drives, format them, then rebuild the pool from scratch.

I still have no idea what may have been the cause of my read/write issues. I'm just glad to recover data without the locking up. It's also worth mentioning that at least one problem file that would freeze the pool is fully intact and readable after I pulled it out of the server, so there's no evidence of data corruption. Also, the drives continue to show no SMART errors other than the ones already mentioned.

April 21, 2015

The absolute best luck I've had is with SpinRite, but it isn't free, isn't fast, and I think I read somewhere that it had compatibility issues with drives larger than 2TB, but I can't swear by that one.

Anyway, if you're only looking to get back your non-important stuff, it may not be worth it.

Thanks for the tip. I researched it a bit and discovered that the 2 TB limitation is due to FreeDOS. I could probably figure out how to make an MS-DOS bootable with SpinRite pretty easily, but I'm going to see if I can just yank the drive and use my HDD docking station to pull files off of it. I actually discovered this same problem (reading a file which causes the drive to lock up) on a DIFFERENT drive that was a part of the pool, and I'm reluctant to believe that both drives took a dump and need to be recovered this way.

My replies to Christopher bolded below:

Cancelling the removal should[n't] affect the pool at all. Rebooting... can (but that's true of EVERY disk).

I didn't think so either. I'm chalking it up to a coincidence.

Specifically, when a disk is removed we "balance" files off of the drive. And when we move or copy files, we use a "copytemp" file until we're sure it moved. Once it has, we delete the old file (if necessary) and rename the new file. This way, there is very little chance of corruption if something goes wrong (data integrity is our #1 priority).

This is reassuring and, thanks to your reputation and my impression of DrivePool's quality, I have full confidence in your statement.
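For readers unfamiliar with the copy-then-rename pattern Christopher describes, here is a minimal sketch of the general idea. It is an illustration only, not DrivePool's actual code. The function name, the ".copytemp" suffix, and the byte-for-byte verification step are all assumptions made for the example.

```python
import filecmp
import os
import shutil


def safe_move(src, dst, temp_suffix=".copytemp"):
    """Move src to dst by copying to a temp name, verifying, then renaming."""
    temp = dst + temp_suffix
    # 1. Copy to a temporary name next to the final destination.
    shutil.copy2(src, temp)
    # 2. Verify the copy byte for byte before touching anything else.
    if not filecmp.cmp(src, temp, shallow=False):
        os.remove(temp)
        raise OSError(f"verification failed for {src}")
    # 3. Rename the verified copy into place (replaces any old file at dst).
    os.replace(temp, dst)
    # 4. Only now delete the original.
    os.remove(src)
```

If the process is interrupted at any step, either the original or a complete verified copy still exists, which is why this pattern leaves little room for corruption.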

As for the issues you describe, it definitely sounds like something is wrong.

Do you have any antivirus installed (as sometimes they can cause issues)?

And could you do this: http://wiki.covecube.com/StableBit_DrivePool_Q2159701

I don't run AV on my server (pretty much a LAN-only machine that spends the majority of its uptime backing up and serving files). Following Q2159701 returns: WdFilter, luafv, npsvctrig, FileInfo, Wof. If you spot any suspects for causing my weird access issues, let me know!

Also, does Scanner indicate any SMART errors on one or more drives?

Yes, for one of my Seagates (reallocated sectors and spin retries, but these counts have been stagnant for months). However, I've noticed the pool dying from accessing files from two other drives that aren't throwing errors.

Another thing to do is the "forced damaged drive removal" option. This will skip problem files and just try to remove the disk. This may be what you want to do.

I'm going to go this route and see if it'll get me better luck with yanking the files off the drive. If not, I'm going to remove it and use my HDD dock on my other computer and see if that helps.

Alternatively, try using the "Disk Usage Limiter" balancer, uncheck all the options for the disk in question, and let it sit. This will try to actively balance the data off of the disk in question (depending on your balancing settings).

I suspect this may not work, as certain files from the DrivePool trigger the whole pool to "die" (as described earlier: the server itself, along with my 2 computers that have mapped the pool, will crash as if waiting for the pool to become responsive). I've witnessed this same behavior when I attempt to move the files directly from the drive's hidden PoolPart directory. In case I didn't mention it yet, I've noticed this on TWO drives instead of the one I originally saw this behavior on.

I'll post up with the results of my attempts to recover my files from the drives. Thanks!

April 20, 2015

Good morning everyone. When I try to remove a drive, the program seems to work, showing a burst of I/O until it hits 0.2%. At that point it sticks and no reads or writes are being done. I let it stay in this state for at least 30 minutes before rebooting, after which the drive is still shown in the pool with no apparent complaints.

Earlier this weekend, I attempted to remove this same drive and cancelled it at about 0.7% (yes, it seemed to be progressing normally) because of time constraints. The abort didn't resolve after about an hour so I rebooted, started DrivePool back up, and let it resolve some (duplication?) errors. It seemed to carry on just fine, with the drive still reflected as part of the pool.

I'm new to pooling drives and I'm concerned that my impatience on the first removal attempt corrupted the pool somehow. There are a couple other issues that have arisen since my aborted removal:

Accessing SPECIFIC files from the pool via a networked computer will cause the accessing program, then Explorer on that computer, to crash.
Constant writes from uTorrent (maybe 5+ simultaneous torrents, each both reading and writing) cause the pool to become non-responsive after a few minutes.
When the pool becomes non-responsive, both networked computers that have the drive mapped (my box and my HTPC) will crash (i.e. if I "induce" a crash via uTorrent from my computer, my HTPC which was streaming from my server that runs DrivePool will follow suit).

Any ideas? I'm thinking about pulling the pooled files from the drive's hidden PoolPart directory, then forcing a drive removal, then re-adding them to the remaining pool, and finally re-adding the drive (EDIT: after a format, of course).

EDIT: I concurrently run StableBit Scanner, which doesn't indicate any issues. Then again, I'm not sure if it monitors pool health as much as it does SMART metrics.

---

UPDATE:

I attempted to move the files from the drive's PoolPart and it just fell flat. Canceling the stuck move crashes Explorer, so this pretty much implicates the drive and not DrivePool. It's beyond the scope of this forum, but if anyone has any suggestions for recovering the files on this drive, I'm game. My important files are already duplicated by the pool and are backed up off-site via CrashPlan. Thanks!


Posts posted by aje112

Corrupted Pool?
