Corrupted Pool?

aje112 · April 20, 2015

Good morning everyone. When I try to remove a drive, the program seems to work, showing a burst of I/O until it hits 0.2%. At that point it sticks and no reads or writes are being done. I let it stay in this state for at least 30 minutes before rebooting, after which the drive is still reflected in the pool with no apparent complaints.

Earlier this weekend, I attempted to remove this same drive and cancelled it at about 0.7% (yes, it seemed to be progressing normally) because of time constraints. The abort didn't resolve after about an hour so I rebooted, started DrivePool back up, and let it resolve some (duplication?) errors. It seemed to carry on just fine, with the drive still reflected as part of the pool.

I'm new to pooling drives and I'm concerned that my impatience on the first removal attempt corrupted the pool somehow. There are a couple other issues that have arisen since my aborted removal:

Accessing SPECIFIC files from the pool via a networked computer will cause the accessing program, then Explorer on that computer, to crash.
Constant writes from uTorrent (maybe 5+ simultaneous torrents both reading and writing on each) after a few minutes causes the pool to become non-responsive.
When the pool becomes non-responsive, both networked computers that have the drive mapped (my box and my HTPC) will crash (i.e. if I "induce" a crash via uTorrent from my computer, my HTPC which was streaming from my server that runs DrivePool will follow suit).

Any ideas? I'm thinking about pulling the pooled files from the drive's hidden PoolPart directory, then forcing a drive removal, then re-adding them to the remaining pool, and finally re-adding the drive (EDIT: after a format, of course).

EDIT: I concurrently run StableBit Scanner, which doesn't indicate any issues. Then again, I'm not sure if it monitors pool health as much as it does SMART metrics.

---

UPDATE:

I attempted to move the files from the drive's PoolPart and it just fell flat. Canceling the stuck move crashes Explorer, so this pretty much implicates the drive and not DrivePool. It's beyond the scope of this forum, but if anyone has any suggestions for recovering the files on this drive, I'm game. My important files are already duplicated by the pool and are backed up off-site via CrashPlan. Thanks!

airjrdn · April 20, 2015

The absolute best luck I've had is with SpinRite, but it isn't free, isn't fast, and I think I read somewhere that it had compatibility issues with drives larger than 2TB, but I can't swear by that one.

Anyway, if you're only looking to get back your non-important stuff, it may not be worth it.

Christopher (Drashna) · April 20, 2015

Cancelling the removal shouldn't affect the pool at all. Rebooting... can (but that's true of EVERY disk).

Specifically, when a disk is removed we "balance" files off of the drive. And when we move or copy files, we use a "copytemp" file until we're sure it moved. Once it has, we delete the old file (if necessary) and rename the new file. This way, there is very little chance of corruption if something goes wrong (data integrity is our #1 priority).
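The copy-then-rename pattern described above can be sketched roughly as follows. This is only an illustration of the general technique, not DrivePool's actual code: the `.copytemp` suffix, the `safe_move` and `file_digest` names, the SHA-256 check, and the order of the final rename and delete are all assumptions made for the example.

```python
import hashlib
import os
import shutil


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safe_move(src, dst):
    """Copy src to a temporary name beside dst, verify it, then swap it in."""
    temp_path = dst + ".copytemp"  # hypothetical suffix, for illustration only
    shutil.copyfile(src, temp_path)
    if file_digest(src) != file_digest(temp_path):
        # The copy is bad: discard it and leave the source untouched.
        os.remove(temp_path)
        raise IOError("copy verification failed; source left untouched")
    os.replace(temp_path, dst)  # rename the verified copy into place
    os.remove(src)              # only now remove the original
```

If the process is interrupted before the rename, the original file is still intact and only a stray temporary file is left behind, which is the property the paragraph above relies on.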

As for the issues you describe, it definitely sounds like something is wrong.

Do you have any antivirus installed (as sometimes they can cause issues)?

And could you do this: http://wiki.covecube.com/StableBit_DrivePool_Q2159701

Also, does Scanner indicate any SMART errors on one or more drives?

Another thing to do is the "forced damaged drive removal" option. This will skip problem files and just try to remove the disk. This may be what you want to do.

Alternatively, try using the "Disk Usage Limiter" balancer, uncheck all the options for the disk in question, and let it sit. This will try to actively balance the data off of the disk in question (depending on your balancing settings).

Also, if you're able to get the files off of the disk manually, then you can then format or physically remove the disk from the system. Then you can remove the now "missing" disk from the pool, without any issues.

The absolute best luck I've had is with SpinRite, but it isn't free, isn't fast, and I think I read somewhere that it had compatibility issues with drives larger than 2TB, but I can't swear by that one.

Anyway, if you're only looking to get back your non-important stuff, it may not be worth it.

SpinRite is a good choice.
But yes, it does have issues addressing drives larger than 2TBs. This is due to a limitation in the version of FreeDOS that he's using for the boot disk. He's aware of the issue and working on a fix (for a while now).

http://codeverge.com/grc.spinrite/spinrite-can-only-see-1.8-tb/1651452

aje112 · April 21, 2015

The absolute best luck I've had is with SpinRite, but it isn't free, isn't fast, and I think I read somewhere that it had compatibility issues with drives larger than 2TB, but I can't swear by that one.

Anyway, if you're only looking to get back your non-important stuff, it may not be worth it.

Thanks for the tip. I researched it a bit and discovered that the 2 TB limitation is due to FreeDOS. I could probably figure out how to make an MS-DOS bootable with SpinRite pretty easily, but I'm going to see if I can just yank the drive and use my HDD docking station to pull files off of it. I actually discovered this same problem (reading a file which causes the drive to lock up) on a DIFFERENT drive that was a part of the pool, and I'm reluctant to believe that both drives took a dump and need to be recovered this way.

My replies to Christopher bolded below:

Cancelling the removal should[n't] affect the pool at all. Rebooting... can (but that's true of EVERY disk).

I didn't think so either. I'm chalking it up to a coincidence.

Specifically, when a disk is removed we "balance" files off of the drive. And when we move or copy files, we use a "copytemp" file until we're sure it moved. Once it has, we delete the old file (if necessary) and rename the new file. This way, there is very little chance of corruption if something goes wrong (data integrity is our #1 priority).

This is reassuring and, thanks to your reputation and my impression of DrivePool's quality, I have full confidence in your statement.

As for the issues you describe, it definitely sounds like something is wrong.

Do you have any antivirus installed (as sometimes they can cause issues)?

And could you do this: http://wiki.covecube.com/StableBit_DrivePool_Q2159701

I don't run AV on my server (pretty much a LAN-only machine that spends the majority of its uptime backing up and serving files). Following Q2159701 returns: WdFilter, luafv, npsvctrig, FileInfo, Wof. If you spot any suspects for causing my weird access issues, let me know!

Also, does Scanner indicate any SMART errors on one or more drives?

Yes, for one of my Seagates (reallocated sectors and spin retries, but these counts have been stagnant for months). However, I've noticed the pool dying from accessing files from two other drives that aren't throwing errors.

Another thing to do is the "forced damaged drive removal" option. This will skip problem files and just try to remove the disk. This may be what you want to do.

I'm going to go this route and see if it'll get me better luck with yanking the files off the drive. If not, I'm going to remove it and use my HDD dock on my other computer and see if that helps.

Alternatively, try using the "Disk Usage Limiter" balancer, uncheck all the options for the disk in question, and let it sit. This will try to actively balance the data off of the disk in question (depending on your balancing settings).

I suspect this may not work, as certain files from the DrivePool trigger the whole pool to "die" (as described earlier: the server itself, along with my 2 computers that have mapped the pool, will crash as if waiting for the pool to become responsive). I've witnessed this same behavior when I attempt to move the files directly from the drive's hidden PoolPart directory. In case I didn't mention it yet, I've noticed this on TWO drives instead of the one I originally saw this behavior on.

I'll post up with the results of my attempts to recover my files from the drives. Thanks!

aje112 · April 21, 2015

(edited for formatting)

The forced drive removal option stuck at 0.5% (a 5 TB drive with maybe 20% utilization) for about 10 minutes, then the GUI crashed. I believe I'll have to skip to removing the drive the old-fashioned way.

-

UPDATE:
I was able to start pulling files off of the drive after removing it from the server. The plan is to do this to all of my pooled drives, format them, then rebuild the pool from scratch.

I still have no idea what may have been the cause of my read/write issues. I'm just glad to recover data without the locking up. It's also worth mentioning that at least one problem file that would freeze the pool is fully intact and readable after I pulled it out of the server, so there's no evidence of data corruption. Also, the drives continue to show no SMART errors other than the ones already mentioned.

Christopher (Drashna) · April 21, 2015

aje112,

Well, the manual has a lot of good information, and ... I'm good at digging into how stuff works (it's fun for me, and ... I know Drive Extender from WHSv1 like the back of my hand). And I've talked with Alex (the developer) extensively about how both DrivePool and Scanner work.

However, if the drive is having issues with removal, and it's causing the system to hang.... this generally indicates an issue with the disk itself. However, it's possible that a virus scanner could be interfering (it shouldn't based on how the software works, but weirder things have happened...) or it could be an issue with the controller in use.

If both disks are/were connected to the same controller, it could be an issue with the controller (and this could definitely cause the system to become unstable).

And you've indicated that you're not running any antivirus, and I don't see any filters that ... well, aren't Microsoft filters, so that looks good.

As for the UI crashing, could you check the "Application" section of the Event Viewer (run "eventvwr.msc") and see what caused the crash? This may help identify the issue.

While you're there, check the "System" section for disk or controller, or related errors. It may be throwing errors up here.

As for the disks and moving files manually triggering the issues ... that definitely sounds bad.

What controller are you using?

aje112 · April 21, 2015

However, if the drive is having issues with removal, and it's causing the system to hang.... this generally indicates an issue with the disk itself.

I originally thought this as well, until I found a different drive that exhibited the same problem. Specifically: when a file was accessed via the server or a networked computer, it would cause the entire pool to freeze. It was natural for me to implicate the pool in this case, until you mentioned the below...

However, it's possible that a virus scanner could be interfering (it shouldn't based on how the software works, but weirder things have happened...) or it could be an issue with the controller in use.

Perhaps - if you have any ideas on testing the controllers, I'm definitely game for that. My SuperMicro board has 6 ports and a PCIe card provides another 4. It's worth noting that the two problem drives (that I found - could have been more) were both connected to the PCIe controller.

If both disks are/were connected to the same controller, it could be an issue with the controller (and this could definitely cause the system to become unstable).

Which they were. This is a definite possibility, given that once the drives were connected to my computer via a USB HDD dock, they worked just fine. Looking back, I wish I hadn't already formatted the 2 drives and had instead connected them to my server's mobo SATA ports. This will be my first troubleshooting step if I have another round of this.

What I still can't wrap my mind around is why specific files were triggering these incidents. I could reproduce (and even avoid) the freezing without subtracting the drive from the pool. I could even continue to read and write to it with limitations. That said, I understand that this specificity doesn't rule out a controller issue.

As for the UI crashing, could you check the "Application" section of the Event Viewer (run "eventvwr.msc") and see what caused the crash? This may help identify the issue.

While you're there, check the "System" section for disk or controller, or related errors. It may be throwing errors up here.

As soon as I'm done with pulling the files off, I'll be sure to take a peek.

As for the disks and moving files manually triggering the issues ... that definitely sounds bad.

What controller are you using?

SUPERMICRO MBD-X10SLM+-F-O and Rocket 640L. I'm pretty new to controllers and this Rocket is the first I've ever installed. It booted fine upon installation with no incident. I didn't install the drivers it came with because I didn't notice any problems with disk recognition and reading/writing initially. I'll do so once I'm up and running again.

Thanks for your feedback!

Christopher (Drashna) · April 21, 2015

I originally thought this as well, until I found a different drive that exhibited the same problem. Specifically: when a file was accessed via the server or a networked computer, it would cause the entire pool to freeze. It was natural for me to implicate the pool in this case, until you mentioned the below...

Could be coincidental, but again, hard to tell.

However... at this point, I would recommend running a (deep) memory test. Weird issues like this tend to be memory related. And considering that NTFS caches file access in memory ... it definitely could be the cause.

Perhaps - if you have any ideas on testing the controllers, I'm definitely game for that.

My SuperMicro board has 6 ports and a PCIe card provides another 4. It's worth noting that the two problem drives (that I found - could have been more) were both connected to the PCIe controller.

StableBit Scanner has a "burst test" option (if you right click on the drive), that is very useful for this. It's recommended to run for at least 24 hours, though.

It will detect any read errors with the drive. This can come from the disk itself, the cable connecting the disk or even the controller. So it may be harder to diagnose the issue exactly. But this is a good way to at least start identifying an issue.

Which they were. This is a definite possibility, given that once the drives were connected to my computer via a USB HDD dock, they worked just fine. Looking back, I wish I didn't already format the 2 drives and instead connected them to my server's mobo SATA ports. This will be my first troubleshooting step if I have another round of this.

I'm jumping ahead here a bit ...

If both disks were connected to the Rocket controller, and were having issues, it may be the HighPoint controller. These controllers are on the cheaper side (yes, I know, they're expensive, but for enterprise grade hardware, they're super cheap). I've had issues with these controllers periodically dropping disks, and I've had issues with the controller not liking certain hardware (it would error out a LOT if connected to specific backplanes in my case).

If they were/are connected to the HighPoint card, try connecting them to an onboard controller and see if they exhibit the same issues. If the issue doesn't manifest on the different controller, then you have your culprit.
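One rough way to compare the same drive on two controllers is to read every file back and time it. The sketch below is a plain Python read pass, not StableBit Scanner's burst test, and `read_check` and `slowest` are names made up for this example. Note that a read which hangs at the driver level will hang the script too; with `verbose=True` the last path printed is the file it stalled on.

```python
import os
import time


def read_check(root, chunk_size=1024 * 1024, verbose=False):
    """Read every file under root; return (path, seconds, error) tuples."""
    results = []
    for folder, _subdirs, names in os.walk(root):
        for name in sorted(names):
            path = os.path.join(folder, name)
            if verbose:
                # Print before reading so a stall identifies the file.
                print(path, flush=True)
            started = time.monotonic()
            error = None
            try:
                with open(path, "rb") as handle:
                    while handle.read(chunk_size):
                        pass  # discard the data; only the read matters
            except OSError as exc:
                error = str(exc)
            results.append((path, time.monotonic() - started, error))
    return results


def slowest(results, count=10):
    """Return the `count` slowest reads, slowest first."""
    return sorted(results, key=lambda item: item[1], reverse=True)[:count]
```

Running it once with the drive on the add-in card and once on an onboard port, against the same folder, gives two result lists to compare: files that error or stall on only one of the two point at the controller or cabling rather than the disk.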

Otherwise, try reseating the cables and swapping out the cables. Maybe you have a loose connection or a bad cable (respectively).

What I still can't wrap my mind around is why specific files were triggering these incidents. I could reproduce (and even avoid) the freezing without subtracting the drive from the pool. I could even continue to read and write to it with limitations. That said, I understand that this specificity doesn't rule out a controller issue.

See above.

As soon as I'm done with pulling the files off, I'll be sure to take a peek.

Once the files are off of the drive, try running a full (not quick) format. If you run into any issues, still... then there may be an issue with the drive. In this case, it may be a defect in the firmware (maybe a race condition triggered somehow).

Check to see if there is a firmware update for the drive. If there is, reformat again and see if it fixes it.

Otherwise, .... maybe RMA these drives.

SUPERMICRO MBD-X10SLM+-F-O and Rocket 640L. I'm pretty new to controllers and this Rocket is the first I've ever installed. It booted fine upon installation with no incident. I didn't install the drivers it came with because I didn't notice any problems with disk recognition and reading/writing initially. I'll do so once I'm up and running again.

See above, and then this:

Well, for these controllers... you pretty much always want to get the RocketRAID version. More features and better hardware. But they don't really mention that.

Also, from a cable management point of view, a RocketRAID 2721 (or 2720SGL) is a better purchase. They use SAS cables, which are a bit more expensive, but make cable management a lot simpler.

aje112 · April 21, 2015

As for the UI crashing, could you check the "Application" section of the Event Viewer (run "eventvwr.msc") and see what caused the crash? This may help identify the issue.

Saw nothing in the application log specific to the UI. It was the first time the UI ever crashed on me so I'm not terribly concerned about it.

While you're there, check the "System" section for disk or controller, or related errors. It may be throwing errors up here.

Tons of warnings:

storahci: Reset to device, \Device\RaidPort1, was issued.
disk: An error was detected on device \Device\Harddisk2\DR2 during a paging operation.
Ntfs (Microsoft-Windows-Ntfs): The system failed to flush data to the transaction log. Corruption may occur in VolumeId: G:, DeviceName: \Device\HarddiskVolume6. (A device which does not exist was specified.)

I have no idea how the drives are mapped so I don't know which disks "Harddisk2" or "RaidPort1" refer to. However, regarding the Ntfs error, the drive I assigned to G was definitely one of my trouble drives.

However... at this point, I would recommend running a (deep) memory test.

I ran a couple passes with Memtest86 and they came back with no errors.

StableBit Scanner has a "burst test" option (if you right click on the drive), that is very useful for this. It's recommended to run for at least 24 hours, though.

I'm going to go ahead and run the tests on the drives connected to the HighPoint controller.

If they were/are connected to the HighPoint card, try connecting them to an onboard controller and see if they exhibit the same issues. If the issue doesn't manifest on the different controller, then you have your culprit.

Wish I'd done this from the get-go. As posted earlier, the suspect drives gave me no incidents when I connected them to my USB dock to recover the files. I suppose strictly speaking, the files needed no "recovery"! I'm staring pretty hard at the controller now, which is why I'm running the burst test on the batch of drives I have connected to it before I explore other options.

In this case, it may be a defect in the firmware (maybe a race condition triggered somehow).

I doubt the possibility of firmware being the culprit because this happened on both a 5 TB Toshiba (MD04ACA500) and 5 TB WD Red (WD50EFRX). A quick Google search doesn't seem to suggest firmware issues with either drive, but if your experience says otherwise I'm all ears. Also, my laziness is kicking in at this point (after running a memory test, full formats, replacing cables, recovering and rebuilding the pool!); I would like to avoid updating firmware given the higher likelihood that the controller is the point of failure.

Also, from a cable management point, a RocketRAID 2721 (or 2720SGL) are a better purchase. They use SAS cables, which are a bit more expensive, but make cable management a lot simpler.

Thanks for this suggestion. If the burst test comes back with damning evidence against the HighPoint, this will be the first thing I'll do.

Again, thanks for the responses.

Christopher (Drashna) · April 22, 2015

Saw nothing in the application log specific to the UI. It was the first time the UI ever crashed on me so I'm not terribly concerned about it.

If it only happened the once, and it hasn't happened again, then it may be related to the bad disks.

Tons of warnings:

storahci: Reset to device, \Device\RaidPort1, was issued.

disk: An error was detected on device \Device\Harddisk2\DR2 during a paging operation.

Ntfs (Microsoft-Windows-Ntfs): The system failed to flush data to the transaction log. Corruption may occur in VolumeId: G:, DeviceName: \Device\HarddiskVolume6. (A device which does not exist was specified.)

I have no idea how the drives are mapped so I don't know which disks "Harddisk2" or "RaidPort1" refer to. However, regarding the Ntfs error, the drive I assigned to G was definitely one of my trouble drives.

The RaidPort error may mean first or second disk on the controller.

The disk error indicates the second disk on the system (run "diskmgmt.msc" to list the disks).

The NTFS error ... is clear, and since it confirms the issue... not "important" here.

If the other errors persist after removing the problem disk, then check Disk Management (diskmgmt.msc) and compare the disks.
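With "tons of warnings", it can help to tally which disk numbers and volume letters keep showing up before comparing against Disk Management. The sketch below does this for the three messages quoted above, pasted in as plain text. It assumes the number in `\Device\HarddiskN` is the same number shown as "Disk N" in Disk Management; `summarize` and the two patterns are names invented for this example.

```python
import re
from collections import Counter

# Matches physical disk device paths such as \Device\Harddisk2\DR2.
HARDDISK_RE = re.compile(r"\\Device\\Harddisk(\d+)\\DR\d+")
# Matches the drive letter in the Ntfs warning, e.g. "VolumeId: G:".
VOLUME_RE = re.compile(r"VolumeId: ([A-Za-z]:)")

MESSAGES = [
    r"storahci: Reset to device, \Device\RaidPort1, was issued.",
    r"disk: An error was detected on device \Device\Harddisk2\DR2 during a paging operation.",
    r"Ntfs (Microsoft-Windows-Ntfs): The system failed to flush data to the transaction log. "
    r"Corruption may occur in VolumeId: G:, DeviceName: \Device\HarddiskVolume6.",
]


def summarize(messages):
    """Count warnings per disk number and per volume letter."""
    disks = Counter()
    volumes = Counter()
    for message in messages:
        for number in HARDDISK_RE.findall(message):
            disks[int(number)] += 1
        for letter in VOLUME_RE.findall(message):
            volumes[letter.upper()] += 1
    return disks, volumes


disks, volumes = summarize(MESSAGES)
print("disk numbers:", dict(disks))
print("volume letters:", dict(volumes))
```

If the counts cluster on the disks attached to one controller, that is one more data point against the controller rather than any single drive.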

I ran a couple passes with Memtest86 and they came back with no errors.

That's good, for the most part. If you have the time, running the extended test is worth it. It may find errors the short test may not. But it's best left to overnight, or while you're away from home.

I'm going to go ahead and run the tests on the drives connected to the HighPoint controller.

That is a good idea. If you see the drives disappear... that would be the controller's issue, and power cycling the system would fix it (the HighPoint cards sometimes have issues under heavy load, with some drives).

Wish I'd done this from the get-go. As posted earlier, the suspect drives gave me no incidents when I connected them to my USB dock to recover the files. I suppose strictly speaking, the files needed no "recovery"! I'm staring pretty hard at the controller now, which is why I'm running the burst test on the batch of drives I have connected to it before I explore other options.

If that's the case, then definitely check the connections and try swapping out the cables.

Also, check your power supply. What's the wattage rating on it, ... and what hardware are you using on it?

I doubt the possibility of firmware being the culprit because this happened on both a 5 TB Toshiba (MD04ACA500) and 5 TB WD Red (WD50EFRX). A quick Google search doesn't seem to suggest firmware issues with either drive, but if your experience says otherwise I'm all ears. Also, my laziness is kicking in at this point (after running a memory test, full formats, replacing cables, recovering and rebuilding the pool!); I would like to avoid updating firmware given the higher likelihood that the controller is the point of failure.

As for the firmware, it's always a good idea to check when you're having issues. Just in case. Same with the controller card in fact.

At least check. Some companies will include a changelog. See if anything in that points to the issues you're seeing. If not, then skip it for now.

Thanks for this suggestion. If the burst test comes back with damning evidence against the HighPoint, this will be the first thing I'll do.

Again, thanks for the responses.

And you are very welcome.

And yes, troubleshooting an issue like this can be tiring and frustrating. Very.

Hopefully, we can help you get the issue identified and resolved soon!

aje112 · April 22, 2015

If that's the case, then definitely check the connections and try swapping out the cables.

Also, check your power supply. What's the wattage rating on it, ... and what hardware are you using on it?

I actually replaced all the cables. While I was pulling the drives, a wave of OCD hit and I decided to finally get the drives' cables on some consistent color scheme.

The PSU is a 450W Seasonic that powers the i3, 8 drives, the stock HSF, and 6 fans. Pulls less than 55 watts under load. I read somewhere to allocate about 10W per drive so I'm thinking I've got plenty of leeway.

As for the firmware, it's always a good idea to check when you're having issues. Just in case. Same with the controller card in fact.

Will do, especially for the HighPoint card. I reckon I'll go ahead and install the drivers for it too.

Once I get the pool up and running, I'll follow up with some comments on the burst test results and rebuilt pool stability before the end of the week. I really appreciate the time and direction you've given me here. I really want the server to be as stable as possible and you've gone well beyond DrivePool to help me think about some things to achieve just that. Thanks!

Christopher (Drashna) · April 22, 2015

That PSU should be good for that number of drives.

And yes, 8-10W per drive. That's about what I get from my NAS drives (I've paid attention as I've added them, and that's how much usage they bump up each time I add one).
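As a quick back-of-the-envelope check of that rule of thumb against the 450W unit mentioned earlier, here is a small sketch. It only covers the drives (CPU, fans, and everything else are ignored), and `drive_power_range` is a name made up for the example.

```python
def drive_power_range(drive_count, low_watts=8, high_watts=10):
    """Return the (low, high) estimated draw in watts for the drives alone."""
    return drive_count * low_watts, drive_count * high_watts


psu_watts = 450  # rating quoted earlier in the thread
low, high = drive_power_range(8)
share = high / psu_watts
print(f"8 drives: {low}-{high} W, at most {share:.0%} of a {psu_watts} W PSU")
```

So even at the high end of the estimate, eight drives account for well under a quarter of the PSU's rating.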

As for the HighPoint card, yeah, definitely install the drivers for that! It may help with stability, actually.

And you are very welcome. And to be honest, I don't really see it as above and beyond, as the hardware is part of the "experience" (for lack of a better word). Since we deal with that as well, making sure it works right is just as important as making sure our software works right, too.

But we are glad that you do appreciate that we do try to do whatever we can to get your system working and stable!

aje112 · April 24, 2015

I'm going to jump the gun and post my conclusion here: my issues were most likely caused by my add-in controller, even after installing its driver. I haven't yet run the burst tests, but here is a list of events that point to it:

When I first built the pool and relocated some pooled drives onto the new controller, those exact drives began to exhibit file access issues. I didn't mention this when I originally posted because I didn't notice issues immediately after doing so.
Removing those same drives and reading them with my USB HDD dock exhibited no access issues (nor corruption).
When formatting a DIFFERENT set of drives on the suspect controller, the formats would freeze. This was attempted three times.
After flashing the firmware, the formats appear to be progressing normally and are further along. (To be clear, updating the DRIVER seemed to make no difference.)
Currently, my pool is rebuilt from drives using the onboard controller only. There are no I/O issues.

I could have done more to isolate the controller as the point of failure, but at this point, noticing how everything works perfectly fine until it's connected to the Rocket 640L is good enough for me. I guess by shopping for a cheaper card, I paid the price. So far, the fix appears to be to flash the stock firmware. For anyone looking, the Rocket 640L (not to be confused with the RocketRAID 640L) uses the Marvell 88SE9230 controller.

I've started a separate pool for the controller to verify controller stability before I let that controller contribute to my main pool. Also, burst tests will be underway soon.

My thanks to Christopher for his patience and guidance. To show my appreciation, I've tried to be thorough in this thread in hopes that anyone who runs into a similar problem with the same card finds this thread.

- AJ

Christopher (Drashna) · April 24, 2015

Ouch, and .... I can't say that I'm really surprised. I've had a HighPoint card before ... and it wasn't exactly stable.

If you replace the card, well, it depends on how many ports you want (if you want a lot, it may be worth looking into an LSI card, such as the IBM ServeRAID M1015, and flashing them).

airjrdn · April 24, 2015

If it helps, I'm running three IO Crest 4 port cards in my machine with 11 total drives so far and have yet to have a single issue. Might be worth looking at if you're looking to replace the HighPoint card.
