Covecube Inc.
RobbieH

I'm at a loss

Question

I did not want to blame DrivePool. I've been using it for years and have been a supporter the entire time, but something's going wrong with my servers and I think it's due to the latest beta. I upgraded a while back based on a recommendation, but now I can't keep my servers up. Don't worry, I'm not upset; I can recover from whatever happens, but it's getting tiring trying to figure out what's going on.

 

It all started when one of my WD 5TB Red drives started throwing SMART errors on "server 1", as reported by Scanner. Not long after that, I started getting corruption errors. To make matters worse, the STxxx blah blah 3TB drive I use as a dup drive on a second VM also started failing and throwing errors. Two problems at once. Joy.

So, I pulled both drives from "server 2" and replaced them with 4TB HGST NAS drives. OK, so that one has been stable since. But I'm still fighting issues on "server 1". I bought an 8TB archive drive, since I was running low on space anyway. It took me SEVERAL days to get all the data to the point where I could remove the 'failing' WD 5TB drive. But finally, after hours and hours of chkdsk /f and chkdsk /r and trying to move data, rinse, repeat, I got that drive removed. OK, things look good.

I go a couple of days and it seems things have stabilized. Oh, I spoke too soon. Data is corrupting not only on the other 5TB drive, but also like crazy on the new 8TB drive. I can't do a dang thing. I can't even breathe in the general direction of the box without getting data corruption and everything screwed up.

 

I don't know what logs you need, or what ANYTHING you need. You know I've supported DrivePool and Scanner. Maybe I'm wrong and it's not at fault. But I can't think of what else it might be. I haven't jacked around with anything that I know of.


21 answers to this question


Just a quick question: have you looked at the Event Viewer to see if there are any IDE/ATAPI/disk-type errors? That might point to hardware issues in the server aside from the HDDs, such as the controller.


Just took a look, no errors to be concerned about. A few for not being able to set up shares due to missing folders (of course) and looks like I have an NTP issue, but no hardware related or IDE/SCSI errors.


What are the corruption errors?

 

Can you be more specific?

 

If you have corrupt files, then DP is going to duplicate them, and any rebalance is going to move them around between disks, which, depending on how the errors are picked up, would lead to lots of warnings, etc.


Things go wonky: DP won't duplicate and says "errors on disk".

 

I reboot the computer.

 

CHKDSK runs for hours, and I mean like 12 hours. All sorts of data corruption issues are repaired. Run for a while, BAM, data corrupted again.


Your description reminds me of a file server where the fault was eventually traced to the hotswap rack the drives were kept in. Nothing showed up in event viewer (other than SMART reporting that yet another drive was dying) because the cause was below the OS level, and it killed several drives before everything else (including motherboard) was finally ruled out and the Rack Of Death was consigned to the trash.

 

So if there's anything - and I mean anything - between your drives and the motherboard, consider it a possible suspect.


Spider, all the common CHKDSK errors. File descriptor errors, security errors, truncated files, orphaned files... I got 'em all.

 

The only thing between the motherboard and the hard drives are SATA cables.


[Screenshot]

[Screenshot]


VMware Remote Console? Why are you using that?

 

Is DP in a VM?

 

When you run CHKDSK, do you have the DP service stopped?

 

Have you checked for viruses?

 

What else is running on the server besides Scanner and DP - anything data intensive?


Because it's a VM. 

 

DP is in a VM. I have talked many times with Drashna about it. It's configured correctly, and has been running right for years.

 

CHKDSK runs at pre-boot; DP can't start until Windows is booted.

 

Yes, I have checked for viruses.

 

Plex is the main application on this server, but it's been there since day 1. The only other app is WMC.


You might want to try new SATA cables anyway; they're cheap and while it's very rare they can still go bad.

 

I'm under the impression that DrivePool's communication with the physical drives is strictly via the OS's NTFS subsystem, so it's not going to cause SMART errors other than via wear (like any other app), because SMART operates at the hardware level. So the fact that your problems started with a drive throwing SMART errors is a bit of a sign.

 

If you have spare SATA ports, try avoiding the port you had the faulty drive plugged into; if you have two SATA controllers on your board, try avoiding the controller you had the faulty drive plugged into. See if those make any difference.

 

Has anything else updated recently - the VM software or the board drivers? And if you're running Windows 10 check if it's updated a driver behind your back, because that's something I've had to deal with repeatedly.


When I started seeing problems, I checked them all. But again, I am down to just one drive.

 

And even now it is flaky as hell. The server now thinks I have no drives in the pool. I clicked "Add", and it installed a driver for Covecube DrivePool. I'm telling you, something is flaky.

 

In other words, the "non-pooled" drives are down to just C:, and there are no drives in the pool. My H: drive shows up, but is no longer in the pool and it also cannot be added to the pool.

 

So I guess I'm now down to zero drives.

 

Is Christopher gone from the forum?

 

Now this is getting frustrating because it's completely unstable.

 

Unhappily, I have now uninstalled DrivePool from this server. I have recopied from backups multiple times just to lose the data again, and the stuff I did not have backups of is gone for good (which I blame on myself) due to corruption.

 

NO WAY. I just removed DrivePool and now that drive is blank too, asking to be formatted. It cannot be a coincidence that I lost so many drives all at once, and only the drives in the pools. It is reporting back as RAW; the partition is gone. I'm now toast.


Is Christopher gone from the forum?

 

Check the "Off Topic" section. I've had a bout of medical issues and haven't been able to get to the forums reliably.

If it's urgent, open a ticket. We get to those more reliably (e.g., Alex checks there if I'm unable to work for an extended period of time).

 

That said, I'm sorry to hear about all of the disk errors. 

 

As for DrivePool, we do a number of things to help prevent corruption.  For instance, when moving files around, we create a new copy of the file with a "COPYTEMP" extension.  Once that file is copied, then we manage the old file.  This should prevent corruption, as we do verify that the copy completes properly. 
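For readers unfamiliar with the pattern, here is a minimal Python sketch of a "copy to a temporary name, verify, then deal with the original" move. Only the `.COPYTEMP` suffix comes from the post; the function names, the SHA-256 comparison used as the "verify" step, and the rename-then-delete order are assumptions for illustration. This is not DrivePool's actual code.

```python
import hashlib
import os
import shutil

# Suffix taken from the post; everything else here is a hypothetical illustration.
TEMP_SUFFIX = ".COPYTEMP"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large media files don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def move_with_verify(src, dst):
    """Hypothetical sketch: copy under a temp name, verify, then swap into place."""
    tmp = dst + TEMP_SUFFIX
    shutil.copy2(src, tmp)  # 1. write the new copy under a temporary name
    if sha256_of(src) != sha256_of(tmp):  # 2. assumed verification: compare hashes
        os.remove(tmp)  # discard the bad copy; the original is untouched
        raise OSError(f"copy verification failed for {src}")
    os.replace(tmp, dst)  # 3. rename the verified copy to its final name
    os.remove(src)  # 4. only now remove the original
```

Because the original is only deleted after a verified copy is in place, an interrupted move leaves at worst a stray `.COPYTEMP` file rather than a half-written destination. Note that reading a file back immediately after writing it may be served from the OS cache, so a check like this guards against copy errors more than against a flaky controller or cable.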

 

 

 

That said, given your issues, I'd suspect cables, the controller, or the power supply. StableBit Scanner has a "Burst Test" option, which is useful for detecting "transmission errors".

 

But as for the drive going "blank": it's gone "RAW". In some cases, a CHKDSK pass will fix this, but it generally requires data recovery.

 

As for why only the "pooled" drives may be affected: these drives are going to be more active than others in the system, and that could increase issues, ESPECIALLY if this is a controller-related issue.


I am very sorry to hear about your medical issues. 

 

I suppose it could be a controller issue, but other drives in this box are working as expected, just on different VMs. 

 

I'm nixing the idea of a cable issue; I think it would be very unusual for three cables to go bad at the same time.

 

For now I'm going to leave DP out, and might even remove Scanner, just to ensure these are not the cause of the issues. If I have more problems, I'll know that the problems can't be DP or Scanner.


Thanks. Hopefully, no more complications and no more hospital trips....

 

 

 

As for the issues.  Yeah, three cables going bad at the same time is pretty unlikely (unless it's a SAS breakout cable).  

 

But yeah, it would be best to test one thing out at a time. 

 

If you're able to pinpoint the issue, please do let me know. 


So it has been over a week without DP installed, and zero corruption issues. I am going to put back ONE drive today, with a new cable, and pool across them. I really need to do this anyway because the only existing drive right now is an 8TB Seagate archive drive, and it's too slow to be a recording drive for WMC and the other tasks I need to do. I'm taking your advice and partitioning off 1TB of the 5TB WD Red to set up as "SSD", and will also use that partition for WMC's recorder.

 

This is using an Intel S1200RP motherboard and the onboard SATA controllers. 

 

Another VM, with two HGST NAS drives running WSE2012 and DP is still going strong, no problems at all.


Right now I'm running with the one drive added, only using it for recording from WMC and for MCEBuddy to read from. I have not re-established the pool, I want to make sure that the drive is 100% stable before I do. So far so good, but we will see. I'm going out of town this weekend so it'll continue to get used over the weekend. Maybe when I return on Tuesday I'll create a pool again.

 

Oh, Scanner is testing the drive in the background too, so that's a little more that's going on as a test.

 

You guys have me worried that I might have a controller problem, so I'm very nervous about putting the pool back.
