
Newbie question - File integrity checks?


kitonne

Question

I can set file duplication to 3 (default), but I cannot find any tools to check whether the files stay consistent across all three copies.  Assuming I use NTFS, TeraCopy can create MD5 checksums and store them in an ADS for each file, which can be verified by the MD5 verifiers linked below (as well as by TeraCopy itself).  TeraCopy can also create an MD5 master file to compare against each file at a later date, but since DrivePool reads a file from any of the three copies at random, I can still get corrupted data if the three copies diverge and the MD5 happens to be calculated against that same copy of the data, because I cannot specify that I only want to see copy #1, #2 or #3 of a specific file when I do a pool scrub.

https://github.com/TalAloni/MD5Stream

https://github.com/Y0tsuya/md5hash
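For illustration, a minimal sketch of the kind of external check I have in mind, assuming the copying tool stores the MD5 hex digest in an alternate data stream named "MD5" (the actual stream name may differ):

```python
import hashlib
import sys

def read_ads_md5(path, stream="MD5"):
    """Read an MD5 hex digest stored in an NTFS alternate data stream.
    The stream name "MD5" is an assumption - adjust it to whatever the
    copying tool actually writes."""
    with open(f"{path}:{stream}", "r") as f:
        return f.read().strip().lower()

def compute_md5(path, chunk_size=1 << 20):
    """Recompute the file's MD5 in 1 MiB chunks to keep memory use flat."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    target = sys.argv[1]
    print("OK" if read_ads_md5(target) == compute_md5(target) else "MISMATCH")
```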

If you move a couple of TB of data, sooner or later you will see bits silently flipped.  In some older threads I found notes about a High Integrity Disk Pool, but they are a couple of years old, and I wonder if there are any current plans to improve the data integrity checks.

DriveBender does have a "pool integrity check" in its main interface (I did not find a clear description of what it does), but you can only have one set of duplicated files, and its support seems to be winding down, though its latest release is only a couple of months old.

Size and date checks are NOT data integrity checks - a binary compare, or CRC64 / MD5 / SHA / etc. checksums, are what I am looking for to confirm that the 3 copies are indeed in sync (and using a 2-out-of-3 rule to replace / fix corrupted copies is a small step afterwards).
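To make the 2-out-of-3 idea concrete, a rough sketch (the paths would be placeholders for the three hidden PoolPart copies; this is not a DrivePool feature):

```python
import filecmp

def majority_check(copy_a, copy_b, copy_c):
    """Byte-for-byte pairwise compare of three copies of the same file.
    Returns the list of paths that disagree with the majority (empty if
    all three copies match)."""
    ab = filecmp.cmp(copy_a, copy_b, shallow=False)
    ac = filecmp.cmp(copy_a, copy_c, shallow=False)
    bc = filecmp.cmp(copy_b, copy_c, shallow=False)
    if ab and ac:
        return []            # all three agree
    if ab:
        return [copy_c]      # c is the odd one out
    if ac:
        return [copy_b]      # b is the odd one out
    if bc:
        return [copy_a]      # a is the odd one out
    raise RuntimeError("No majority - all three copies differ")
```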

StableBit Scanner is going in across a couple of my systems (separate from and independent of DrivePool) - it is reasonably priced, and it is easier to set up for repetitive tasks than HD Sentinel (though you still want HDS for data repair) - but it is no substitute for data integrity checks.

1/  Is there any way to get data copy #1, #2 or #3 from DrivePool, instead of a random copy, so I can implement external data integrity checks for the data stored in a pool (3 copies in this example)?

2/  Are there any plans for internal data integrity checks between the multiple copies in a pool (binary compare, MD5, CRC64, whatever)?

3/  I did not find a clean way to replace a bad disk - I was expecting a "swap the disk and run a duplication check" workflow to make sure the files in the pool get the specified number of copies.  It looks like once a disk is removed, the rest of the pool is read-only and you have to jump through hoops to restore r/w functionality.  Is there a way to just remove a bad drive from the pool?  Adding a new one is easy...

4/  For 3x file redundancy, using 5 physical disks (same size), how many disks may fail before I lose data?  In other words, is there a risk of having 2 out of 3 copies on the same disk?

Thank you!


3 answers to this question

Recommended Posts

3 hours ago, kitonne said:

1/  Is there any way to get data copy #1, #2 or #3 from DrivePool, instead of a random copy, so I can implement external data integrity checks for the data stored in a pool (3 copies in this example)?

Not directly.  However, the dpcmd tool does have some options that will list the full location of each file.  Specifically "dpcmd get-duplication (path-to-file)".
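For what it's worth, a minimal sketch of wrapping that command from a script; parsing dpcmd's output is deliberately left out here, since its exact format isn't reproduced in this thread:

```python
import subprocess
from pathlib import Path

def get_duplication_info(pool_file):
    """Run 'dpcmd get-duplication <file>' and return its raw text output.
    Extracting the per-copy paths from that text is left to the caller."""
    result = subprocess.run(
        ["dpcmd", "get-duplication", str(pool_file)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def walk_pool(pool_root):
    """Query duplication info for every file on the pool drive."""
    for path in Path(pool_root).rglob("*"):
        if path.is_file():
            yield path, get_duplication_info(path)

# Example: for path, info in walk_pool("P:/"): print(path); print(info)
```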

3 hours ago, kitonne said:

2/  Are there any plans for internal data integrity checks between the multiple copies in a pool (binary compare, MD5, CRC64, whatever)?

In some cases, it does run a CRC comparison.  But adding more at runtime is problematic, since the driver runs in the kernel, so any operations need to be very quick.  Computing hashes of files is inherently very expensive.

If this sort of functionality is added, it won't be directly as part of StableBit DrivePool, for this reason, and others. 

3 hours ago, kitonne said:

3/  I did not find a clean way to replace a bad disk - I was expecting a "swap the disk and run a duplication check" workflow to make sure the files in the pool get the specified number of copies.  It looks like once a disk is removed, the rest of the pool is read-only and you have to jump through hoops to restore r/w functionality.  Is there a way to just remove a bad drive from the pool?  Adding a new one is easy...

Checking both the "Force damaged disk removal" and "Duplicate data later" options should make the removal happen much faster.  But it will still move data off the drive, if needed.

Otherwise, data would be left on the disk, and if it's not duplicated data... 

That said, you can immediately eject the disk from the pool using the dpcmd tool.  However, this does not move ANY data from the drive; doing that will require manual intervention.  Also, the disk still needs to be writable (it basically writes a "not part of a pool" tag to the PoolPart folder on the disk).

3 hours ago, kitonne said:

4/  For 3x file redundancy, using 5 physical disks (same size), how many disks may fail before I lose data?  In other words, is there a risk of having 2 out of 3 copies on the same disk?

2 disks.  In general, X-1 disks, where X is the duplication level.  So you can lose a number of disks equal to one less than the level of duplication.

(Also note that no duplication is effectively a duplication level of 1, so it can tolerate 0 disk failures.)
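Stated as a trivial rule of thumb (a sketch, assuming copies always land on distinct physical disks):

```python
def max_disk_failures(duplication_level: int) -> int:
    """Disks you can lose without losing any file, assuming every copy
    of a file sits on a different physical disk."""
    return max(duplication_level - 1, 0)

assert max_disk_failures(3) == 2   # x3 duplication on 5 disks: any 2 can fail
assert max_disk_failures(1) == 0   # no duplication: no tolerance
```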

And StableBit DrivePool is aware of partitions, and will actively avoid putting copies of a file on the same physical disk.  This is part of why we don't support dynamic disks, actually (checking this becomes immensely more complicated with dynamic disks, and a lot more costly, since it is also done in the kernel driver).

Also, even if you lose "too many disks", the rest of the pool will continue to work, with the data that is on the remaining disks. 

 



An internal data integrity check would have to be either a low-priority task run in the background at all times, or a priority "scrub" command which takes 50+% of the CPU time and is scheduled for a weekend (or more).  A full scrub of a zraid 6 volume with 12 TB of data takes less than one hour on an old HP MicroServer with a dual-core AMD CPU (5 x 4 TB physical disks).  Most modern CPUs have hardware accelerators for hash computations.
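For example, the kind of external low-impact scrub pass I am thinking of could look roughly like this (the pause between chunks is an arbitrary throttling knob, nothing DrivePool-specific):

```python
import hashlib
import time
from pathlib import Path

def throttled_md5(path, chunk_size=1 << 20, pause_s=0.01):
    """Hash a file in 1 MiB chunks, sleeping briefly between chunks so a
    background scrub does not saturate the disks or the CPU."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
            time.sleep(pause_s)
    return h.hexdigest()

def scrub(root):
    """Walk a tree and yield (path, digest) pairs for later comparison."""
    for p in Path(root).rglob("*"):
        if p.is_file():
            yield p, throttled_md5(p)
```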

dpcmd can indeed provide the location of specific files, but automating the process via scripts would be quite a task.  Thank you for confirming that there is no easy way to add a "return copy #... first" box to the main DrivePool interface.

Thank you for your detailed replies, and best regards!



Oh, definitely!  Balancing and duplication are already low-priority tasks, to prevent performance impact.

But the data scrubbing part, or even just reading the checksums on the files, is not low impact.  A good example of this is the measuring pass in DrivePool.  You may notice that it HAMMERS the system when measuring the pool.  This is because it opens each file to read its attributes, which is very expensive in terms of system resources.  The more small files you have, the worse it is.  And that isn't even counting any sort of hashing, just opening the files.

 

Also, an in-depth integrity checker is something I'd love to see, personally.

