
Feature request: Byte to Byte comparison of duplicated files


abudfv2008

Question

Hello.

I used to make a file comparison of my data with FreeFileSync (I make backups by just mirroring the needed data to a NAS). I was forced to do so because I sometimes got corrupted photos (RAW). It is rare, and the drives are OK (of course they have some degradation, but they will live another 5 years at least), but since there are a lot of files and their number constantly grows, the chance of corruption rises too. The problem was that I learned about a corrupted file only when I tried to export the photo from Lightroom. I would look for the photo in the backup... oops, it is corrupted too, because the bad copy replaced the good one during the last sync. So now I run a bitwise comparison before syncing.

 

So the problem is that the duplicated data is spread among several disks and I can't compare the copies with FreeFileSync or another tool. It would be very good to implement some kind of bitwise comparison (at the folder level) to be able to find such corrupted duplicates.

 

Regards.

 

P.S. I think I posted in the wrong section. This should be in the DrivePool discussion.


  • 0

Well, there are cases where the pool does this anyway.

 

We check the date stamp, and if that doesn't match, we check the CRC of the files.  This happens when accessing the files, or when running a duplication pass (which happens if realtime duplication isn't enabled, or duplication settings get changed, etc).

 

 

That said, you can always check the files in the hidden "PoolPart.xxxx" folders. Just keep in mind that the exact folder locations may change. 

 

 

As for checksumming all of the files, this is something that has been heavily requested, and may be a part of the next product released: 

http://wiki.covecube.com/Development_Status#StableBit_FileVault


  • 0

Well.

 

Maybe I haven't explained it clearly. Sorry, English isn't my native language.

FileVault is potentially a good product, but it solves a slightly different task. It is more like creating hash (SHA/MD5) files alongside the protected ones. That is good for sharing software, but not for ordinary use. Other software knows nothing about these md5 files, so when I move a data file to another place in Lightroom, the md5 file stays where it was and becomes useless.

 

Checksum/hash comparison of the files against a backup or synced copy is not a problem; there are tools to do that.

The problem is that the DrivePool duplicates of my files are spread across 6 disks. So it is impossible to do a folder comparison of the PoolPart.xxxx folders with FreeFileSync or any other tool (I would need to rebalance the folder to use only 2 drives to do that). Another problem is that if I use 3x duplication, then I have to compare twice (and also need a rebalance to use 3 drives).

 

Another potential problem is that when I run a hash comparison against my backup, corrupted data can be missed, because the good "stripe" may be read during the comparison, while the bad "stripe" could have been read at the time of the backup.

 

So IMHO it is much better to implement hash comparison of duplicates in DrivePool itself.
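
For anyone who wants to do this by hand in the meantime, here is a minimal Python sketch of the idea: walk every hidden PoolPart folder, group files by their path relative to that folder, and compare each group byte for byte. It assumes that duplicates sit at the same relative path under each PoolPart root. The function name `find_mismatched_duplicates` is made up for illustration, and this is not how DrivePool works internally.

```python
import filecmp
from collections import defaultdict
from pathlib import Path


def find_mismatched_duplicates(pool_roots):
    """Return relative paths whose copies differ byte for byte.

    pool_roots: the hidden PoolPart folders, one per disk. Assumption: the
    same relative path under each root is a duplicate of the same file.
    """
    copies = defaultdict(list)
    for root in map(Path, pool_roots):
        for path in root.rglob("*"):
            if path.is_file():
                copies[path.relative_to(root)].append(path)

    mismatched = []
    for rel, paths in sorted(copies.items()):
        first, rest = paths[0], paths[1:]
        # shallow=False forces a content comparison instead of trusting
        # file size and modification time alone.
        if any(not filecmp.cmp(first, other, shallow=False) for other in rest):
            mismatched.append(rel.as_posix())
    return mismatched
```

You would pass one PoolPart folder per disk, however many disks the pool has, so no rebalancing is needed and 3x duplication is handled in one pass. A mismatch only says that the copies differ; it does not say which copy is the good one.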


  • 0

Sorry, I left a few things out of my explanation because I took some of what I know for granted.

 

First, StableBit DrivePool does do some checking.  When accessing files or during a duplication pass, it checks the date modified information on the files.  If this matches, then it moves on, but if it doesn't match, then the software does a CRC comparison of the files. This isn't quite a binary comparison, but it should be close enough to verify integrity.
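
As a rough illustration of that check (not DrivePool's actual code, which runs in the driver), the logic could look something like this in Python. The function names are hypothetical, and `zlib.crc32` simply stands in for whatever CRC the product really uses.

```python
import os
import zlib


def crc32_of(path, chunk_size=1 << 20):
    """CRC-32 of a file, read in chunks so large files fit in memory."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def copies_consistent(path_a, path_b):
    """Cheap check first; read the contents only when the dates disagree."""
    if os.stat(path_a).st_mtime_ns == os.stat(path_b).st_mtime_ns:
        return True  # timestamps match: accepted without reading contents
    return crc32_of(path_a) == crc32_of(path_b)
```

Note that in this sketch matching timestamps skip the CRC entirely, so a copy that was corrupted on disk without its timestamp changing would still pass.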

 

This happens automatically and invisibly, until there is an issue and then it gets brought up in the UI for resolution. 

 

And the reason that we choose to perform the checks this way is that running a comparison for each and every file, even "just" a CRC, would HAVE to occur in the kernel, and this would SIGNIFICANTLY slow down access to the pool. I mean, in many circumstances, it would actually render the software next to unusable.  If you want to see what I mean, enable Driver Verifier for "covefs.sys", reboot and then use something to index your pool.  It will render the pool next to unusable.  Running CRC or any sort of comparison (especially before accessing the file) will cause similar behavior.

 

 

 

The other option here is to run a check periodically, as a system service (so that you don't need to be logged in).  This would still be processor and disk intensive, but there are things we can do to help minimize the impact here. 

The thing is, this is good for more than just StableBit DrivePool, and would be good for just about *any* system, with or without StableBit DrivePool.   Because of that, any sort of checksumming/hashing utility would be better in a separate product.

 

 

Additionally, given some of the design ideas for StableBit FileVault, it would fit much better into that product than into StableBit DrivePool.

 

Specifically, one of the things that has been discussed at length is using "file system filters".  File system filters are hybrid driver/service code. A filter sits on top of the file system driver and intercepts all communication.  Antivirus software uses these for real-time protection.  But we could use this to create and verify the checksum of files, while minimizing the performance impact.

 

 

 

The last issue is where to store the checksums/hashes.  There are a couple of ways that we can do this, and you've mentioned one already: md5 or similar files stored locally.  That is definitely an option, but like you said, it isn't necessarily the best.

The other options are to use a local database (such as SQLite), or even to store hidden file objects (such as "Alternate Data Streams", which would be moved with the file in most cases). 
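
To make the local database option a bit more concrete, here is a minimal sketch in Python using the standard `sqlite3` and `hashlib` modules. The table layout and the function names (`open_store`, `record`, `verify`) are assumptions made up for illustration, not anything from StableBit FileVault.

```python
import hashlib
import sqlite3


def sha256_of(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def open_store(db_path):
    """Open (or create) the checksum database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checksums ("
        "path TEXT PRIMARY KEY, sha256 TEXT NOT NULL)"
    )
    return conn


def record(conn, path):
    """Store the current hash of a file, replacing any earlier entry."""
    conn.execute(
        "INSERT OR REPLACE INTO checksums (path, sha256) VALUES (?, ?)",
        (str(path), sha256_of(path)),
    )
    conn.commit()


def verify(conn, path):
    """True if the file matches its stored hash, False if not, None if unknown."""
    row = conn.execute(
        "SELECT sha256 FROM checksums WHERE path = ?", (str(path),)
    ).fetchone()
    if row is None:
        return None
    return row[0] == sha256_of(path)
```

Because this sketch keys the table on the file path, it loses track of a file when another program moves it, much like the sidecar md5 files. A real design would need some way to follow moves, which is where the file system filter or the Alternate Data Streams idea would come in.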

 

 

 

 

The point here is that this isn't a small feature that you're asking for. It's a hugely massive one. To the point that it may be better to include it as a separate project/product rather than trying to cram it into StableBit DrivePool.

 

That said, this is something that I *personally* want to see. If not what you're looking for exactly, then something close to it. 


  • 0

Hello Christopher,

 

Are you going to work on StableBit FileVault?

I think data integrity and file protection are very important for most of your clients.

 

I don't do any of the development/coding. That's not my skillset (I mean, I know how, I'm just not good at it, especially for something "from scratch").

 

That said, right now, the priority is new release versions for StableBit DrivePool and StableBit Scanner, because it's been ... a long while. 

 

After that, it really depends on what Alex decides.  He's the owner and developer, so it's his call.  But I agree, and I am pushing for StableBit FileVault, as soon as possible. 


  • 0

Hello Christopher,
 
1. Many people use DrivePool together with SnapRAID because DrivePool does not provide data integrity protection, but for me DrivePool + SnapRAID over-complicates things.

 

2. What will be the main new features and benefits of the new release versions of StableBit DrivePool and StableBit Scanner?


  • 0

  1. We definitely understand that. 

    StableBit DrivePool does compare CRC values, but it does that when the dates don't match up (IIRC). 

     

     

  2. You mean the current betas, or the future? 

     

    If you mean current, then it's mostly bug fixes for both. 

    However, DrivePool features preliminary ReFS support and hierarchical pooling (sub pools, e.g. adding a pool to another pool).  And we do plan on improving and fully implementing both.

    As for StableBit Scanner, mostly bug fixes for now, but we do plan on adding ReFS support and a number of other things (potentially, planned).
