
Alternatives to redundant/identically named file handling


baccarat

Question

When a pool encounters redundant files (or just identically named ones), it considers only one. Scenario:

 

Drive A has files: hello.txt, world.txt

Drive B has files: bye.txt, world.txt

 

Pool Z shows files: hello.txt, bye.txt, world.txt.

 

world.txt exists on both drives, but the pool only shows one of the copies. I know it would not be possible for the pool to have multiple identically named files, but is there a way to handle this so both copies would show in the pool?

14 answers to this question

You'd need to rename one of the files.
 
Specifically, this is how we store duplicated files: they are stored in identical paths (at least identical under the PoolPart folder) and with the same name. The driver determines which file to read from, or whether to read from both (the "Read Striping" feature makes that decision).
 
And let me quote Alex (the developer) on how the read striping feature works:
 

Our read striping algorithm is actually a number of different algorithms that are automatically selected based on the situation. However, they generally do not increase performance by a factor of how many duplicated files you have. They are generally good at providing a smooth media streaming experience and balancing the performance of disks with different performance characteristics.
 
For example:

  • For large asynchronous (i.e. multi-threaded) file transfers via SMB, read striping attempts to mimic a RAID-like striping algorithm, sending read requests in fixed-size blocks to the disk with the least load.
  • For single-threaded transfers, media streaming applications, or random access, the read striping algorithm passively monitors the performance of all the disks that contain the given file part and selects the disk with the fastest response time. For example, if you're streaming a duplicated video file from the D: drive and all of a sudden something starts accessing the D: drive excessively, the read striping algorithm will switch to reading from another disk that is less busy.
Also I did write an informal blog post that demonstrates read striping in action:
http://blog.covecube.com/2013/05/stablebit-drivepool-2-0-0-256-beta-performance-ui
 
You can also see the user manual for some more information on it:
http://stablebit.com/Support/DrivePool/2.X/Manual?Section=Performance%20Options

 

Additionally, if you have disks connected via multiple different controllers (such as SATA and USB), in some cases, StableBit DrivePool will pick the disk on the faster bus, instead of reading from both parts. 
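To make the disk-selection idea above a bit more concrete, here is a minimal user-space sketch in Python. It is purely illustrative: the class name, the smoothing factor, and the drive letters are all invented, and the real logic lives in DrivePool's kernel driver, not in anything like this.

```python
import random


class ReadStripeSelector:
    """Toy model of 'read the duplicated file from the disk with the fastest
    recent response time'. Illustrative only; not DrivePool's driver logic."""

    def __init__(self, disks, smoothing=0.3):
        # Exponential moving average of observed read latency per disk (seconds).
        self.latency = {disk: 0.0 for disk in disks}
        self.smoothing = smoothing

    def record(self, disk, seconds):
        # Blend the newest observation into the running average for that disk.
        old = self.latency[disk]
        self.latency[disk] = (1 - self.smoothing) * old + self.smoothing * seconds

    def pick(self, disks_with_copy):
        # Of the disks holding a copy of the file, choose the one that has
        # been responding fastest recently.
        return min(disks_with_copy, key=lambda disk: self.latency[disk])


if __name__ == "__main__":
    selector = ReadStripeSelector(["C:", "D:", "E:"])
    # Simulate reads where D: becomes busy, so selection drifts away from it.
    for _ in range(20):
        for disk in ("C:", "D:", "E:"):
            busy_penalty = 0.05 if disk == "D:" else 0.0
            selector.record(disk, random.uniform(0.001, 0.010) + busy_penalty)
    print("read duplicated file from:", selector.pick(["D:", "E:"]))
```

The only point is that the disk serving a duplicated file can change as observed response times change, which matches the streaming example quoted above.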


Yes, I had played with the Read Striping option, but it doesn't make a difference in my case since all drives use the same (SATA) interface. Tweaking the logic that selects which duplicate to show isn't what I'm after anyway, unfortunately. Regardless of which copy DP shows, as far as I can see it does not indicate in any way that it has encountered multiple copies (or just identically named ones, which could differ in size/date), which I was kind of expecting, maybe not on the DP disk itself but in the program UI. Is there a way to expose the Read Striping decision DP is making, with a log or plugin or something?


Which file is used is determined by the driver, as files with the same name in the same folder are how we generate the duplicate files.

 

So, it's not going to generate a notice that it's detected multiple files in the same location. However, when accessing the file, or during a duplication pass, it will check the file modified date and, if the dates differ, run a checksum of the files. If the files are found to be different, this will be flagged and you'll be prompted in the UI (a file mismatch error, specifically).
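As a rough user-space illustration of the check described above (compare modified dates first, and only hash the contents when the dates differ), here is a short Python sketch. The function name and the paths are made up; this is not DrivePool's internal code.

```python
import hashlib
import os


def copies_mismatch(path_a, path_b, chunk_size=1 << 20):
    """Mirror of the described check: if the modified dates match, treat the
    copies as matching; otherwise compare content checksums and report a
    mismatch only when the contents actually differ. Illustrative only."""
    if int(os.path.getmtime(path_a)) == int(os.path.getmtime(path_b)):
        return False  # same modified date: no checksum needed

    def sha256_of(path):
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    return sha256_of(path_a) != sha256_of(path_b)


# Hypothetical PoolPart paths, just to show the shape of a call:
# if copies_mismatch(r"D:\PoolPart.xxxx\Media\world.txt",
#                    r"E:\PoolPart.yyyy\Media\world.txt"):
#     print("file mismatch -- this is what would be flagged in the UI")
```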

 

Aside from that, there is no notification of which file it selects (except in the performance UI, but that's not detailed information, as it's very transient).

 

However, as for adding a log of which file was selected and why, that may be doable, and I've flagged the request for Alex (the developer):

https://stablebit.com/Admin/IssueAnalysis/17750


Checksumming during duplication (for those who use it) is helpful functionality, but not directly relevant in this case. The files may not be identical from a checksum view; only the name may match, with size/date/etc. differing, as mentioned. The issue isn't about avoiding wasted space in data duplication, it is more basic: making identically named files show up (or, at the very least, being notified that they are being excluded).



To be blunt here, this isn't going to happen.  

 

Specifically, it's because of the way we store duplicated files: they are stored in identical folder structures under the PoolPart.xxxx folders, with identical file names.

 

And since that's how we store duplicated files, we only want to show the file once, as it should be identical (and we do a number of checks to try and ensure that, as previously mentioned). 

 

Additionally, we don't differentiate between the "original" and the "copy", which is both part of why we store them the way we do and a byproduct of it.

When two different files are detected in the same location, they're considered duplicates and we'll flag them because they *are* duplicated files with mismatched contents.
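To make the "shown once" behaviour concrete, here is a hypothetical user-space sketch of how files with the same relative path under each hidden PoolPart folder collapse into a single pool entry. The folder names and the function are invented for illustration; the real merging is done by the kernel driver, not by code like this.

```python
import os
from collections import defaultdict


def merge_pool_parts(pool_part_roots):
    """Group files from each PoolPart folder by their path relative to that
    folder. Each relative path appears once in the merged view, no matter how
    many drives hold a copy. Illustrative sketch only."""
    pool_view = defaultdict(list)  # relative path -> physical copies
    for root in pool_part_roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full_path = os.path.join(dirpath, name)
                relative = os.path.relpath(full_path, root)
                pool_view[relative].append(full_path)
    return pool_view


# With the drives from the original question, "world.txt" would map to two
# physical copies but still appear as a single entry in the merged view:
# view = merge_pool_parts([r"A:\PoolPart.aaaa", r"B:\PoolPart.bbbb"])
```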

 

 

Additionally, the Read Striping feature determines which file to actually read from, or reads from both at the same time. 

 

 

This is how StableBit DrivePool works, and it's not going to change. Not without a HUGE overhaul to the program.  

 

 

Additionally, I believe that Explorer (and other programs) would error out badly if we showed two identically named files in the same folder; Windows just doesn't like that. So displaying the two files in the same folder would potentially cause a lot of other issues.

To summarize, identically named files in identical folder structures are not ignored. They're duplicates, or will be treated as such by our software, and as such they will only be displayed once, as they should be identical. That's part of the core design of the product.

I apologize if I'm coming off as harsh/mean/aggressive/etc.  I don't mean to, and I'm actively trying to avoid it.  


Harsh or mean, what are you talking about? We're discussing/debating!

 

I wasn't expecting DP to break the rules of the universe and show multiple identically named files in the pool. DP showing duplicate copies (as in those DP mirrored as backup) once in the pool makes sense, of course. Not distinguishing original vs. copy is acceptable, although some users want that feature and alternative products offer it.

But DP's current logic/definition of a "duplicate" is a little too simplistic, I think you might agree, particularly for those of us who don't use DP's duplication functionality to begin with. I can see what DP is trying to avoid: if there already is a spare copy (identical, with a checksum match) fulfilling the backup requirement, there is no need for DP to create yet more copies. But, and it's a big but, "when two different files are detected in the same location, they're considered duplicates and we'll flag them because they *are* duplicated files with mismatched contents" is only one possible way to interpret this scenario. There are many valid pooling scenarios where same-named files across different source drives are not just "duplicate files with mismatched contents". This may be the prevailing outcome/desire in a home-user media pooling scenario, but it falls apart in the real, chaotic world.


I don't see how there can be many pooling scenarios where same-named files across different source drives are possible. The only cause I can think of is a user copying a file to, say, drive B while such a file already existed in the Pool (and thus on drive B as well). That would be rather stupid and would have to be deliberate (unhiding the PoolPart folders, then going down the folder tree and copying to the "right" location). And even then, DP would signal this and let the user sort out the mess he made himself.

 

Whether duplication is on or off in this case is irrelevant. If it is on, then the alert would come up. If it is off, then DP would want to delete one of the two copies, not be able to choose which one to delete, and raise an alert as well.

 

But I could be wrong. Paint me a scenario where, using DP, you could end up with a situation like this.


Harsh or mean, what are you talking about? We're discussing/debating!

I just wanted to make sure, as the tone of what I was posting could definitely come off that way.  I generally try to go out of my way to avoid being harsh or the like, but I just wanted to emphasize that. 

 

 

And yes, DrivePool's handling of duplication and duplicate files like this is very simplistic. This is by design, as I believe that doing so significantly reduces the CPU time and memory overhead for the kernel driver when accessing duplicate files and using the Read Striping and Real Time Duplication features.

 

As for adding disk indicators, any modification like that would definitely increase the driver's overhead, and could make handling the files significantly more complex.

 

 

As for mismatched contents, I believe we default to accessing the newer file in this case (I'll have to ask Alex, the developer, about this specifically, as I don't quite remember).

However, the file is then immediately marked as a "problem file", which creates a "duplication conflict" that pops up in the UI and generates a notification. The UI wants you to resolve the issue by overwriting the older file.
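Since the post above is itself unsure whether the newer copy wins, treat the following as nothing more than a sketch of that assumption: when copies with mismatched contents are found, serve the most recently modified one and leave the set flagged as a duplication conflict for the user to resolve. The function and paths are hypothetical.

```python
import os


def pick_copy_for_mismatch(copies):
    """Sketch of the assumed behaviour only: for copies already known to have
    mismatched contents, return the most recently modified one to serve, and
    signal that a duplication conflict should be raised in the UI."""
    newest = max(copies, key=os.path.getmtime)
    raise_conflict = True  # mismatched contents always warrant a conflict
    return newest, raise_conflict


# copies = [r"D:\PoolPart.xxxx\docs\report.txt", r"E:\PoolPart.yyyy\docs\report.txt"]
# path_to_serve, needs_resolution = pick_copy_for_mismatch(copies)
```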

 

 

 

Also, we generally recommend avoiding accessing the PoolPart.xxxxx folders directly. In fact, this is why StableBit DrivePool creates these folders as hidden by default: to protect them from modification by normal users.

Additionally, based on your other post (sorry for missing the connection), it sounds like you want to move the contents from one pool to another. If that is the case, there are a few ways to do this; let me know, as it's a bit more technical (and I may be a bit tired at the moment).


It's cool, I get it. Maybe adding an option to manually trigger the "problem file" check/notification in the UI (what is run automatically in the background as part of DP duplication) could be helpful to some. If it doesn't fit the DP philosophy, that's cool; I'll do a workaround anyway before your devs get to it.

 

I'm coming over to DP from alternatives that had more features (snapshot as well as real-time, parity as well as mirror, etc.), but KISS has benefits and I'm OK with DP being very simplistic. I had attributed the significant increase in price (a friend bought it for less than half what I paid last year) to more features like snapshot, parity, etc. being developed, but that was not a correct assumption.

 

You are correct, and to umfriend's confusion, I am using DP somewhat backwards (not as intended, apparently) in that the pool unifies drives from several networked machines, not just one, with files being manipulated directly in the PoolPart.xxxxx folders of the local machines.


Well, there are a number of things that will trigger the duplication pass.  

 

  • Disabling real-time duplication will; specifically, in this case, it will run a pass at 2am every day.
  • Changing the duplication status on any folder will trigger a pass.
  • Remeasuring the pool will trigger a pass.

 

 

And yeah, coming over from stuff like SnapRAID can be very ... different.

 

As Umfriend has indicated, we generally recommend accessing the Pool contents through the drive that DrivePool presents. While you can absolutely access them via the PoolPart.xxxxx folders, we do recommend against that.

 

 

And by the "you are correct" comment, do you mean that you're trying to merge the pools together?
