Expand functionality of file protection / duplication feature


one-liner

Question

Long-time user of DrivePool. Currently using file duplication for specific folders.

Scenario: data gets corrupted on one of the disks, through either an uncaught physical sector failure or bit rot.

  1. Is DrivePool aware of the duplicated data inconsistency?
  2. What is the current DrivePool strategy when ejecting a potentially failing drive for duplicated data - does the duplication continue on the rest of the disks in the pool, even if they were not selected by the user for duplication?
  3. Is the team currently working on expanding the file protection functionality? Specifically:
    1. An option in the file placement strategy that favors the newest disks in the pool with the most recent good SMART status and/or surface scan pass
    2. Ability to generate hashes of the duplicated data before the initial duplication (or regenerate them on user request or if they do not exist)
    3. Use the hashes to verify consistency of duplicated data across drives (see the sketch after this list)
    4. When inconsistency is found, copy over from the "good" drive
    5. Expand data resiliency with a mechanism like parity archives (PAR2) with options to:
      1. Choose the amount of parity data to generate, perhaps as a percentage of the actual data
      2. Choose location (disk) to store parity data - can be same drives as duplicated data or separate ones
      3. Data integrity checks of the PAR2 files (hashes - if this kind of thing is actually needed, not sure)
      4. Intelligent PAR2 cleanup when files are deleted through user action
      5. Intelligent data integrity check option: on user / program access of a duplicated file
      6. Scheduled data integrity check option: configurable, doing full hash checks of the duplicated data, presenting the user with information and options when inconsistencies are found (this is where the PAR2 data re-generation happens)
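
To make 3.2-3.4 concrete, here is a minimal sketch of the kind of verify-and-repair pass I have in mind (all names are hypothetical, and this is not how DrivePool works today):

```python
import hashlib
import shutil

def file_sha256(path: str, bufsize: int = 1 << 20) -> str:
    """Hash a file in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def verify_duplicates(copy_a: str, copy_b: str, stored_hash: str) -> str:
    """Compare two duplicates against a hash taken at duplication time;
    when exactly one copy still matches, repair the other from it."""
    ha, hb = file_sha256(copy_a), file_sha256(copy_b)
    if ha == stored_hash and hb == stored_hash:
        return "consistent"
    if ha == stored_hash:
        shutil.copy2(copy_a, copy_b)  # copy_b was corrupt
        return "repaired copy_b"
    if hb == stored_hash:
        shutil.copy2(copy_b, copy_a)  # copy_a was corrupt
        return "repaired copy_a"
    return "neither copy matches - fall back to parity (e.g. PAR2)"
```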

7 answers to this question


Re 1, DrivePool does not scrub duplicates at the content level, only by size and last modified date; it relies on the file system / hardware for content integrity. Some users make use of SnapRAID for content integrity checking.
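
To illustrate the distinction (a hypothetical sketch, not DrivePool's actual code), a size + last-modified comparison is cheap but blind to silent corruption:

```python
import os

def metadata_match(path_a: str, path_b: str) -> bool:
    """A size + last-modified comparison in the spirit of the above.
    Bit rot changes a file's contents without touching either
    attribute, so this check will still report the copies as equal."""
    a, b = os.stat(path_a), os.stat(path_b)
    return a.st_size == b.st_size and a.st_mtime == b.st_mtime
```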

Re 2, DrivePool attempts to fulfill any duplication requirements when evacuating a bad disk from the pool. It appears to override file placement rules to do so (which I feel is a good thing, YMMV). However, your wording prompted me to test the Drive Usage Limiter balancer (I don't use it) and I found that it overrides evacuation by the StableBit Scanner balancer even when the latter is set to a higher priority. @Christopher (Drashna)

Re 3, I'd also like to know *hint hint* :)


I stopped using the duplication feature of DrivePool and instead created a 2-parity backup with SnapRAID. This way I get not only "duplication" but also data integrity and parity assurance.

One parity file is placed on a disk outside the pool, the other on a disk inside the pool. Each parity file uses about as much space as the data it is backing up. If my research is correct, this setup would enable recovery from a two-disk failure, and also from a single-disk failure where the failed disk holds both the 2-parity file and data in scope of the backup.
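
For illustration, roughly how that layout looks in snapraid.conf terms (the paths are hypothetical; check the SnapRAID manual for exact option syntax):

```
# First-level parity on a disk outside the pool
parity D:\snapraid\snapraid.parity
# Second-level parity on a disk inside the pool
2-parity E:\snapraid\snapraid.2-parity

# Content files (SnapRAID's hash database) kept on several disks for redundancy
content D:\snapraid\snapraid.content
content E:\snapraid\snapraid.content

# Data disks holding the folders being protected
data d1 F:\
data d2 G:\
```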

I found the SnapRAID SMART status interesting: it attempts to indicate the probability of a disk failing within the next year based on the disk's SMART attributes and log, using data derived from Backblaze's detailed analysis of disk failures.

Suggestion: implement a similar feature for the Scanner + DrivePool combo, where the disks in a pool get a tag/rating based on a similar failure probability. That rating could drive a mechanism that prioritizes specific disks for data deemed "the most important" - data chosen for duplication, or simply marked as such by the user. The result: during balancing, the duplicated (or "most important") data would be preferentially placed on the healthiest disks.
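
Something like this toy scoring function, just to make the idea concrete (the attribute names and weights are invented for illustration; they are not what Scanner or SnapRAID actually compute):

```python
def disk_health_score(smart: dict, age_years: float, surface_scan_ok: bool) -> float:
    """Toy health score in [0, 1]; higher means healthier."""
    score = 1.0
    # Reallocated and pending sectors are among the strongest failure
    # predictors in Backblaze's published drive statistics.
    score -= 0.3 * min(smart.get("reallocated_sectors", 0), 10) / 10
    score -= 0.3 * min(smart.get("pending_sectors", 0), 10) / 10
    if not surface_scan_ok:
        score -= 0.2
    # Bathtub curve: infant mortality and old age both raise risk.
    if age_years < 0.25 or age_years > 5.0:
        score -= 0.1
    return max(score, 0.0)
```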

Another suggestion: the user can designate specific folders to be placed on certain drives via the "file placement" option, but after my SnapRAID setup I noticed that the data inside these folders still gets moved around among the selected disks. It would be nice to have an option to "freeze" these folders in place after an initial balancing, and only move them automatically if the SMART failure probability metric changes for disks in the pool. I can imagine a plugin that handles all of this configuration.

A final suggestion would be to attempt a data protection feature based on SnapRAID integration, where:

- disks outside the pool are considered in the protection scheme

- for n-parity, disks inside the pool could also be considered, prioritizing the disks marked as healthiest for placing the additional parity file(s) outside the pool's hidden directory -> they would show up as "other" data

1 hour ago, one-liner said:

One parity file is placed on a disk outside the pool, the other on a disk inside the pool. Each parity file uses about as much space as the data it is backing up.

As I understand it, SnapRAID requires that the usable space on each of its parity disks remain equal to or larger than the largest used space among its data disks, to avoid running out of room for parity calculations. How does that work with your setup?


That applies to a full disk-array backup, but I chose specific folders to back up in the SnapRAID config. So the parity file is only as large as those folders plus some overhead due to block size -> I chose the smallest possible here: 32k.
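
As a back-of-the-envelope illustration of that overhead (a hypothetical sketch; note SnapRAID sizes parity against the largest protected space per data disk, not the sum across disks):

```python
import math
import os

def estimated_parity_bytes(folder: str, blocksize: int = 32 * 1024) -> int:
    """Rough parity estimate for one data disk's protected folder.
    Each file is rounded up to whole parity blocks, so many small
    files mean more rounding loss; a small block size (32 KiB here)
    keeps that overhead down."""
    total = 0
    for dirpath, _, files in os.walk(folder):
        for name in files:
            size = os.path.getsize(os.path.join(dirpath, name))
            total += math.ceil(size / blocksize) * blocksize
    return total
```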

On 8/19/2024 at 6:48 PM, Shane said:

Re 2, DrivePool attempts to fulfill any duplication requirements when evacuating a bad disk from the pool. It appears to override file placement rules to do so (which I feel is a good thing, YMMV). However, your wording prompted me to test the Drive Usage Limiter balancer (I don't use it) and I found that it overrides evacuation by the StableBit Scanner balancer even when the latter is set to a higher priority. @Christopher (Drashna)

That's very odd. I'll take a look into it, as it should be easily reproducible.

 

Quote

What is the current DrivePool strategy when ejecting a potentially failing drive for duplicated data - does the duplication continue on the rest of the disks in the pool, even if they were not selected by the user for duplication?

I'm not sure what you mean here. Could you clarify?

That said, the removal options do have a "duplicate data later" option, which skips any duplicated data on the drive being removed and then runs a duplication pass afterwards to re-duplicate data as needed.

On 8/19/2024 at 1:10 PM, one-liner said:

Is the team currently working on expanding the file protection functionality? Specifically:

Not currently. 

Note that file placement is handled both by the service (during a duplication or balancing pass) and in the kernel. Any logic needs to be quick and as free of dependencies as possible (the recent CrowdStrike fiasco illustrates why this is important). And determining "newest" isn't necessarily simple; there are multiple criteria that could be considered. Also, since drives tend to fail along a bathtub curve, a new disk isn't necessarily the best choice.

As for generating hashes and storing them for verification: that is a very resource-intensive process, both CPU-wise and because it requires storing a database of all the hashes. And it would essentially have to be done at the kernel level.

That said, there are a lot of solutions for this sort of behavior already, and they will probably do it better than we could. As mentioned, SnapRAID can already do the majority of this sort of thing.

Quote

I'm not sure what you mean here. Could you clarify?

That said, the removal options do have a "duplicate data later" option, which skips any duplicated data on the drive being removed and then runs a duplication pass afterwards to re-duplicate data as needed.

Scenario:

1. Pool with 3 disks: 1, 2, 3

2. Folder chosen for 2x duplication only on disks 1 and 2, as specified by the user.

3. Disk 2 is failing, data is ejected from it.

What happens to the 2x duplication logic for the folder, since the user set it to duplicate on disk 1 and the now-failing disk 2? You mention a "duplication pass", but in this scenario the data would have to be duplicated on disk 3 - yet the user selected only disks 1 and 2.

 

Quote

Note that file placement is handled both by the service (during a duplication or balancing pass) and in the kernel. Any logic needs to be quick and as free of dependencies as possible.

Yes, that is understandable. My suggestion was for a separate process that does not kick in during normal syscalls accessing content on the disks. Balancing is a good analogue: it can be set up to run immediately or at specific times and intervals.

Quote

And determining "newest" isn't necessarily simple; there are multiple criteria that could be considered. Also, since drives tend to fail along a bathtub curve, a new disk isn't necessarily the best choice.

Yes, I provided a quick example to illustrate what I am after. My inspiration comes from SnapRAID's "smart" command, which I described above. Of course, that probability estimation is just that: an estimation of a probability. There is no perfect solution, but I think logic can be implemented to lower the chances of losing data considered "important" -> the data that the user chooses to duplicate (though not necessarily; some folders could be marked as "important" without duplication, and this could determine their disk placement through the same logic).

So: a combination of drive age (not too new), SMART status, surface scan results and the logic that SnapRAID uses could give each drive in the pool a health score. Another piece of software attempting this is HDSentinel. What I am suggesting is that the integration with StableBit Scanner should be better leveraged to construct a placement strategy that puts "important" data on the disks determined least likely to fail (which, I am aware, is not a guarantee, but it remains a user choice). This also implies that Scanner could be improved a bit -> look at what HDSentinel is doing.
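
To sketch what the placement side could look like given such scores (pure illustration, not an actual DrivePool balancer; names are made up):

```python
def rank_disks_for_important_data(disks: dict) -> list:
    """Order pool disks healthiest-first for placing 'important' data.
    'disks' maps a disk name to a health score such as the toy one
    sketched in my earlier reply."""
    return sorted(disks, key=disks.get, reverse=True)

# Hypothetical usage: put a duplicated folder on the two healthiest disks.
disks = {"disk1": 0.9, "disk2": 0.4, "disk3": 0.7}
targets = rank_disks_for_important_data(disks)[:2]  # ['disk1', 'disk3']
```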

Quote

As for generating hashes and storing them for verification: that is a very resource-intensive process, both CPU-wise and because it requires storing a database of all the hashes. And it would essentially have to be done at the kernel level.

Yes, as I mentioned above, this should be a separate, non-real-time process. I'm also not sure about the CPU usage: SnapRAID does this with almost no CPU usage (I see about 1% on mine for all operations).
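
For what it's worth, a userspace, non-real-time pass over a small hash database could look roughly like this (a minimal sketch with hypothetical names, not anything DrivePool actually does):

```python
import hashlib
import os
import sqlite3

def file_sha256(path: str, bufsize: int = 1 << 20) -> str:
    """Chunked hashing keeps memory use flat even for huge files."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def scrub(root: str, db_path: str = "hashes.db") -> None:
    """Record a hash per file on first sight; report files whose
    content no longer matches. A real tool would also track mtime
    to tell legitimate edits apart from silent corruption."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS hashes (path TEXT PRIMARY KEY, sha256 TEXT)")
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            digest = file_sha256(path)
            row = db.execute("SELECT sha256 FROM hashes WHERE path = ?", (path,)).fetchone()
            if row is None:
                db.execute("INSERT INTO hashes VALUES (?, ?)", (path, digest))
            elif row[0] != digest:
                print(f"inconsistency: {path}")
    db.commit()
```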

Quote

That said, there are a lot of solutions for this sort of behavior already, and they will probably do it better than we could. As mentioned, SnapRAID can already do the majority of this sort of thing.

Yes, as I mentioned in one of my replies above, I am already using SnapRAID and have removed all duplication in DrivePool, since SnapRAID gives me the same protection and additionally offers data integrity features to guard against bit rot and silent file system corruption, which can happen at any time without being caught.

 

The point of my post: I think we can all agree that DrivePool is quite mature in its current functionality, and I was hoping it would be expanded with new features. Summarizing everything:

  1. A disk health rating system (references: SnapRAID's "smart" command and HDSentinel) to be used for placing data marked as "important" by the user
  2. A data parity/integrity functionality for the "important" data
    1. new option, user selectable: generate hashes for the "important" data (also applies to duplicated data)
    2. new option, user selectable: schedule a data integrity check for the "important" data by comparing the hashes (also applies to duplicated data)
  3. Ideally, split data duplication in two user selectable options:
    1. "Online" or "real-time" data duplication -> how it is now
      1. New option, user selectable: before duplicating, check hashes (for the initial duplication and for new data arriving in target folders -> a deferred process, so duplication becomes "semi" real-time)
    2. "Offline" data duplication -> what SnapRAID does (even an integration with SnapRAID if possible)
      1. When this method is chosen, real-time duplication should be disabled for folders in scope; duplicated folders outside the scope can continue to use "online" / "real-time" duplication
      2. How parity/integrity works: the hashes are recalculated through the process of generating the parity files, only for the folders in scope for this duplication method
      3. Parity file placement logic (see the sketch after this list)
        1. disk outside the pool if available or another pool
        2. if no disk outside the pool is available to hold the parity file: place it on a disk inside the pool that does not already contain the data in scope, and/or move the data in scope so that none of it resides on the disk(s) where the parity file resides
      4. n-parity file placement logic
        1. if available: n disks outside the pool, or even other pools
        2. if partially available: a combination of disk(s) outside the pool AND (another pool altogether, if available, OR the healthiest disk(s) inside the current pool)
        3. if fully unavailable: follow the logic from 3.2.3.2 above, placing each n-parity file on a different disk and, if possible, on no disk that actually holds the data in scope (though with n-parity this could be permitted, since the whole point of n-parity is recovering from multiple disk failures)
      5. The "offline" duplication data will show up as "other" or "offline duplication" inside the pool
      6. NOTE: for all of the above, limits should be imposed based on the size of the data in scope, the available disks, disk health, etc.
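
To make the placement logic in 3.2.3 / 3.2.4 concrete, a minimal sketch (all structures hypothetical; a real implementation would presumably live in a balancer plugin):

```python
def choose_parity_disks(n_parity: int, outside_disks: list,
                        pool_disks: dict, data_disks: set) -> list:
    """Pick one disk per parity file.

    Preference order per the list above: disks outside the pool first,
    then the healthiest pool disks that do not hold the data in scope,
    and only then (for n-parity) pool disks that do.
    'pool_disks' maps disk name -> health score; names are made up.
    """
    clean = sorted((d for d in pool_disks if d not in data_disks),
                   key=pool_disks.get, reverse=True)
    overlap = sorted((d for d in pool_disks if d in data_disks),
                     key=pool_disks.get, reverse=True)
    candidates = list(outside_disks) + clean + overlap
    # Each parity file lands on a different disk by construction.
    return candidates[:n_parity]
```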