The topic of adding RAID-style parity to DrivePool was raised several times on the old forum. I've posted this FAQ because (1) this is a new forum and (2) a new user asked about adding folder-level parity, which - to mangle a phrase - is the same fish but a different kettle.
Since folks have varying levels of familiarity with parity I'm going to divide this post into three sections: (1) how parity works and the difference between parity and duplication, (2) the difference between drive-level and folder-level parity, (3) the TLDR conclusion for parity in DrivePool. I intend to update the post if anything changes or needs clarification (or if someone points out any mistakes I've made).
Disclaimer: I do not work for Covecube/Stablebit. These are my own comments. You don't know me from Jack. No warranty, express or implied, in this or any other universe.
Part 1: how parity works and the difference between parity and duplication
Duplication is fast. Every file gets simultaneously written to multiple disks, so as long as all of those disks don't die the file is still there, and by splitting reads amongst the copies you can load files faster. But to fully protect against a given number of disks dying, you need that many times number of disks. That doesn't just add up fast, it multiplies fast.
Parity relies on the ability to generate one or more "blocks" of a series of reversible checksums equal to the size of the largest protected "block" of content. If you want to protect three disks, each parity block requires its own disk as big as the biggest of those three disks. For every N parity blocks you have, any N data blocks can be recovered if they are corrupted or destroyed. Have twelve data disks and want to be protected against any two of them dying simultaneously? You'll only need two parity disks.
Sounds great, right? Right. But there are tradeoffs.
Whenever the content of any data block is altered, the corresponding checksums must be recalculated within the parity block, and if the content of any data block is corrupted or lost, the corresponding checksums must be combined with the remaining data blocks to rebuild the data. While duplication requires more disks, parity requires more time.
In a xN duplication system, you xN your disks, for each file it simultaneously writes the same data to N disks, but so long as p<=N disks die, where 'p' depends on which disks died, you replace the bad disk(s) and keep going - all of your data is immediately available. The drawback is the geometric increase in required disks and the risk of the wrong N disks dying simultaneously (e.g. if you have x2 duplication, if two disks die simultaneously and one happens to be a disk that was storing duplicates of the first disk's files, those are gone for good).
In a +N parity system, you add +N disks, for each file it writes the data to one disk and calculates the parity checksums which it then writes to N disks, but if any N disks die, you replace the bad disk(s) and wait while the computer recalculates and rebuilds the lost data - some of your data might still be available, but no data can be changed until it's finished (because parity needs to use the data on the good disks for its calculations).
(sidenote: "snapshot"-style parity systems attempt to reduce the time cost by risking a reduction in the amount of recoverable data; the more dynamic your content, the more you risk not being able to recover)
Part 2: the difference between drive-level and folder-level parity
Drive-level parity, aside from the math and the effort of writing the software, can be straightforward enough for the end user: you dedicate N drives to parity that are as big as the biggest drive in your data array. If this sounds good to you, some folks (e.g. fellow forum moderator Saitoh183) use DrivePool and the FlexRAID parity module together for this sort of thing. It apparently works very well.
(I'll note here that drive-level parity has two major implementation methods: striped and dedicated. In the dedicated method described above, parity and content are on separate disks, with the advantages of simplicity and readability and the disadvantage of increased wear on the parity disks and the risk that entails. In the striped method, each disk in the array contains both data and parity blocks; this spreads the wear evenly across all disks but makes the disks unreadable on other systems that don't have compatible parity software installed. There are ways to hybridise the two, but it's even more time and effort).
Folder-level parity is... more complicated. Your parity block has to be as big as the biggest folder in your data array. Move a file into that folder, and your folder is now bigger than your parity block - oops. This is a solvable problem, but 'solvable' does not mean 'easy', sort of how building a skyscraper is not the same as building a two-storey home. For what it's worth, FlexRAID's parity module is (at the time of my writing this post) $39.95 and that's drive-level parity.
Conclusion: the TLDR for parity in DrivePool
As I see it, DrivePool's "architectural imperative" is "elegant, friendly, reliable". This means not saddling the end-user with technical details or vast arrays of options. You pop in disks, tell the pool, done; a disk dies, swap it for a new one, tell the pool, done; a dead disk doesn't bring everything to a halt and size doesn't matter, done.
My impression (again, I don't speak for Covecube/Stablebit) is that parity appears to be in the category of "it would be nice to have for some users but practically it'd be a whole new subsystem, so unless a rabbit gets pulled out of a hat we're not going to see it any time soon and it might end up as a separate product even then (so that folks who just want pooling don't have to pay twice+ as much)".
Question
Shane
Link to comment
Share on other sites
1 answer to this question
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.