
Server 2012 R2 Deduplication and DrivePool


Kayos

Question

This is Kayot; due to some weirdness I can't seem to log into my real account, and the recovery email won't show up either (I checked spam). That aside:

 

Does Server 2012 R2 Deduplication break DrivePool?

 

I know it has to be done per drive, which is fine since I use SnapRAID. I was reading in posts from May on the old forum that DrivePool couldn't access deduplicated files. Since I'm about to dedup my archive, I wanted to make sure I wasn't about to nuke DrivePool.

Recommended Posts

I'm not actually sure about that.

 

IIRC, given how deduplication works, yes, it may cause issues with DrivePool, or at the very least negate the point of duplication. It basically hard-links files so there is only one copy of each file: the exact opposite of what Duplication does in DrivePool.


Deduplication works at the block level. Basically, if block 55 in file X is the same as block 22 in file Y, the system points file Y's block 22 at file X's block 55 and sparses the redundant copy out of file Y. This is of course a gross simplification, as the block in question no longer belongs to either file; both files get relinked and the spare copy is sparsed out. The TOC, FAT, MFT, or whatever you want to call it, lists each block location per file. It only works within the same partition. That's why deduplication doesn't work with ReFS: ReFS is incapable of per-block listing, because it keeps files unfragmented and records each file as a starting block plus an extent. Only NTFS supports deduplication in Windows Server 2012/R2. As for using DrivePool, it's not pointless, because duplication in DrivePool is performed by making sure a file exists on multiple partitions, while deduplication can only work per drive. Targeting the pool drive itself will definitely throw up an error, as deduplication requires direct partition access.
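
If it helps, here's a toy PowerShell sketch of the chunk-hashing idea (purely illustrative; the real Windows dedup engine uses variable-size chunks and a chunk store, and the path is just a stand-in):

    # Hash a file in fixed 4096-byte chunks and count how many are unique.
    # Duplicate chunks are what a dedup engine would store only once.
    $chunkSize = 4096
    $sha    = [System.Security.Cryptography.SHA256]::Create()
    $seen   = @{}
    $total  = 0
    $stream = [System.IO.File]::OpenRead('D:\Archive\example.bin')   # stand-in path
    $buffer = New-Object byte[] $chunkSize
    while (($read = $stream.Read($buffer, 0, $chunkSize)) -gt 0) {
        $total++
        $digest = [System.BitConverter]::ToString($sha.ComputeHash($buffer, 0, $read))
        $seen[$digest] = $true
    }
    $stream.Close()
    "Chunks: $total   Unique: $($seen.Count)   Duplicate: $($total - $seen.Count)"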

 

What I want to know is: will it kill file access to deduplicated files, such as files X and Y above? Has anyone tested it?

 

If not, I'll give it a try on a subdirectory filled with obvious dupes and see what happens.


Okay, I wasn't entirely sure about that.

 

I do remember somebody using the built-in deduplication feature of 2012 and it causing some serious issues, but I haven't seen it mentioned since.

I guess I know what I'm doing this week. Good thing I have testing VMs, right?

Though I suspect that it will still cause issues with DrivePool. But we shall see.


I went ahead and did a test.

 

It's still a problem.

 

I can't access deduped files from the pool but I can access them from their respective drives.

 

Also, they report some odd file sizes in File Properties, but that seems universal. I'm guessing it's a side effect of the deduping process. It still seems untidy of Microsoft, from my perspective.


I was afraid of that. :(

 

I was going to test that out myself today (on my to-do list actually), but it seems that you've beaten me to that. 

 

And the word I would use is "sloppy". But it's a relatively new feature, and naturally bound to have issues. I mean, look at Drive Extender in v1, and at Storage Spaces: both have had a LOT of reported issues. Though, from the other side, it is sometimes hard to test every possible configuration... But something like that file-properties issue should have been fixed right away.


I won't touch Storage Spaces or any Windows RAID-like solution after almost losing a 10TB archive when Windows "forgot" that my drives were in the array, threw up an error stating that my RAID was unrecoverable, and then "fixed" the main record to report as much. I had to bring it back from a backup that I no longer have, and that wasn't an isolated case either. I like DrivePool and SnapRAID because if the unthinkable happens and I lose a drive or two, I can recover it or them with SnapRAID (it happened once with a single drive, but that drive was on its way out anyway), and DrivePool puts them into a single drive that I can easily organize and share over a network.

 

I hope DrivePool figures this one out. Deduplication has the potential to save a ton of space, though if it works too well it may pose a problem with SnapRAID: deduplication on a once-full drive could spell disaster for restoration, since I could potentially end up with 3TB of data on a 2TB drive. For now I'll simply compress my drives. I'm nowhere near running out of space, but it's nice to maintain as much free space as possible. Plus, compressed data with a fast processor actually speeds up the read rate; this I have confirmed as well.
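
For reference, that's just NTFS compression switched on from the command line (the path is only an example):

    # Turn on NTFS compression for everything under the folder, continuing past any errors:
    compact /c /s:"D:\Archive" /i
    # Re-running without /c just lists the files and their compression ratios:
    compact /s:"D:\Archive"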


Actually, Alex and I were just talking about this. And funny enough, disk compression came up. :)

 

As for the de-duplication, you hit on the biggest issue. If a drive fails, you could lose a good chunk of data. Which is the opposite point of Duplication in StableBit DrivePool.

 

However, there is an Addin for "WHS2011" that you may want to look into: Snoop-de-Dupe (http://www.snoopdedupe.com/). It specifically looks for duplicate files and lets you decide how to handle them. Sounds like something you may be interested in.

 

 

As for figuring it out, well, Alex already has a few very challenging things on his plate for DrivePool. But this may be on his to-do list already.

http://community.covecube.com/index.php?/topic/252-future-developments/


Sorry for the delay. I actually made a piece of software that does the above. The problem was that Windows shares had a rough time with hardlinked files. Also, renaming files was impossible in the case of files linked from torrent downloads. I just went ahead and removed my duplicates using SnapRAID's duplicate check and a program that I made to help make the process really fast. It's unpolished, but it works. The beautiful thing about deduplication is that two files that aren't duplicates but have duplicate parts still get a file size reduction.
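
For anyone curious, a rough PowerShell equivalent of plain file-level duplicate finding looks something like this (example path, and nowhere near as fast as leaning on SnapRAID's existing hashes):

    # Group files by SHA-256 and print the paths of anything that appears more than once.
    Get-ChildItem 'D:\Archive' -Recurse -File |
        Get-FileHash -Algorithm SHA256 |
        Group-Object Hash |
        Where-Object { $_.Count -gt 1 } |
        ForEach-Object { $_.Group.Path; '' }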

 

This is apparent with MP3s where a tag is changed but the music data is the same. A file-based dupe checker sees two different files, but deduplication sees maybe one 4096-byte chunk of different data and the rest is the same. I hope this gets added to DrivePool. My laptop is running a Storage Space with deduplication, since I can afford to lose that data thanks to having a backup. It's saving about 30% in space, which is amazing considering it's mostly cartoons (anime) and pictures (manga).

 

One thing studying SnapRAID has done for me is give me a better understanding of the underlying architecture of the file system. I now know that dedup is done by hashing each chunk of a file and tracking it by inode, meaning that a file's location and name can change without renewing the hash or resetting the dedup. It's genius, and it has opened my mind to a host of possibilities regarding software I can write.


30% makes sense, especially when you have a lot of "white space" in some of those files. :)

 

And I can definitely understand wanting to maximize your space. But personally, I'd rather get a couple of 4TB drives, throw them in, and duplicate. That way I know I'm set if a drive does fail.
 

Also, have you seen HGST's helium-filled drives? 6TB. ;)


I use SnapRAID with q-parity, so there's no need for duplication; the chances of losing two drives at once are astronomically small. I'm mainly trying to get my current storage to stretch farther. I'll definitely replace drives and whatnot as time goes by.

 

As for those helium drives, the price tag is a real put-off. Let's face it, $350 for 4TB is way too much. I know it'll come down. I have a feeling the next-gen storage wars will be between solid state and this helium tech; for now they're the playthings of the wealthy and of hobbyists. The fact that this discussion is on a forum for pooling a bunch of small disks into one big one speaks volumes about our current approach: it's cheaper to use a port multiplier and a few add-on boards and just add a bunch of drives than to simply buy a bigger disk.

 

What I want to know is why software like DrivePool is so niche. Why do so many people risk all their data in software RAID 5s and 6s, when losing two or three drives (or having Windows/Linux think one failed, which is endless hell) kills all the data? Without a dedicated XOR chip, those solutions aren't giving you anything that this solution won't. It boggles the mind.


You're absolutely right about the "play things". The price will be astronomical for a good while. But it's nice to see that density. :)

But as for drive pooling... my smallest drive in the pool is a 3TB drive. So... it depends on what you have and what you want, I guess. :)

 

As for "niche"/"nitch", from what I've seen, there are two camps here (and this pretty much applies to all things, but especially tech). There are the people that don't know any better. And there are the people that refuse to pay for the hardware/software to pool/raid/etc their data properly.  I would say that their is a third group that spreads misinformation, but that's a combination of the two, really.

I mean, I've seen all sorts of solutions touted as the best thing since sliced bread. Until something goes wrong. And then you've lost everything. Heck, I've seen that reported with hardware RAID5. (cascading disk failure). 

And Alex has spent a lot of time trying to make recovery as simple and as painless as possible (but there is still a failed disk, so there's always some pain), and we hope that it shows: between DrivePool pooling and duplicating the data, and Scanner checking and maintaining the health of your drives (and the two products working together to help prevent data loss). Well, I recommended the heck out of both products before Alex hired me. :)

 

But yeah, I cringe when I see stories of people losing all their data. I've been there, and done that. And it's not fun. :(


I couldn't get spell check to give me the right "niche" spelling.

 

I decided a long time ago that if I ever lost all my data, I'd never build another archive again. Last time, when using a Windows RAID, I thought I had lost it all, and I didn't even feel too bad. I've recently been learning that age-old expression, "The things you possess soon come to possess you." So I don't see it as the worst possible thing, though I'd totally eBay all the drives and hardware; cash is always useful. eBay does take 10% of all sales, on top of PayPal's 3% + $0.30, and eBay badly underestimates shipping fees, which means I take a hit on shipping costs but dodge part of that 10%. I've considered Craigslist, though I've heard bad things about it.

 

I was reading that people honestly believe Drive Extender was replaced by Storage Spaces. That's complete bunk. If I lose a drive with Drive Extender, I still have the data on the remaining drives; Storage Spaces nukes it all. How is that a replacement? It's just another name for fakeRAID, and it's proprietary to Windows 8+. Missing a drive? It's now in cycle hell. DrivePool is the real successor to Drive Extender.

 

Anyway, this topic is now derailed so I guess that's enough. I hope deduplication gets added in the next two years, but if not, oh well. It was nice chatting with you.


lol. Actually, if you look up the definitions, niche and nitch are the same thing. So it is a matter of preference. I just prefer niche. 

 

 

I've never relied on Windows Dynamic disks (software RAID). I've messed with them before and been burned, so I've avoided them.

 

And yeah, eBay/PayPal takes a chunk out. :(

Though, I've had some good success with Craigslist. But you have to look at the site like a swap meet: not everything you buy is going to work, not everyone that contacts you is interested in buying, and there are scammers. I've bought a few things via Craigslist and have been very happy (my portable AC unit rocks, BTW; a 12k BTU unit for ~$200). Though, I've never tried selling.

 

 

As for Storage Spaces replacing Drive Extender... I can tell you for a fact that this idea is correct. I was an MS MVP for WHS for 4 years, after all. And that's about all the detail I can go into on that.

But as for it working like Drive Extender, I completely agree. They took a great idea, wrapped it up tightly with Dynamic disks, and gave it a new name. And yeah, it's not as resilient, nor anywhere near as good.

And yes, products like StableBit DrivePool are very much the spiritual successor to Drive Extender, and they do a MUCH better job than Storage Spaces. But then again, DrivePool fixes most (if not all) of the issues with Drive Extender, and adds a bunch more options too!

 

 

And yeah, derailed a bit. :) But that's fine, IMO.

And I can't speak for Alex, so I can't say if we will add deduplication (or soon). But Alex does have a list of features he does plan on adding. But one thing at a time, right?


I won't use Dynamic Disks because I can't salvage them in Linux. Same goes for Windows fakeRAID and Storage Spaces. I do most of my partition work in Mint, as Windows seems to have a conniption fit with even the most basic partition work on non-dynamic drives. Their excuse? Oh, there's third-party software for that. Yeah, it's called GParted... d*cks.

 

I had a disk fail and Windows refused to mount it in any way, shape, or form. Linux Mint mounted it just fine, though it did suggest that I reformat it, and later that I should toss it. I got everything off of it and it all SHA-512'd fine. If I were stuck with Windows only, I would have lost it all.

 

When someone on a forum mentions Dynamic Disks and Windows fakeRAID, I cringe. On Linux forums it's LVM (Logical Volume Management) and mdadm, which Microsoft stole and now calls Storage Spaces. RAID is NOT a backup, people! SnapRAID isn't a backup either, but it's much closer when kept up to date. RAID was designed to keep things running while the drives repopulate; SnapRAID restores missing drives, during which the missing data is inaccessible, meaning SnapRAID isn't really RAID at all. SnapRAID is for archives; RAID is for production data such as databases and massive web servers, which need uptime. Downtime is money.

 

That said, DrivePool with SnapRAID (and a once-a-month parity check) is a win. Deduplication is too new for my tastes. I've been playing with it for the last two weeks: I can't get a correct file size in Properties, and the only way to know what I've saved in space is to use PowerShell. It's a huge space saver (even compared to disk compression at its best) and definitely worth most of the trouble, but I think I'll wait until Microsoft figures it out. Besides, I can run deduplication in Windows 8.1 without a problem; may as well use a desktop OS on this machine.
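
For reference, these are the sorts of cmdlets I mean. They come with the Data Deduplication role and are run against the underlying volume (D: here is just an example), never against the pool drive:

    # Space actually saved on the volume and the overall savings rate:
    Get-DedupVolume -Volume D: | Format-List Volume, SavedSpace, SavingsRate
    # How many files are optimized and when the last optimization job ran:
    Get-DedupStatus -Volume D: | Format-List Volume, OptimizedFilesCount, InPolicyFilesCount, LastOptimizationTime
    # Estimate how much space a folder really occupies after dedup (slow on big trees):
    Measure-DedupFileMetadata -Path 'D:\Archive'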


Dynamic disks are just trouble in general. I've lost a decent amount of data to them, a long, long time ago, and I just don't see them as reliable.

 

Though I can't really comment on the partition stuff; I mainly only use Windows-based tools. As for the drive not showing up, I've seen similar, at least where it showed up as RAW or uninitialized. Had to use data recovery, but... Windows-based. :)

 

 

And I'm not getting into any argument involving Linux. I know better.

That said, LVM isn't Linux tech; it seems IBM actually owns the patent on the technology. So... yeah. And mdadm is dated 2001, whereas Dynamic disks were apparently introduced in Windows 2000 (released in its namesake's year).
I also can't comment on the thinking behind Storage Spaces, but I suspect it was developed as a way to FULLY support all the file system utilities *and* implement a DE-type feature. Not exactly successfully, but still.

 

And no, RAID definitely isn't a backup, and it's depressing that people think of it as such. It's redundancy, and that is also what DrivePool's duplication is. It helps protect you against drive failure, not against... well, nine times out of ten, user error (be it ID-10-T or accidental).

 

 

 

And for the most part, I think we pretty much are in agreement, here.

 

Though, I'm not a fan of parity. It may save a bit of space, but it requires dedicated hardware to do right, and takes a LOT of time to use for recovery. Mirroring is better for uptime.

And I'm also not a fan of deduplication. The reason is this: WHS' client backup. Deduplication saves on space, and can be a very good thing. But it makes your data much more vulnerable. All you need is one bad sector in the right (or wrong) spot, and you lose a LOT of information. Especially if you don't have a backup. And Alex has a fun story about that too (one that I've also experienced, though he found a "fix"). I'd rather spend money on more HDDs and HBA cards, and mirror/duplicate everything.

 

Though, deduplication on a desktop OS isn't a bad idea at all. Especially if you have a good, reliable backup. :)


Actually, Alex and I were just talking about this. And funny enough, disk compression came up. :)

 

As for the de-duplication, you hit on the biggest issue. If a drive fails, you could lose a good chunk of data. Which is the opposite point of Duplication in StableBit DrivePool.

 

However, there is an Addin for "WHS2011" that you may want to look into: Snoop-de-Dupe (http://www.snoopdedupe.com/). It specifically looks for duplicate files and lets you decide how to handle them. Sounds like something you may be interested in.

 

 

As for figuring it out, well, Alex already has a few very challenging things on his plate for DrivePool. But this may be on his to-do list already.

http://community.covecube.com/index.php?/topic/252-future-developments/

 

I think you're missing how the De-Duplication process/method is supposed to work.

 

De-Duplication should be processed directly on the pool drive itself, and not on the independent drives that make up the pool.

DrivePool's operations to the individual disk should be completely transparent to any actions done to the Pooled disk.

 

De-duplication should be performed on the mounted virtual DrivePool disk; it does this, I believe, by some sort of hard links or junctions to files' data blocks.

 

But as for your issue with folder sizes not showing properly, this link will explain to you why that is: http://social.technet.microsoft.com/Forums/en-US/fa6f6329-f710-4c5e-9538-942045df68a3/windows-server-2012-with-deduplication-show-wrong-size-on-disk?forum=winserver8gen

 

Deduplicated data is stored in the "System Volume Information" folder, and since that folder is not displayed, you won't know the actual size on disk. Hard links are created to the files that were moved so they can still be accessed.
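
You can actually see this on an optimized file with fsutil (example path; if memory serves, the dedup reparse tag shows as 0x80000013):

    fsutil reparsepoint query "D:\Share\bigfile.vhd"
    # An optimized file reports: Reparse Tag Value : 0x80000013 (IO_REPARSE_TAG_DEDUP)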

 

If DrivePool allowed for this folder, and it was handled by the virtual disk driver you use to manage which disks the actual data gets moved to, then DeDuplication could be done. But right now, when you try to enable DeDuplication on the DrivePool disk, it gives a strange error: https://dl.dropboxusercontent.com/u/48061/DrivePool/UnabletoDedupeDrivePool.png

 

It seems, however, that the virtual disk driver that is used does not allow proper access to it. I have no clue as to why that is, since I didn't make the DrivePool software. Also, what is the deal with Volume Shadow Copy not being supported on a drive pool? Is it possible that the driver that creates the DrivePool virtual disk has major incompatibilities that prevent it from doing that, and from writing to the "System Volume Information" directory?

 

Anyone please let me know your thoughts.


It's not that I don't understand it, it's that there are a number of ways that it can be done. And since I haven't really paid close attention to it (as I'm not really a fan), I wasn't sure how Microsoft implemented it.

 

 

As for DrivePool, the linked article implies that it uses NTFS Reparse points for the files. DrivePool doesn't support that right now. But Alex is working hard to add support for them. Once that's added, DeDup *may* work. If it doesn't, let us know and Alex will take a look at it.


Regarding Linux arguments, I couldn't agree more. To me an OS is a means to an end. Then there's Abercrombie... err, Apple.

The thing about deduplication is that, unless DP became very intelligent about its file handling, it's very unlikely DP could manage proper file placement for dedup to work. Dedup is a per-volume attribute. DP is an overlay, very akin to UnionFS or AuFS in concept, utilizing the .NET libraries. It would be like using AuFS with ZFS (dedup enabled). The difference is that DP isn't linking directly to the files and is instead working with the file allocation table. If what gregcaulder said is true, then the files are linking outside of the DP structure, on the same volume, into a folder that isn't available to the user.

The best solution would be a plug-in that keeps a per-cluster hash of all the files (on my system it's about 620MB in size), matches cluster SHA-256 hashes, and, when matched, compares the two chunks byte for byte. If matches are found, the files are relocated to the same drive so that deduplication can catch the matching clusters during its next optimization cycle. This assumes the service is fixed so that it can understand deduped volumes; I see the biggest problem being the way the service is written. UnionFS got around the issue by hijacking the file system so that the kernel handled all the data in its respective formats. It was possible to mix and match drive formats because all the software did was push the requests forward to the volumes. The reason DP can't do this is its duplication methods: such a plug-in would, by its very nature, spit in the face of the duplication data-preservation techniques. I use DP only for its pooling capabilities, so such a plug-in works fine for me.

I've never used a Windows backup utility. The reason I like SnapRAID so much is that it works on the files, not the drives. That can be a bit of a snag as well, since I could end up needing a 3TB drive for parity for a series of 2TB drives if one or more drives hold 2.8TB of data due to really good dedupping. Once a month (I do it on the 1st), after a sync of the files, I run 'snapraid check', and if anything comes back in the log, I fix it. A sync uses last-modified dates, so bit flipping and such gets flagged during the check because the hash won't match the damaged file. Barring some crazy astronomical phenomenon in which the same sector is bad on three disks, it can be easily fixed with a simple command. I can also undelete files with a blanket command that touches all the disks in the same location for the file (due to DP I have no idea which disk it's on) and restores it back to the last sync. I've been writing a front-end for SnapRAID for a while, since the one they had wasn't exactly user friendly and I needed something that could work with DP. I've only lost one drive, and SnapRAID brought it back onto a new volume. I was impressed. Granted, it took eight hours, but it restored 1.7TB of data that I would have lost otherwise. I added q-parity to make sure that I won't lose anything.
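
For anyone curious, the monthly routine boils down to a handful of commands (disk names come from my snapraid.conf, so treat these as illustrative):

    snapraid sync        # update parity and hashes to match the current contents of the drives
    snapraid check       # verify every file and the parity against the stored hashes
    snapraid fix -m      # restore anything missing or deleted since the last sync (the "undelete")
    snapraid fix -d d1   # rebuild a whole lost disk (named d1 in snapraid.conf) onto its replacement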

I really need to keep my posts shorter.


To be honest, the Windows Backup feature is primarily good for imaging the system. I mean, you can use it for backing up data... but it doesn't do a good job of that IMO. But I've used it to back up just the disks with the OS and system Roles, and have used it to restore more than a few times. And it does work great for that.

 

But SnapRAID looks like a good way to help ensure data integrity.


It's not that I don't understand it, it's that there are a number of ways that it can be done. And since I haven't really paid close attention to it (as I'm not really a fan), I wasn't sure how Microsoft implemented it.

 

 

As for DrivePool, the linked article implies that it uses NTFS Reparse points for the files. DrivePool doesn't support that right now. But Alex is working hard to add support for them. Once that's added, DeDup *may* work. If it doesn't, let us know and Alex will take a look at it.

 

Does this mean that Volume Shadow Copy will be able to work if/when reparse points are added to Drive Pool?

Not quite sure how VSS works exactly, but I know that it uses the "System Volume Information" folder.

That would be a huge plus if I could get VSS.


Does this mean that Volume Shadow Copy will be able to work if/when reparse points are added to Drive Pool?

Not quite sure how VSS works exactly, but I know that it uses the "System Volume Information" folder.

That would be a huge plus if I could get VSS.

I'm not sure. I'm pretty sure it's more complicated than that.

But it is possible that we may be able to get that working in the future. At least for Previous Versions and the like. It's on Alex's to-do list.


Deduplication will work on individual drives.  You just need to disable the "Bypass file system filters" option to get it working.

 

As for deduplication on the pool directly? This isn't a priority, unfortunately.

Adding VSS support isn't going to be remotely easy. Supporting reparse points was very difficult, and that's rather well documented. Adding VSS support is going to require completely reverse engineering the feature and then *hoping* it will work.

And that's assuming the deduplication doesn't rely on other file system features or undocumented features.


For those looking into this article in the future - yes this is possible, but you must configure DrivePool to use the NtfsFilters rather than bypassing them and going straight to disk.

 

From article:

http://community.covecube.com/index.php?/topic/962-server-2012-r2-data-deduplication-w-drivepool/

 

However, you will need to make one change first to ensure that it works properly:
Set "CoveFs_BypassNtfsFilters" to "false", and reboot the system.
 
I am currently receiving up to 70% savings from deduplication, and have 3 drives in my pool.
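
For what it's worth, the server-side part is just the standard dedup setup run against each underlying NTFS volume rather than the pool letter. Roughly (E: standing in for one of my pool member drives):

    # One-time: install the role/feature
    Install-WindowsFeature -Name FS-Data-Deduplication
    # Enable dedup on a pool member volume (never on the pool drive itself)
    Enable-DedupVolume -Volume E:
    # Kick off an optimization pass now instead of waiting for the schedule
    Start-DedupJob -Volume E: -Type Optimization
    # Check progress and savings
    Get-DedupJob
    Get-DedupStatus -Volume E: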

 

For those looking into this article in the future - yes this is possible, but you must configure DrivePool to use the NtfsFilters rather than bypassing them and going straight to disk.

 

From article:

http://community.covecube.com/index.php?/topic/962-server-2012-r2-data-deduplication-w-drivepool/

 

However, you will need to make one change first to ensure that it works properly:
Set "CoveFs_BypassNtfsFilters" to "false", and reboot the system.
 
I am currently receiving up to 70% savings from deduplication, and have 3 drives in my pool.

 

 

 

Actually, in the public beta build (2.2.0.651), we changed this behavior.

 

Specifically, if we detect that the "Dedup" file system filter is installed, we disable the bypass for compatibility. And since the filter is not present until you install the Deduplication role, it's pretty much "ideal" behavior.

 

Additionally, we have added a per-pool option for this, so that you can disable it for some pools but not others. The config setting will override the default behavior, but the service will change the setting back at each boot.

 

You can see the specific changelog entries here:

* [Issue #13517] When the "dedup" file system filter is installed, "Bypass file system filters" is overridden and 
                 disabled. This allows pooled drives that are utilizing data deduplication to work correctly with 
                 StableBit DrivePool.
* [Issue #13517] Added a new "Bypass file system filters" per-pool option under Pool Options -> Performance.
                 This promotes the advanced setting "CoveFs_BypassNtfsFilters" into a UI option. The advanced setting 
                 still exists and serves as a default for new pools and pools that did not have the UI option previously.
                 See the tool tip for an explanation on what it does. This is now a real-time option that is enabled / 
                 disabled in the file system at the instant that you check it or uncheck it. At boot time the setting 
                 is preserved and loaded at the file system level before any system services are started. The setting is 
                 enabled by default.