
Beware of DrivePool corruption / data leakage / file deletion / performance degradation scenarios Windows 10/11


MitchC

Question

To start: while I am new to DrivePool, I love its potential, and I own multiple licenses and their full suite.  If you only use DrivePool for basic archiving of large files, with simple applications accessing them for periodic reads, you are probably unlikely to hit these bugs.  This assumes you don't use any file synchronization / backup solutions.  Further, I don't know how many thousands (tens or hundreds?) of DrivePool users there are, but clearly many are not hitting these bugs, or are not recognizing that they are hitting them, so this is NOT some new destructive "my files are 100% going to die" issue.  Some of the reports I have seen on the forums, though, may actually be caused by these issues without being recognized as such.  As far as I know CoveCube was not previously aware of these issues, so tickets may not have even considered this possibility.

I started reporting these bugs to StableBit ~9 months ago, and informed them ~1 month ago that I would be putting this post together.  Please see the disclaimer below as well, as some of this is based on observations rather than known facts.

You are most likely to run into these bugs with applications that:

  • Synchronize or back up files, including cloud-mounted drives like OneDrive or Dropbox

  • Must handle large quantities of files or monitor them for changes, like coding applications (Visual Studio / VS Code)


Still, these bugs can cause silent file corruption, file misplacement, deleted files, performance degradation, data leakage (a file shared with someone externally could have its contents overwritten by any sensitive file on your computer), missed file changes, and potentially other issues for a small portion of users (I have had nearly all of these things occur).  They may also trigger some BSOD crashes; I had one such crash that is likely related.  Because some of these bugs present subtly, it may be hard to notice they are happening even if they are.  In addition, these issues can occur even without file mirroring and with files pinned to a specific drive.  I do have some potential workarounds/suggestions at the bottom.

More details are at the bottom but the important bug facts upfront:

  • Windows has a native file change notification API that uses overlapped IO calls.  This allows an application to listen for changes on a folder, or a folder and its sub-folders, without having to constantly check every file to see if it changed.  StableBit triggers "file changed" notifications even when files are just accessed (read) in certain ways.  StableBit does NOT generate notification events on the parent folder when a file under it changes (Windows does).  StableBit also does NOT generate a notification event when only a FileID changes (the next bug talks about FileIDs).

 

  • Windows, like Linux, has a unique ID number for each file written on the hard drive.  If there are hardlinks to the same file, they have the same unique ID (so one FileID may have multiple paths associated with it).  In Linux this is called the inode number; Windows calls it the FileID.  Rather than accessing a file by its path, you can open a file by its FileID.  In addition, two files on the same volume cannot share a FileID; it is a 64-bit number on NTFS (128-bit on ReFS) that persists across reboots.  A FileID does not change when a file moves or is modified.  StableBit, by default, supports FileIDs, however they seem to be ephemeral: they do not seem to survive across reboots or file moves.  Keep in mind FileIDs are used for directories as well, not just files.  Further, with DrivePool, if a directory is moved/renamed not only does its FileID change but so does the FileID of every file under it.  I am not sure if there are other situations in which they may change.  In addition, if a descendant file/directory FileID changes due to something like a directory rename, StableBit does NOT generate a notification event that it has changed (the application gets the directory event notification but nothing on the children).
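
As a rough illustration of the FileID idea, here is a minimal Python sketch (not taken from my test code) that reads a file's identity and checks whether it survives a rename. It relies on `os.stat`, whose `st_ino` field is the inode number on Linux and the file index on Windows; the function names and the file names `before.txt` / `after.txt` are made up for the example.

```python
import os
import tempfile


def file_identity(path):
    """Return (volume id, file id) as reported by the operating system.

    st_ino is the inode number on Linux and the file index on Windows.
    """
    info = os.stat(path)
    return (info.st_dev, info.st_ino)


def identity_survives_rename(directory):
    """Create a file, rename it, and report whether its identity is unchanged."""
    old_path = os.path.join(directory, "before.txt")  # hypothetical names
    new_path = os.path.join(directory, "after.txt")
    with open(old_path, "w") as handle:
        handle.write("same file, new name")
    before = file_identity(old_path)
    os.rename(old_path, new_path)
    after = file_identity(new_path)
    return before == after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print("identity kept after rename:", identity_survives_rename(scratch))
```

On an ordinary local file system the identity should be kept after the rename. Going by the behavior described above, pointing `identity_survives_rename` at a folder on a pool drive would be expected to report a changed identity, but that is my observation and not something this sketch proves on its own.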


There are some other things to consider as well.  DrivePool does not implement the standard Windows USN Journal (a system for tracking file changes on a drive).  It specifically identifies itself as not supporting this, so applications shouldn't be trying to use it with a DrivePool drive.  That does mean that applications that traditionally don't use the file change notification API or FileIDs may fall back to a combination of those to accomplish what they would otherwise use the USN Journal for (and this can exacerbate the problem).  The same is true of Volume Shadow Copy (VSS): applications that might traditionally use it cannot (DrivePool identifies that it cannot do VSS), so they may resort to the methods below that they do not traditionally use.


Now the effects of the above bugs may not be completely apparent:

  • For the overlapped IO / File change notification 

This means an application monitoring for changes on a DrivePool folder or sub-folder will get erroneous notifications that files changed when anything even accesses them.  Just opening something like File Explorer on a folder, or even switching between applications, can cause file accesses that trigger the notification.  If an application takes action on a notification and then checks the file at the end of handling it, this in itself may cause another notification.  Applications that rely on getting a folder changed notification when a child changes will not get these at all with DrivePool.  If an application is monitoring only the folder and not its children, no notification may be generated at all, so it could miss changes.
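
One way an application could defend itself against such spurious notifications is to confirm that the contents really differ before acting. The following is only a sketch of that idea in Python, not how OneDrive or Visual Studio actually behave; `ChangeFilter` and `content_fingerprint` are hypothetical names.

```python
import hashlib
import os
import tempfile


def content_fingerprint(path):
    """Return (size, SHA-256 hex digest) of the file's current contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return (os.path.getsize(path), digest.hexdigest())


class ChangeFilter:
    """Remembers the last fingerprint per path and ignores no-op notifications."""

    def __init__(self):
        self._known = {}

    def is_real_change(self, path):
        """Return True only if the contents differ from the last call for path."""
        current = content_fingerprint(path)
        changed = self._known.get(path) != current
        self._known[path] = current
        return changed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        sample = os.path.join(scratch, "sample.txt")
        with open(sample, "w") as handle:
            handle.write("hello")
        checker = ChangeFilter()
        print(checker.is_real_change(sample))  # first sighting counts as a change
        print(checker.is_real_change(sample))  # nothing changed since then
```

Hashing reads the file, which by the description above may itself raise another notification on a pool. With this filter the follow-up notification is reported as not a real change, so the loop ends, at the cost of reading the file again.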

  • For FileIDs

It depends what the application uses the FileID for, but it may assume the FileID stays the same when a file moves.  As it doesn't with DrivePool, the application might read, back up, or sync the entire file again when it is moved (perf issue).  An application that uses the Windows API to open a file by its ID may not get the file it is expecting, or opening a file that was simply moved by its old FileID may throw an error, because DrivePool has changed the ID.  For example, let's say an application caches that the FileID for ImportantDoc1.docx is 12345, but after a restart 12345 refers to ImportantDoc2.docx.  If this is a file sync application and ImportantDoc1.docx is changed remotely, then when it goes to write those remote changes to the local file using the OpenFileById method it will actually overwrite ImportantDoc2.docx with those changes.
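
The stale-ID scenario can be made concrete with a toy model. This Python sketch is NOT DrivePool's implementation (which is closed source); it only assumes IDs that are handed out incrementally on first use and forgotten on reboot, and `ToyVolume` and `stale_cache_demo` are hypothetical names.

```python
class ToyVolume:
    """Toy model: IDs are small, incremental, and forgotten on reboot.

    This is an illustration only, not how DrivePool is known to work.
    """

    def __init__(self):
        self._reset()

    def _reset(self):
        self._next_id = 1
        self._id_by_path = {}
        self._path_by_id = {}

    def file_id(self, path):
        """Return the ID for path, assigning the next free number on first use."""
        if path not in self._id_by_path:
            self._id_by_path[path] = self._next_id
            self._path_by_id[self._next_id] = path
            self._next_id += 1
        return self._id_by_path[path]

    def open_by_id(self, file_id):
        """Return the path the ID currently refers to, or None if unknown."""
        return self._path_by_id.get(file_id)

    def reboot(self):
        """Forget every ID that was handed out, as an in-memory table would."""
        self._reset()


def stale_cache_demo():
    """Show a cached ID resolving to a different file after a reboot."""
    volume = ToyVolume()
    cached_id = volume.file_id("ImportantDoc1.docx")  # the sync tool caches this
    volume.file_id("ImportantDoc2.docx")
    volume.reboot()
    # After the reboot a different file happens to be looked up first.
    volume.file_id("ImportantDoc2.docx")
    return volume.open_by_id(cached_id)


if __name__ == "__main__":
    print(stale_cache_demo())
```

In this model the ID cached for ImportantDoc1.docx resolves to ImportantDoc2.docx after the reboot, which is exactly the situation where a write through the old ID lands in the wrong file.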

I didn't spend the time to read the Windows file system requirements to know when Windows expects a FileID to potentially change (or not change).  It is important to note that even if changes/reuse are theoretically allowed, if they are not commonplace (because on a normal Windows volume a FileID effectively never repeats for a different file) applications may just assume it doesn't happen.  A backup or file sync program might assume that a file with a specific FileID is always the same file.  If FileID 12345 is c:\MyDocuments\ImportantDoc1.docx one day and c:\MyDocuments\ImportantDoc2.docx another, it may mistake document 2 for document 1, overwriting important data or restoring data to the wrong place.  If it is trying to create a whole-drive backup, it may assume it has already backed up c:\MyDocuments\ImportantDoc2.docx if, by the time it reaches it, that file has the FileID ImportantDoc1.docx had earlier (at which point DrivePool would have a different FileID for Document1).


Why might applications use FileIDs or file change notifiers?  It may not seem intuitive why applications would use these, but a few major reasons are:

  • Performance.  File change notifiers are an event/push-based system, so the application is told when something changes.  The common alternative is a poll-based system where an application must scan all the files looking for changes (and may rely on file timestamps or even on hashing the entire file to determine this), which causes a good bit more overhead / slowdown.

  • Hardlink de-duplication.  FileIDs are nice because they already handle hardlinked files: Windows may have multiple hardlinked copies of a file on a drive for various reasons, but if you back up based on FileID you back up that file once rather than multiple times.

  • Renames.  FileIDs are also great for handling renames.  Let's say you are an application that syncs files and the user backs up c:\temp\mydir with 1000 files under it.  If they rename c:\temp\mydir to c:\temp\mydir2, an application using FileIDs can say: wait, that folder is the same, it was just renamed, so rename that folder in our remote version too.  This is a very minimal operation on both ends.  With DrivePool, however, the FileID changes for the directory and all sub-files.  If the sync application uses this to determine changes, it now uploads all these files again, using a good bit more resources locally and remotely.  If the application also uses versioning, this may be far more likely to cause a conflict with two or more clients syncing, as mass amounts of files are seemingly being changed.
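
To show why stable FileIDs make renames cheap, here is a small Python sketch of how a sync tool might compare two snapshots keyed by FileID. It is an assumption about how such tools could work, not the logic of any specific product; `plan_sync` and the example file names are hypothetical.

```python
def plan_sync(old_snapshot, new_snapshot):
    """Compare two {file_id: path} snapshots.

    Returns (renames, uploads, deletes): an ID seen at a new path is a
    rename, an unknown ID is a new upload, a vanished ID is a delete.
    """
    renames, uploads, deletes = [], [], []
    for file_id, new_path in new_snapshot.items():
        old_path = old_snapshot.get(file_id)
        if old_path is None:
            uploads.append(new_path)
        elif old_path != new_path:
            renames.append((old_path, new_path))
    for file_id, old_path in old_snapshot.items():
        if file_id not in new_snapshot:
            deletes.append(old_path)
    return renames, uploads, deletes


if __name__ == "__main__":
    before = {1: "mydir/a.jpg", 2: "mydir/b.jpg"}

    # Stable IDs: the folder rename is recognized as two cheap renames.
    stable = {1: "mydir2/a.jpg", 2: "mydir2/b.jpg"}
    print(plan_sync(before, stable))

    # Changed IDs: the same rename looks like two deletes plus two uploads.
    changed = {7: "mydir2/a.jpg", 8: "mydir2/b.jpg"}
    print(plan_sync(before, changed))
```

With stable IDs the plan contains only renames. When every ID changes, the same folder rename turns into a delete and a full re-upload for each file, which matches the excess uploading described above.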

Finally, even if an application is trying to monitor for FileIDs changing using the file change API, due to the notification bugs above it may not get any notifications when child FileIDs change, so it might assume they have not.


Real Examples
OneDrive
This started with massive OneDrive failures.  I would find OneDrive was re-uploading hundreds of gigabytes of images and videos multiple times a week.  These were not changing or moving.  I don't know if the issue is that OneDrive uses FileIDs to determine whether a file is already uploaded, or if scanning a directory triggered a notification that all the files in that directory had changed and it re-uploaded based on that notification.

After this I noticed files were being deleted both locally and in the cloud.  I don't know what caused this.  It might be that it thought the old file was deleted because its FileID was gone, and while there was a new file (actually the same file) in its place, there may have been some odd race condition.  It is also possible that it queued the file for upload, the FileID changed, and when it went to open the file to upload it found it was 'deleted' (the FileID no longer pointed to a file) and queued the delete operation.

I also found that files uploaded to the cloud in one folder were sometimes downloading to a different folder locally.  I am guessing this is because the folder FileID changed: it thought the 2023 folder had ID XYZ, but that ID now pointed to a different folder, so it put the file in the wrong place.

The final form of corruption was finding the data from one photo or video in a file with a completely different name.  This is almost guaranteed to be due to the FileID bugs.  This is highly destructive, as it is far harder to correct from backups.  With one file's contents replaced by another's, you need to know when the good content existed and which files were affected.  Depending on retention policies, the file contents that replaced it may overwrite the good backups before you notice.

I also had a BSOD with OneDrive where it was trying to set attributes on a file and the CoveFS driver corrupted some memory.  It is possible this was a race condition, as OneDrive may have been processing hundreds of files very rapidly due to the bugs.  I have not captured a second BSOD due to it, but I also stopped using OneDrive on DrivePool due to the corruption.

Another example of this is data leakage.  Let's say you share your favorite article on kittens with a group of people.  OneDrive, believing that file has changed, goes to open it using the FileID; however, that FileID could now correspond to essentially any file on your computer.  Now the contents of some sensitive file are put in the place of that kitten file, and everyone you shared it with can access it.

Visual Studio Failures
Visual Studio is a code editor/compiler.  There are several distinct problems that happen.  First, when compiling, if you touched one file in a folder it seemed to recompile the entire folder, likely due to the notification bug.  This is just a slowdown, but an annoying one.  Second, Visual Studio has compiler-generated code support.  This means the compiler will generate actual source code that lives next to your own source code.  Normally once compiled it doesn't regenerate and recompile this source unless it must change, but due to the notification bugs it regenerates this code constantly, and if there is an error in other code it causes an error there too, producing several other invalid errors.  Third, when debugging, Visual Studio by default will only use symbols (debug location data) that exactly match the source.  As the notifications from DrivePool happen on certain file accesses, Visual Studio constantly thinks the source has changed since it was compiled, and you will only be able to set breakpoints inside source if you disable the exact symbol match default.  If you have multiple projects in a solution with one dependent on another, it will often rebuild other project dependencies even when they haven't changed; for large solutions that can be crippling (perf issue).  Finally, I often had IntelliSense errors showing up even though there were no errors during compiling, and worse, IntelliSense would completely break at points.  All due to DrivePool.


Technical details / full background & disclaimer

I have sample code and logs to document these issues in greater detail if anyone wants to replicate it themselves.

It is important for me to state that DrivePool is closed source and I don't have the technical details of how it works.  I also don't have the technical details of how applications like OneDrive or Visual Studio work.  So some of my explanations of why the applications fail may be guesses.

The facts stated are true (to the best of my knowledge) 


Shortly before my trial expired in October of last year I discovered some odd behavior.  I had a technical ticket filed within a week, and within a month had traced down at least one of the bugs.  The issue can be seen at https://stablebit.com/Admin/IssueAnalysis/28720 ; it shows priority 2/important, which I would assume is the second highest (probably with critical or similar above it).  It is great that it has priority, but as it has been over 6 months since it was filed without updates, I figured warning others about the potential corruption was important.


The FileSystemWatcher API is implemented on Windows using async overlapped IO; the exact code can be seen at: https://github.com/dotnet/runtime/blob/57bfe474518ab5b7cfe6bf7424a79ce3af9d6657/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.Win32.cs#L32-L66

That corresponds to this Windows overlapped IO mechanism:
https://learn.microsoft.com/en-us/windows/win32/fileio/synchronous-and-asynchronous-i-o

Newer API calls use GetFileInformationByHandleEx to get the FileID, while the older GetFileInformationByHandle call exposes it as nFileIndexHigh/nFileIndexLow.


In terms of the FileID bug, I wouldn't normally have even thought about it, but the advanced config (https://wiki.covecube.com/StableBit_DrivePool_2.x_Advanced_Settings) mentions this under CoveFs_OpenByFileId: "When enabled, the pool will keep track of every file ID that it gives out in pageable memory (memory that is saved to disk and loaded as necessary).".  Keeping track of files in memory is certainly very different from what Windows does, so I thought this may be the source of the issue.  I also don't know if there is a cap on the maximum number of files it will track; if it resets FileIDs in situations other than reboots, that could be much worse.  Turning this off will at least break NFS servers, as the docs mention it is "required by the NFS server".

Finally, the FileID numbers given out by DrivePool are incremental and very low.  This means that when they do reset you will almost certainly get collisions with former numbers.  What is not clear is whether there is a chance of FileID corruption issues.  When it is assigning these IDs in a multi-threaded scenario, with many different files at the same time, could this system fail?  I have seen no proof this happens, but when incremental IDs are assigned like this for mass quantities of files, it has a higher chance of occurring.

Microsoft mentions this about deleting the USN Journal: "Deleting the change journal impacts the File Replication Service (FRS) and the Indexing Service, because it requires these services to perform a complete (and time-consuming) scan of the volume. This in turn negatively impacts FRS SYSVOL replication and replication between DFS link alternates while the volume is being rescanned.".  DrivePool never supports the USN Journal, so it isn't exactly the same thing, but it is clear that several core Windows services do use it for normal operations.  I do not know what fallbacks they use when it is unavailable.


Potential Fixes
There are advanced settings for DrivePool (https://wiki.covecube.com/StableBit_DrivePool_2.x_Advanced_Settings); beware that these changes may break other things.
CoveFs_OpenByFileId - Set to false; by default it is true.  This will disable the OpenByFileID API.  It is clear several applications use this API.  In addition, while DrivePool may disable that function with this setting, it doesn't disable FileIDs themselves.  Any application using FileIDs as static identifiers for files may still run into problems.

I would avoid using any file backup/synchronization tools on DrivePool drives (if possible).  These likely have the highest chance of lost files, misplaced files, file content being mixed up, and excess resource usage.  If you can't avoid them, consider taking file hashes for the entire DrivePool directory tree.  Do this again at a later point and make sure files that shouldn't have changed still have the same hash.

If you have files that rarely change after being created, then hashing each file at some point after creation and alerting if a file disappears or its hash changes would act as an easy early warning that one of these bugs is being hit.
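
The hashing suggestion can be sketched in a few lines of Python. This is a minimal example under my own assumptions (SHA-256, whole tree held in memory, function names made up), not a polished tool; for a large pool you would want to save the manifest to disk between runs.

```python
import hashlib
import os
import tempfile


def hash_file(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            full = os.path.join(folder, name)
            manifest[os.path.relpath(full, root)] = hash_file(full)
    return manifest


def compare_manifests(old, new):
    """Return (missing, changed) relative paths when going from old to new."""
    missing = sorted(path for path in old if path not in new)
    changed = sorted(
        path for path in old if path in new and old[path] != new[path]
    )
    return missing, changed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "photo.bin"), "wb") as handle:
            handle.write(b"original")
        earlier = build_manifest(root)
        with open(os.path.join(root, "photo.bin"), "wb") as handle:
            handle.write(b"replaced")
        print(compare_manifests(earlier, build_manifest(root)))
```

A file whose contents were swapped with another file's shows up in `changed` even though its name and location look fine, which is the kind of silent corruption that is otherwise easy to miss.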


  • 0
5 hours ago, JC_RFC said:

If there was no Object-ID then I agree with the overhead point. However the Drivepool developer has ALREADY gone to all the trouble and overhead of generating unique 128bit Object-ID's for every file on the pool.

This is why I feel it should be trivial to now also populate the File-ID with a unique 64bit value derived from this Object-ID. All the hard work has already been done.

My testing so far hasn't seen DrivePool automatically generating Object-ID records for files on the pool; if all the files on your pool have Object-ID records you may want to look for whatever on your system is doing that.

I suspect that trivial in theory ends up being a lot of work in practice. Not only do you need to populate the File-ID records in your virtual MFT from the existing Object-ID records in the underlying physical $ObjID files across all of your poolpart volumes, you also need to be able to generate new File-ID records whenever new files are created on the pool and immediately update those $ObjID files accordingly, you need to ensure these records are correctly propagating during balancing/placement/duplication changes, you need to add the appropriate detect/repair routines to your consistency checking, and you need to do all of this as efficiently and safely as possible.


  • 0

You are right, I don't know where I got this idea that Drivepool had object-id's for all files from.

I just checked and my files do not have an object_id. So yes, lots of work from here to have unique file_id's, agreed.


  • 0
Quote

Our File IDs persist until the next reboot. We avoid using fully persistent File IDs to enhance performance.

Guess that's that then, pretty much. A compromise made for having every drive intact with its own NTFS volume, emulating NTFS on top of NTFS.

I keep thinking I'd prefer my pool to be at block level, between the hardware layer and the file system. We'd lose the ability to access drives individually via direct NTFS mounting outside the pool (which I guess is important to the developer and at least the initial users), but it would have been a true NTFS on top of the drives, formatted normally and with full integrity (whatever optional behavior NTFS is actually using). Any developer could then lean on the experience they have with any real NTFS, and get the same functionality here.

If not using striping across drives, the virtual drive could easily place entire file byte arrays on individual drives without splitting. Drives would then still not have to rely on each other to recover data from specific drives like typical RAIDs; one could read whatever is on them via the virtual drive by checking whatever proprietary byte array journal data one designs to attach to each file at block level. I'd personally prefer something like that, at least at a glance when just thinking about it.

I'm pretty much in the @JC_RFC camp on this.

Thanks all for making updates on this.


  • 0

Would I be safe from data corruption if I'm not using all the fancy stuff like read striping & duplication?

I never enabled those. The only plugins I enabled are the scanner and data limiter, and I have file placement rules set so that things don't spread out and are contained within a specific drive, for example, choosing one SSD for specific app(s) or game(s), and an HDD for all other data.

I'm a new user so I haven't yet encountered any weird stuff.


  • 0

Hi haoma.

The corruption isn't being caused by DrivePool's duplication feature (and while read-striping can have some issues with some older or... I'll say "edge-case" utilities, so I usually just leave it off anyway, that's also not the cause here).

The corruption risk comes in if any app relies on a file's ID number to remain unique and stable unless the file is modified or deleted, as that number is being changed by DrivePool even when the file is simply renamed, moved or if the drivepool machine is restarted - and in the latter case being re-used for completely different files.

TLDR: far as I know currently the most we can do is to change the Override value for CoveFs_OpenByFileId from null to false (see Advanced Settings). At least as of this post date it doesn't fix the File ID problem, but it does mean affected apps should either safely fall back to alternative methods or at least return some kind of error so you can avoid using them with a pool drive.


  • 0

To be fair to Stablebit I used Drivepool for the past few years and have NEVER lost a single file because of Drivepool. The elaborateness OR simpleness of how you use Drivepool within itself is not really of concern.

What is being warned of here though is if you use any special applications that might expect FileID to behave as per NTFS there will be risks with that.

My example is that I use FreeFileSync quite a bit to maintain files between my pool, my HTPC and another backup location. When I move files on one drive, FreeFileSync, using the FileID, recognises the file was moved, so it syncs a "move" on the remote filesystem as well. This saves potentially hours of copying and then deleting. It does not work on DrivePool because the FileID changes on each reboot. In this case FreeFileSync fails "SAFE" in that it does the copy and delete instead, so I only suffer performance issues.

What could happen though is that you use another app that, say, cleans old files or moves files around and does not fail safe if a FileID is repeated for a different file, and in doing so you do suffer file loss. This will only happen if you use some third-party application that makes changes to files. It's not the type of thing a word processor or a PC game etc. is going to be doing (typically, in case someone jumps in with an "it could be possible" argument).

So generally DrivePool is safe, and for you most likely there is nothing to worry about, but if you do use an application now or in the future that is for cleaning up or syncing etc. then be very careful in case it uses FileIDs and causes data loss because of this issue.

For day to day use, in my experience you can continue to use it as is. If you want to add to the group of us that would like this improved, feel free to add your voice to the request as otherwise I don't see any update for this in the foreseeable future.

Edited by JC_RFC
typo

  • 0

Cross-quoting my workaround that definitely works:

1 hour ago, roirraWedorehT said:

Thanks for the details, and the work necessary to gather them and post about it.

I'm working around the issue currently by having a Hyper-V virtual PC set only for using Google Drive Backup and Sync, or whatever they happen to be calling it at this moment.  I gave the virtual PC a secondary virtual hard drive - I made it a 10 GB dynamically expanding drive, so it'll only take up as much space as it needs to, but is open to expansion for a good long while, and have Drive sync to a folder on that VHDX.  Then I have it shared over the network, and I assigned it the letter G: on my host desktop.

I know there are plenty of people out there who don't have experience with virtual PCs, but Windows' built-in Hyper-V makes it so easy and fast.

I would obviously prefer to use it locally, but this will work for now.  The sync app hasn't crashed at all, and everything is synced.

Edit:  Now that I think about it, creating a .VHDX locally on the host desktop through Disk Management, mounting it (as drive G: or whatever you prefer), and pointing Google Drive and Sync to that virtual drive sitting on a drivepool would probably work, as well - but I'm tired of playing around and chancing that something else will still cause issues, so I'll keep Google Drive and Sync in my virtual PC for the foreseeable future. 

Edit 2:  It's possible I actually tried that proposed second simpler solution first, but I still had issues.  The issues might've been because, when rebooting, or signing out and signing back in to Windows (if you ever do that), it's possible that the VHDX wasn't mounted before Google Drive Backup and Sync automatically launched.  While it's possible that there's a way to force Google Drive Backup and Sync to wait to launch, I either didn't look at that possibility, or didn't want to have to go down that road.

Or there were circumstances where the VHDX could be temporarily unmounted, causing Google's program to crash.

Anyone, please let me know if you test the local .VHDX solution, and it has no issues for weeks.

Cheers, all.

 
