Pafegori

Reputation Activity

  1. Thanks
    Pafegori reacted to MitchC in Beware of DrivePool corruption / data leakage / file deletion / performance degradation scenarios Windows 10/11   
    To start: while I am new to DrivePool, I love its potential, and I own multiple licenses and the full StableBit suite. If you only use DrivePool for basic archiving of large files, with simple applications accessing them for periodic reads, you are probably unlikely to hit these bugs. This assumes you don't use any file synchronization / backup solutions. Further, I don't know how many thousands (tens or hundreds?) of DrivePool users there are, but clearly many are not hitting these bugs, or not recognizing that they are hitting them, so this IS NOT some new destructive "my files are 100% going to die" issue. Some of the reports I have seen on the forums, though, may actually be caused by these issues without being recognized as such. As far as I know, CoveCube was not previously aware of these issues, so tickets may not have even considered this possibility.
    I started reporting these bugs to StableBit ~9 months ago, and informed them ~1 month ago that I would be putting this post together. Please see the disclaimer below as well, as some of this is based on observation rather than known facts.
    You are most likely to run into these bugs with applications that:
    *) Synchronize or back up files, including cloud mounted drives like OneDrive or Dropbox
    *) Must handle large quantities of files or monitor them for changes, like coding applications (Visual Studio / VSCode)

    Still, these bugs can cause silent file corruption, file misplacement, deleted files, performance degradation, data leakage (a file shared with someone externally could have its contents overwritten by any sensitive file on your computer), missed file changes, and potentially other issues for a small portion of users (I have had nearly all of these things occur). It may also trigger some BSOD crashes; I had one such crash that is likely related. Because some of these bugs present subtly, it may be hard to notice they are happening even when they are. In addition, these issues can occur even without file mirroring and with files pinned to a specific drive. I do have some potential workarounds/suggestions at the bottom.
    More details are at the bottom but the important bug facts upfront:
    Windows has a native file change notification API using overlapped IO calls. This allows an application to listen for changes on a folder, or a folder and its sub-folders, without having to constantly check every file to see if it changed. StableBit triggers "file changed" notifications even when files are just accessed (read) in certain ways. StableBit does NOT generate notification events on the parent folder when a file under it changes (Windows does). StableBit also does NOT generate a notification event when only a FileID changes (the next bug covers FileIDs).
    Windows, like Linux, has a unique ID number for each file written on the hard drive. If there are hardlinks to the same file, they share the same unique ID (so one FileID may have multiple paths associated with it). In Linux this is called the inode number; Windows calls it the FileID. Rather than accessing a file by its path, you can open a file by its FileID. In addition, two different files on the same volume cannot share the same FileID; it is a 128 bit number that persists across reboots (128 bits means the count of unique values is a number 39 digits long, which gives it the uniqueness of something like an MD5 hash). A FileID does not change when a file moves or is modified. StableBit, by default, supports FileIDs, however they seem to be ephemeral: they do not seem to survive reboots or file moves. Keep in mind FileIDs are used for directories as well, not just files. Further, on DrivePool, if a directory is moved/renamed, not only does its FileID change but so does the FileID of every file under it. I am not sure if there are other situations in which they may change. In addition, if a descendant file/directory FileID changes due to something like a directory rename, StableBit does NOT generate a notification event that it has changed (the application gets the directory event notification but nothing for the children).
    There are some other things to consider as well. DrivePool does not implement the standard Windows USN Journal (a system for tracking file changes on a drive). It specifically identifies itself as not supporting this, so applications shouldn't be trying to use it with a DrivePool drive. That does mean that applications that traditionally don't use the file change notification API or FileIDs may fall back to a combination of those to accomplish what they would otherwise use the USN Journal for (and this can exacerbate the problem). The same is true of Volume Shadow Copy (VSS): applications that would traditionally use it cannot (DrivePool identifies that it cannot do VSS), so they may resort to the methods below that they do not traditionally use.
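    To check how FileIDs behave on a given drive, a small probe along these lines may help. This is only a sketch: it assumes Python's os.stat, where st_ino is the inode number on Linux and the file index (FileID) on Windows. The function names and probe file names are made up for illustration, and I have not listed any expected DrivePool output here; on NTFS or a typical Linux filesystem I would expect every value to come back True.

```python
import os

# 128 bits allow 2**128 distinct values; that count is a 39-digit number.
ID_SPACE_DIGITS = len(str(2 ** 128))


def file_id(path):
    """Return (volume, file id) for a path using os.stat."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def probe_file_id(folder):
    """Create a probe file in `folder` and report whether its ID survives
    a hardlink, a rename, and a modification. Probe files are removed."""
    original = os.path.join(folder, "fileid_probe.tmp")
    renamed = os.path.join(folder, "fileid_probe_renamed.tmp")
    link = os.path.join(folder, "fileid_probe_link.tmp")
    result = {}
    try:
        with open(original, "w") as f:
            f.write("probe")
        first = file_id(original)

        # A hardlink is a second path to the same file, so it should share the ID.
        try:
            os.link(original, link)
            result["hardlink_shares_id"] = file_id(link) == first
        except OSError:
            # The filesystem may not support hardlinks at all.
            result["hardlink_shares_id"] = None

        os.rename(original, renamed)
        result["stable_after_rename"] = file_id(renamed) == first

        with open(renamed, "a") as f:
            f.write(" modified")
        result["stable_after_modify"] = file_id(renamed) == first
    finally:
        for path in (original, renamed, link):
            if os.path.exists(path):
                os.remove(path)
    return result
```

    Calling probe_file_id on a folder inside the pool and again on a folder on a plain NTFS drive gives a quick side-by-side comparison. It does not cover the reboot case, which needs the ID recorded before and checked after a restart.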

    Now the effects of the above bugs may not be completely apparent:
    For the overlapped IO / file change notification: this means an application monitoring for changes on a DrivePool folder or sub-folder will get erroneous notifications that files changed whenever anything even accesses them. Just opening something like File Explorer on a folder, or even switching between applications, can cause file accesses that trigger the notification. If an application acts on a notification and then checks the file at the end of handling it, that check may itself cause another notification. Applications that rely on getting a folder changed notification when a child changes will not get these at all with DrivePool. If an application monitors only the folder and not its children, no notification may be generated at all, so it could miss changes.
    For FileIDs: it depends on what the application uses the FileID for, but it may assume the FileID stays the same when a file moves. As it doesn't with DrivePool, the application might read, back up, or sync the entire file again if it is moved (perf issue). An application that uses the Windows API to open a file by its ID may not get the file it is expecting, or opening a file that was simply moved by its old FileID will throw an error because DrivePool has changed the ID. For example, let's say an application caches that the FileID for ImportantDoc1.docx is 12345, but after a restart 12345 refers to ImportantDoc2.docx. If this application is a file sync application and ImportantDoc1.docx is changed remotely, then when it goes to write those remote changes to the local file using the OpenFileById method, it will actually overwrite ImportantDoc2.docx with those changes.
    I didn't spend the time to read the Windows file system requirements to know when Windows expects a FileID to potentially change (or not change). It is important to note that even if changes/reuse are theoretically allowed, if they are not commonplace (because Windows essentially uses a number like an MD5 hash in terms of repeats) applications may just assume it doesn't happen. A backup or file sync program might assume that a file with a specific FileID is always the same file. If FileID 12345 is c:\MyDocuments\ImportantDoc1.docx one day and c:\MyDocuments\ImportantDoc2.docx another, it may mistake document 2 for document 1, overwriting important data or restoring data to the wrong place. If it is trying to create a whole drive backup, it may assume it has already backed up c:\MyDocuments\ImportantDoc2.docx if that file now has the same FileID that ImportantDoc1.docx had (by which point DrivePool would have a different FileID for document 1).
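    The ImportantDoc1/ImportantDoc2 scenario can be walked through with a toy in-memory model. To be clear, this is NOT DrivePool's implementation (which is closed source); ToyVolume and its methods are hypothetical names, and the only assumption carried over is that IDs are handed out from a counter that starts over after a restart.

```python
class ToyVolume:
    """Hypothetical stand-in for a volume whose file IDs come from a
    counter that is reset on reboot. Not real DrivePool behavior."""

    def __init__(self, paths):
        self.contents = {p: "original contents of " + p for p in paths}
        self._next_id = 1
        self._id_to_path = {}

    def get_file_id(self, path):
        # Reuse the ID if this path was already given one since the last reboot.
        for fid, known in self._id_to_path.items():
            if known == path:
                return fid
        fid = self._next_id
        self._next_id += 1
        self._id_to_path[fid] = path
        return fid

    def reboot(self):
        # All previously issued IDs are forgotten and numbering starts over.
        self._next_id = 1
        self._id_to_path = {}

    def write_by_id(self, fid, data):
        # Stand-in for opening by ID and writing; returns the path actually hit.
        path = self._id_to_path[fid]
        self.contents[path] = data
        return path


def stale_id_scenario():
    vol = ToyVolume(["ImportantDoc1.docx", "ImportantDoc2.docx"])
    cached_id = vol.get_file_id("ImportantDoc1.docx")  # sync tool caches this
    vol.reboot()
    vol.get_file_id("ImportantDoc2.docx")  # something touches Doc2 first
    # The sync tool now writes Doc1's remote changes using its cached ID.
    written_to = vol.write_by_id(cached_id, "remote changes meant for Doc1")
    return written_to, vol.contents
```

    In this model stale_id_scenario() reports that the write landed on ImportantDoc2.docx while ImportantDoc1.docx is untouched, which is the silent overwrite described above.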

    Why might applications use FileIDs or file change notifiers? It may not seem intuitive why applications would use these, but a few major reasons are:
    *) Performance. File change notifiers are an event/push based system, so the application is told when something changes. The common alternative is a poll based system where an application must scan all the files looking for changes (and may rely on file timestamps or even hashing the entire file to determine this), which causes a good bit more overhead / slowdown.
    *) FileIDs are nice because they already handle hardlink de-duplication (Windows may have multiple paths to the same file on a drive for various reasons, but if you back up based on FileID you back up that file once rather than multiple times).
    *) FileIDs are also great for handling renames. Let's say you are an application that syncs files and the user backs up c:\temp\mydir with 1000 files under it. If they rename c:\temp\mydir to c:\temp\mydir2, an application using FileIDs can say: wait, that folder is the same, it was just renamed. OK, rename that folder in our remote version too. This is a very minimal operation on both ends. With DrivePool, however, the FileID changes for the directory and all files under it. If the sync application uses FileIDs to determine changes, it now uploads all of these files again, using a good bit more resources locally and remotely. If the application also uses versioning, this may be far more likely to cause a conflict with two or more clients syncing, as massive numbers of files seemingly change at once.
    Finally, even if an application tries to monitor for FileIDs changing using the file change API, due to the notification bugs above it may not get any notifications when child FileIDs change, so it might assume they have not.
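    The rename handling described above can be sketched as a comparison of two snapshots keyed by FileID. This is an illustration only, not how any particular sync tool works: diff_by_file_id is a made-up name, the snapshots are plain dicts of path to ID, and it assumes one path per ID (no hardlinks).

```python
def diff_by_file_id(old, new):
    """Compare two {path: file_id} snapshots.

    Returns (renamed, added, removed), where renamed is a list of
    (old_path, new_path) pairs for IDs present in both snapshots.
    """
    old_by_id = {fid: path for path, fid in old.items()}
    new_by_id = {fid: path for path, fid in new.items()}

    shared = old_by_id.keys() & new_by_id.keys()
    renamed = sorted(
        (old_by_id[fid], new_by_id[fid])
        for fid in shared
        if old_by_id[fid] != new_by_id[fid]
    )
    added = sorted(new_by_id[fid] for fid in new_by_id.keys() - old_by_id.keys())
    removed = sorted(old_by_id[fid] for fid in old_by_id.keys() - new_by_id.keys())
    return renamed, added, removed


# Snapshot before the user renames mydir to mydir2.
before = {"mydir/a.txt": 10, "mydir/b.txt": 11}

# IDs preserved across the rename: two cheap rename operations.
after_stable_ids = {"mydir2/a.txt": 10, "mydir2/b.txt": 11}

# IDs regenerated by the rename: looks like two deletes plus two new files.
after_changed_ids = {"mydir2/a.txt": 20, "mydir2/b.txt": 21}
```

    With stable IDs the diff contains only renames; with regenerated IDs the same folder rename shows up as every file removed and every file added, which is the full re-upload case.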

    Real Examples
    OneDrive
    This started with massive OneDrive failures. I would find OneDrive re-uploading hundreds of gigabytes of images and videos multiple times a week. These files were not changing or moving. I don't know if the issue is that OneDrive uses FileIDs to determine whether a file is already uploaded, or that scanning a directory triggered a notification that all the files in that directory had changed and it re-uploaded based on that notification.
    After this I noticed files were being deleted both locally and in the cloud. I don't know what caused this. It might be that it thought the old file was deleted because its FileID was gone, and while there was a new file (actually the same file) in its place, there may have been some odd race condition. It is also possible that it queued the file for upload, the FileID changed, and when it went to open the file to upload it found it was 'deleted' because the FileID no longer pointed to a file, so it queued a delete operation.
    I also found that files uploaded to the cloud in one folder were sometimes downloaded to a different folder locally. I am guessing this is because the folder FileID changed: it thought the 2023 folder had ID XYZ, but that ID now pointed to a different folder, so it put the file in the wrong place.
    The final form of corruption was finding the data from one photo or video inside a file with a completely different name. This is almost guaranteed to be due to the FileID bugs. It is highly destructive, and it makes restoring from backups far harder: with one file's contents replaced by another's, you need to know when the good content existed and which files were affected. Depending on retention policies, the bad contents may replace the good backups before you notice.
    I also had a BSOD with OneDrive where it was trying to set attributes on a file and the CoveFS driver corrupted some memory. It is possible this was a race condition, as OneDrive may have been processing hundreds of files very rapidly due to the bugs. I have not captured a second BSOD from it, but I also stopped using OneDrive on DrivePool because of the corruption.
    Another example of this is data leakage. Let's say you share your favorite article on kittens with a group of people. OneDrive, believing that file has changed, goes to open it using the FileID; however, that FileID could now correspond to essentially any file on your computer. Now the contents of some sensitive file are put in place of the kitten file, and everyone you shared it with can access it.
    Visual Studio Failures
    Visual Studio is a code editor/compiler. There are several distinct bugs that happen. First, when compiling, if you touched one file in a folder it seemed to recompile the entire folder, likely due to the notification bug. This is just a slowdown, but an annoying one. Second, Visual Studio supports compiler generated code. This means the compiler will generate actual source code that lives next to your own source code. Normally, once compiled, it doesn't regenerate and recompile this source unless it must change, but due to the notification bugs it regenerates this code constantly, and if there is an error in other code it causes an error there, producing several other invalid errors. Third, when debugging, Visual Studio by default will only use symbols (debug location data) that exactly match the source. As the notifications from DrivePool happen on certain file accesses, Visual Studio constantly thinks the source has changed since it was compiled, and you will only be able to set breakpoints inside source if you disable the exact symbol match default. If you have multiple projects in a solution with one dependent on another, it will often rebuild the other project dependencies even when they haven't changed; for large solutions that can be crippling (perf issue). Finally, I often had IntelliSense errors showing up even though there were no errors during compiling, and worse, IntelliSense would completely break at points. All due to DrivePool.

    Technical details / full background & disclaimer
    I have sample code and logs to document these issues in greater detail if anyone wants to replicate it themselves.
    It is important for me to state drivepool is closed source and I don't have the technical details of how it works.  I also don't have the technical details on how applications like onedrive or visual studio work.  So some of these things may be guesses as to why the applications fail/etc.
    The facts stated are true (to the best of my knowledge) 

    Shortly before my trial expired in October of last year I discovered some odd behavior. I had a technical ticket filed within a week, and within a month had traced down at least one of the bugs. The issue can be seen at https://stablebit.com/Admin/IssueAnalysis/28720 . It shows priority 2/important, which I would assume is the second highest (probably with critical or similar above it). It is great that it has priority, but as it has been over 6 months since it was filed without updates, I figured warning others about the potential corruption was important.

    The FileSystemWatcher API is implemented in Windows using async overlapped IO; the exact code can be seen at: https://github.com/dotnet/runtime/blob/57bfe474518ab5b7cfe6bf7424a79ce3af9d6657/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.Win32.cs#L32-L66
    That corresponds to this kernel api:
    https://learn.microsoft.com/en-us/windows/win32/fileio/synchronous-and-asynchronous-i-o
    Newer API calls use GetFileInformationByHandleEx to get the FileID, while the older stat-style call (GetFileInformationByHandle) exposes it as nFileIndexHigh/nFileIndexLow.
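    For anyone reading the older structure: nFileIndexHigh and nFileIndexLow are two 32 bit halves of a single 64 bit identifier. A minimal sketch of joining them (combine_file_index is just an illustrative helper name, not a Windows API):

```python
def combine_file_index(n_file_index_high, n_file_index_low):
    """Join the two 32-bit halves from BY_HANDLE_FILE_INFORMATION
    into one 64-bit file index."""
    for half in (n_file_index_high, n_file_index_low):
        if not 0 <= half <= 0xFFFFFFFF:
            raise ValueError("each half must fit in an unsigned 32-bit value")
    return (n_file_index_high << 32) | n_file_index_low
```

    For example, a high part of 0 and a low part of 12345 is simply ID 12345, which is the kind of small value discussed further down.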

    In terms of the FileID bug, I wouldn't normally have even thought about it, but the advanced config (https://wiki.covecube.com/StableBit_DrivePool_2.x_Advanced_Settings) mentions this under CoveFs_OpenByFileId: "When enabled, the pool will keep track of every file ID that it gives out in pageable memory (memory that is saved to disk and loaded as necessary).". Keeping track of files in memory is certainly very different from what Windows does, so I thought this might be the source of the issue. I also don't know if there are caps on the maximum number of files it will track; if it resets FileIDs in situations other than reboots, that could be much worse. Turning this off will at least break NFS servers, as the docs mention it right there: "required by the NFS server".
    Finally, the FileID numbers given out by DrivePool are incremental and very low. This means that when they do reset you will almost certainly get collisions with former numbers. What is not clear is whether there is a chance of FileID corruption. When it is assigning these IDs in a multi-threaded scenario, with many different files at the same time, could this system fail? I have seen no proof this happens, but when incremental IDs are assigned like this for massive quantities of files, it has a higher chance of occurring.
    Microsoft mentions this about deleting the USN Journal: "Deleting the change journal impacts the File Replication Service (FRS) and the Indexing Service, because it requires these services to perform a complete (and time-consuming) scan of the volume. This in turn negatively impacts FRS SYSVOL replication and replication between DFS link alternates while the volume is being rescanned.". Now, DrivePool never supports the USN Journal, so it isn't exactly the same thing, but it is clear that several core Windows services use it for normal operations. I do not know what fallbacks they use when it is unavailable.

    Potential Fixes
    There are advanced settings for DrivePool (https://wiki.covecube.com/StableBit_DrivePool_2.x_Advanced_Settings). Beware, these changes may break other things.
    CoveFs_OpenByFileId - Set to false (by default it is true). This will disable the OpenByFileID API. It is clear several applications use this API. In addition, while DrivePool may disable that function with this setting, it doesn't disable FileIDs themselves. Any application using FileIDs as static identifiers for files may still run into problems.
    I would avoid using any file backup/synchronization tools on DrivePool drives (if possible). These likely have the highest chance of lost files, misplaced files, file content being mixed up, and excess resource usage. If you can't avoid them, consider taking file hashes for the entire DrivePool directory tree. Do this again at a later point and make sure files that shouldn't have changed still have the same hash.
    If you have files that rarely change after being created then hashing each file at some point after creation and alerting if that file disappears or hash changes would easily act as an early warning to a bug here being hit.
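    A minimal sketch of that hashing check, assuming SHA-256 and an in-memory manifest (build_manifest and compare_manifests are illustrative names, and saving the manifest between runs, for example as JSON, is left out):

```python
import hashlib
import os


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its content hash."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            manifest[os.path.relpath(full_path, root)] = hash_file(full_path)
    return manifest


def compare_manifests(old, new):
    """Return (missing, changed): files from the old manifest that have
    disappeared, and files whose hash differs now."""
    missing = sorted(path for path in old if path not in new)
    changed = sorted(
        path for path in old if path in new and old[path] != new[path]
    )
    return missing, changed
```

    Build a manifest once the files have settled, build another later, and treat anything in missing or changed that you did not touch yourself as a warning sign. Files added since the first run are ignored on purpose.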