Corruption: Clouddrive just started mass deleting chunks

modplan · March 8, 2016

Google Drive

.486

win2k8r2

I've paused upload threads and it has stopped, but about 2500 chunks were just "permanently deleted"

I copied about 775GB into a brand new 10TB drive. Over 440GB of that was successfully uploaded and it was chugging along like normal. Nothing changed. Watching technical details I noticed suddenly the upload threads were popping up at 0%, never progressing, then new ones would pop up at 0% and this would continue over and over. No errors were thrown by the GUI, nothing abnormal in the logs.

I logged in to the Google Drive web GUI and I see tons of this:

Via trial and error I found some video files that had chunks in this range (~28,800 to ~31,300). When I try to play these files, tons of prefetch threads spawn in this range, stay at 0%, and the "Prefetched" count jumps up to 500MB instantly. The file never plays in VLC. If I try to open an image file that is in chunks in this range, Windows image viewer tells me it is corrupt.

Any ideas how this could possibly happen? Pretty disappointed right now, this is some serious corruption.

EDIT: I've uploaded what logs I have, but I do not see anything abnormal in them. I did NOT have drive tracing enabled and am obviously scared to enable it and resume uploads for fear of more chunks being deleted from the cloud!

Christopher (Drashna) · March 8, 2016

To let you know, there are a few circumstances in which a provider can delete chunks.

- Destroying a drive
- The chunk contains only 0's (provider specific, Google Drive being one of them)
- (Google Drive only) When the MIME type error is generated, the chunk is deleted and re-uploaded

These are the only times that it should ever delete a chunk on the pool. Period.
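The three cases above can be written down as a small predicate. This is only an illustrative sketch of the rules as stated in this post, not CloudDrive's actual code: the `Event` names, the `ZERO_CHUNK_PROVIDERS` set, and the provider string `"google_drive"` are all hypothetical.

```python
from enum import Enum, auto

# Hypothetical names for the situations described above.
class Event(Enum):
    DRIVE_DESTROYED = auto()
    CHUNK_WRITTEN = auto()
    MIME_TYPE_ERROR = auto()

# Assumption: providers for which all-zero chunks are removed instead of stored.
ZERO_CHUNK_PROVIDERS = {"google_drive"}

def is_all_zero(chunk: bytes) -> bool:
    """Return True when every byte in the chunk is zero."""
    return not any(chunk)

def may_delete_chunk(event: Event, chunk: bytes, provider: str) -> bool:
    """Return True only for the three deletion cases listed above."""
    if event is Event.DRIVE_DESTROYED:
        return True
    if event is Event.CHUNK_WRITTEN:
        # Provider specific: only some providers drop all-zero chunks.
        return provider in ZERO_CHUNK_PROVIDERS and is_all_zero(chunk)
    if event is Event.MIME_TYPE_ERROR:
        # Google Drive only: the chunk is deleted, then re-uploaded.
        return provider == "google_drive"
    return False
```

Under these rules, a chunk that still holds real data and was written without a MIME type error should never qualify for deletion.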

However, what you're describing about the "0%" parts, I've talked with Alex already, and this is normal. At least for chunks with no data.

I've also flagged the issue for Alex and have marked it as critical. However, I am not sure how much we'll get without having the tracing enabled.

https://stablebit.com/Admin/IssueAnalysis/25946

That said, this is very very odd, and you're the only one to report something like this. And Alex has been doing extensive testing for at least the last few days.

Also, could you check the Event Viewer on the system? Check the "Windows Logs", and the "System" section. Check for any disk, NTFS or controller related errors. That or just export the entire log (unfiltered) and upload it using the same link you used for the logs. That way, we can take a look at it.

Also, I'd highly recommend running a memory test (extended), if you haven't already done so recently.

And it may be worth turning on Upload Verification, in this case. That downloads and checks the chunks after upload. That may help prevent issues from occurring.

modplan · March 8, 2016

Thanks Chris. Just to point some things out one by one:

To let you know, there are a few circumstances that a provider can delete chunks.

Destroying a drive

chunk contains only 0's (provider specific, Google Drive being one of them).

(Google Drive only) When the mime type error is generated, delete the chunk and re-upload it

These are the only times that it should ever delete a chunk on the pool. Period.

The drive was not destroyed, we were just uploading along.
As for only zeroes, well I guess that could be a possibility, but only if clouddrive was actively overwriting these previously uploaded chunks with all 0's and I don't see why it would do so.
No MIME errors were generated in the GUI (very familiar with this error, believe I was the first to report it) and the chunks were never re-uploaded, they're gone.

However, what you're describing about the "0%" parts, I've talked with Alex already, and this is normal. At least for chunks with no data.

The chunks had data, they had been uploaded over 24 hours prior, then I saw tons of writes at 0% and then Google Drive reported the chunks as "permanently deleted", I verified the exact chunk IDs I saw at 0% were the exact ones shown as deleted in the GDrive GUI. Unless you are commenting on the prefetch threads stuck at 0%? If so, the chunks not only have no data now, they are completely gone from Google Drive.

That said, this is very very odd, and you're the only one to report something like this. And Alex has been doing extensive testing for at least the last few days.

I agree, I've had to have uploaded 10-15TB to various drives on almost every BETA version testing this product over the past 3-4 months. This is the first time I've seen this. And to see it happen on its own in the middle of an upload cycle with no action or intervention on my part, is scary.

Also, could you check the Event Viewer on the system? Check the "Windows Logs", and the "System" section. Check for any disk, NTFS or controller related errors. That or just export the entire log (unfiltered) and upload it using the same link you used for the logs. That way, we can take a look at it.

Also, I'd highly recommend running a memory test (extended), if you haven't already done so recently.

Don't see anything out of the ordinary in 'System' nor 'Application' around the time this started. A memtest is likely worth doing, but I don't think a bad DIMM would cause clouddrive to suddenly go on a deletion spree. As we can see in the above google drive screenshot, google lists the application that initiated the action, in this case a permanent delete, and we can see "Stablebit Clo..."

And it may be worth turning on Upload Verification, in this case. That downloads and checks the chunks after upload. That may help prevent issues from occurring.

Upload verification is on and has been on since this drive's creation. In fact, after the writes that were at 0% that deleted the chunks were finished, Read verify threads were spawned, which were also stuck at 0%, effectively verifying the delete. It really seems to me that Cloud Drive really thought it should delete these chunks, and even verified doing so, but it caused corruption.

-------------------

I'm not terribly concerned about this specific data, it was a backup. All I will lose if I have to blow this drive away and start over is a few days of upload time. However I want to provide as much info as possible to hopefully root cause this and prevent this from ever happening again, to me, or anyone else. Right now I have files that show up in windows explorer that look normal (due to pinned metadata I assume) but that are completely corrupt. If I had not been watching "Technical Details" when this occurred and started investigating I would assume that all my data is perfectly safe right now in the cloud, it isn't.

-------------------

I resumed the upload cycle late last night several hours after my post, since I've pretty much written off this drive and all of its data, I figured I would let it resume uploading and see if it somehow magically re-uploaded all of the chunks it deleted. If it does, we still have a problem, since for about 24 hours these chunks will have existed neither in the cloud nor in the cache, so the files are corrupt, but I would be glad if all of my data did eventually become whole.

So far, no more deletions have occurred, all night we were chugging along uploading. Currently we are uploading in the chunk ID range of 57,XXX, a far cry from the ~28,800 to ~31,300 range that was deleted. But maybe we will circle back after all new data has been uploaded and re-upload these chunks? We'll see, in about 20 hours.

modplan · March 8, 2016

Possibly relevant

AFTER all data had been copied to this drive and was in the To Upload "cache", the drive was marked as read only with diskpart:

att vol set readonly

I do this semi-often with physical drives that are meant for archive purposes ONLY, in order to prevent any kind of corruption, accidental deletion, etc. They are only marked as RW when data needs to be written, then back to RO they go.

This is the first time I have done this on a CloudDrive drive. I do not think setting this NTFS flag should have any interaction with CloudDrive or cause any issues, but I wanted to point this out, since it is the only anomaly I can currently find that is different than my previous testing and uploads.

modplan · March 8, 2016

We just went on another chunk deletion spree, starting about 20 minutes ago according to the GDrive GUI.

I enabled Drive Tracing for about 5 minutes and let it continue deleting. I then paused uploads and am collecting/uploading logs.

Very very interested to see what is shown in the logs.

Christopher (Drashna) · March 8, 2016

Modplan:

As for the deletion, I just wanted to make sure you knew the cases in which we do delete files from the provider, and the *only* cases which it should.

As for the memory test, this is to make sure that the "WriteMap" and other in memory objects aren't getting corrupted by bad memory. And since NTFS uses the memory extensively for caching, it's a good idea in general.

As for the diskpart stuff, this was for the CloudDrive disk, correct?
If so, well I don't think this would cause that issue, but I've let Alex know, just in case it is.

And I'm sorry to hear that it's happened again, but I'm glad to hear you were able to enable logging when this happened. That should definitely help identify why this was happening. And I've flagged the logs for Alex already.

And hopefully, this is an easy to find issue.

modplan · March 8, 2016

Hey Chris, yes I just wanted to be as thorough as possible in my response to aid in getting to the bottom of this.

Yes the readonly flag was set on the clouddrive drive itself, via diskpart. I agree, from my limited understanding that this shouldn't cause an issue, but just wanted to provide all the info I thought of.

Hopefully the logs tell the full story.

Christopher (Drashna) · March 9, 2016

Well, we appreciate that, as it makes our jobs easier.

And yeah, the read only flag shouldn't affect it like this. At all.

However, are you using anything else that accesses your Google Drive account? If there is another program, it *could* be causing issues.

And right now, Alex is attempting to repeat the circumstances that led up to the issue, so it may be a bit before we can successfully identify the cause.

modplan · March 9, 2016

Nope nothing else really uses the account, Google Photos on my phone I guess would be about all. And as we can see in the screenshot, the app that initiated the delete was definitely CloudDrive. Glad Alex is taking a look!

Christopher (Drashna) · March 9, 2016

Okay, just wanted to be sure about that.

And yeah, that does appear to be the case.

And yeah, any sort of corruption issues like this, we definitely take seriously and try to look into. The biggest issue is reproducing it, and catching it while it happens. Otherwise, it's much harder to fix (or even identify).

modplan · March 9, 2016

Just for further info. The drive is now completely uploaded. "To Upload" = 0B

I was hoping at the end of the upload cycle CD may re-upload the chunks it previously deleted but that does not appear to be the case.

To Recap:

- ~775GB was copied to the cloud drive and we started uploading

- This specific cloud drive was then marked as read only with diskpart and has been that way the entire time since

- Some previously-uploaded chunks were deleted by clouddrive on 2 separate occasions during the upload cycle

- Each time I noticed deletions happening, I paused uploads for a while; on resume, normal uploads started, with no further deletions

- We are now fully uploaded (several days later) and "To Upload" = 0B

All files still show up in windows explorer (pinned metadata I assume) but any file that had a chunk in the deleted range is now of course corrupt.

No significant errors were thrown by the CD GUI during this time.

I'm hopeful Alex can repro, please let me know if I can provide more info.

Christopher (Drashna) · March 10, 2016

Thanks for the additional feedback and clarification.

And hopefully he can, or at least track down what may be causing the issue.

modplan · March 11, 2016

Chris,

Saw Alex's update on the issue analysis. Unable to update there. Please let him know there might have been a power outage several days/a week before the deletes started happening. I know we had a power outage during a storm last week. What I can't remember is whether or not I had created this drive and started uploading yet when the outage happened.

If the power outage did happen after the drive was created, it would have been after all data was copied to the drive and the drive was marked RO, since I remember monitoring the copy operation and the outage was either before that and before the drive was created, or afterwards during the upload cycle, sorry that I can't pinpoint it exactly.

Not sure if that matters, but again, just trying to provide more info.

Christopher (Drashna) · March 12, 2016

I've let Alex know, but I'm pretty sure that a power outage would not have affected the pool in this way.

And unfortunately, Alex has been having issues reproducing the issue (even creating the same spec'ed drive, actually)

But he is definitely actively looking into the issue.

modplan · March 14, 2016

Alex have any luck root causing this yet? Anything I can do or any tests I can run? CloudDrive is currently sitting idle for me until this is resolved, since I don't want to dedicate days of uploading again to have the drive become worthless.

Willing to dedicate some time doing whatever I can to sort this out if Alex could use anything from me.

Christopher (Drashna) · March 15, 2016

He's still working on it. Sorry.

However, he has found a few low level bugs because of this, and some of the integrity testing we've (he's) been doing (and some of this may be related to what you're seeing).

modplan · March 18, 2016

Thanks Chris. So I assume then that while my logs uncovered some other bugs for Alex, the fixes in and up to .518 probably haven't solved this particular issue yet?

Christopher (Drashna) · March 19, 2016

Not really.

However, in testing for this issue, and in starting to run integrity checks (check out the change logs for details), Alex did discover a number of bugs that ... well, could very well have caused this issue. (~50 builds that didn't come from user reported bugs)

Unfortunately, one of those required ... well, restarting the test from scratch. But he hasn't been able to reproduce the issue, unfortunately.

modplan · March 19, 2016

Thanks for the response Chris. Not sure I understand, likely my question wasn't clear haha. Do we think in the latest builds this issue is: Not Resolved? Possibly Resolved? Definitely Not Resolved?

If Alex hasn't been able to reproduce....maybe we have no clue and none of the above?

EDIT: Whoops never mind I think you are saying that Alex was on the road to reproducing this issue via test, hit some bugs that could have caused it, fixed some of those bugs, but had to restart the test and has not been able to repro it since? So we still aren't quite sure where we are regarding it? Anything I can do to test/help?

Christopher (Drashna) · March 19, 2016

Yes, that's about the gist of it.

He's still actively looking into the issue, and hoping to reproduce it (otherwise figuring out what is causing it is significantly more difficult)

And yes, along the way, he has run into a few other issues that he's patched, which may help out or possibly fix it. But not sure.

modplan · March 22, 2016

FYI I'm impatient so I am retrying with a new disk on .533 with the same dataset. Even if I do not see the problem again, I wouldn't be 100% confident the problem is solved, but I'm willing to spend a few days uploading to find out.

My question is, should I turn on drive tracing for the entire process? My C: drive is a small SSD, will the logs get massive and use up a lot of space?

Christopher (Drashna) · March 23, 2016

Yes, absolutely do! And stop the logging as SOON as you notice it happening.

And no, it won't fill up the drive. The logs cap out at 20MBs, IIRC, and the older content gets pruned as soon as it exceeds that amount (so old data is purged to make room for the new logs). This is why it's important to disable it as soon as this happens, so that it doesn't prune anything important.
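To see why stopping the trace quickly matters, here is a minimal sketch of a size-capped log that prunes its oldest entries once the cap is exceeded. It is an illustration of the behavior described above, not CloudDrive's logging code; the `CappedLog` class and its byte-based cap are assumptions made for the example.

```python
from collections import deque

class CappedLog:
    """Keep only the newest entries that fit within max_bytes."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.size = 0
        self.entries = deque()

    def append(self, line: str) -> None:
        data = line.encode("utf-8")
        self.entries.append(data)
        self.size += len(data)
        # Prune from the oldest end until the log fits again,
        # always keeping at least the newest entry.
        while self.size > self.max_bytes and len(self.entries) > 1:
            self.size -= len(self.entries.popleft())

    def lines(self) -> list:
        return [entry.decode("utf-8") for entry in self.entries]
```

Every entry written after the event of interest pushes older entries toward the pruned end, so the longer tracing keeps running, the more likely the relevant lines are gone.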

modplan · March 23, 2016

Thanks Chris, started the tracing. Not sure how quick I'll be able to catch it, uploading 24/7, but I'll keep an eye on it. No issues with the first 164GB so far.

Christopher (Drashna) · March 25, 2016

Well, hopefully, it just doesn't happen at all anymore!

Also, do let me know if there are any unexpected restarts (BSOD, reset button, CloudDrive service crashing, etc).

modplan · March 26, 2016

Unfortunately it happened again. And it was overnight (about 20 hours ago), so I am assuming the disk traces have wrapped?

Also it is worth noting that both times I was copying FROM the Google Drive Sync Folder TO Cloud Drive. I doubt this matters, but another data point. Again the range deleted was a range that was copied over and uploaded a day or so before. I'm going to run ExactFile on what I've copied so far to compare the MD5s and see exactly what has corrupted.
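For anyone without ExactFile, the same kind of comparison can be sketched in a few lines of Python. This is a rough equivalent under my own assumptions, not what ExactFile does internally: `find_mismatches`, `source_root` and `copy_root` are hypothetical names, and a file that is missing or unreadable on the copy side is simply counted as a mismatch.

```python
import hashlib
from pathlib import Path

def md5_of(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in blocks so large files do not need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def find_mismatches(source_root: Path, copy_root: Path) -> list:
    """Return relative paths whose copy is missing, unreadable, or different."""
    mismatches = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_root)
        dst = copy_root / rel
        try:
            same = dst.is_file() and md5_of(src) == md5_of(dst)
        except OSError:
            same = False  # an unreadable file is treated as a mismatch
        if not same:
            mismatches.append(rel.as_posix())
    return mismatches
```

Calling `find_mismatches(Path(source), Path(copy))` with the sync folder as the source and the cloud drive as the copy returns the list of files that no longer match.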

Christopher (Drashna) · March 28, 2016

Yeah, it would have wrapped, if it's been doing stuff afterwards.

And assuming 20MB chunk sizes, that's 14GBs of data, basically.

And since I'm assuming that you're not deleting contents from the CloudDrive disk...
