I/O deadlock?

dragon2611 · June 27, 2015

Is it possible to get into a situation where the I/O seems to deadlock? I have a 2012 R2 machine and was copying 300GB or so of data to OneDrive for Business, including lots of small files.

It looks like the copying has stopped, but I also cannot bring up Task Manager, and a shutdown -r -f -t 30 command is hanging. Hoping that it's CloudDrive and not one of the disks in the box failing.

Christopher (Drashna) · September 9, 2015

I am yes. Want the log file?

Yes, please:

http://wiki.covecube.com/StableBit_CloudDrive_Log_Collection

thnz · September 9, 2015

Done. Looks like the log files are capped at 20MB each and have been outputting lots of stuff. Hopefully the important stuff hasn't been cut off. The I/O errors started maybe 24 hours ago.

Kraevin · September 9, 2015

All of you are on the latest (1.0.0.378) build of StableBit CloudDrive, and are seeing this, correct?

Yes using 1.0.0.378

I can send the log file when I get home from work.

Edit: Uploading the log file. Also, the speed is a little better, but I am still getting the error as of 5:25 AM EST, which was 5 minutes ago.

triadcool · September 9, 2015

I'm on 1.0.0.378 as well. Was working fine for the first 8 hours then started erroring out.

thnz · September 9, 2015

It's sprung back to life over the past hour or two. The last I/O error was ~2hrs ago.

*edit* Has now sped up too.

thnz · September 9, 2015

I had a crash last night, though I'm unsure if it was related to CloudDrive or not, although CloudDrive is the only thing to have caused any crashes on that machine all year. I've got a memory dump, though I didn't see any BSOD messages as the screen wouldn't turn on. Looking at the event log, the last thing to happen was a wake by Windows to run scheduled maintenance tasks. The event log also mentions queuing the update to Windows 10 (the machine is currently running 8.1), though I wouldn't have thought that would have started without user intervention first, so it might just be incidental. I'll submit the dump in a bit, so hopefully that'll show whether it's CloudDrive related or not.

Also, after a few solid hours last night, it looks like the I/O errors are back again and speed has slowed right down.

Christopher (Drashna) · September 10, 2015

Thanks. I've flagged the issue for Alex, here:

https://stablebit.com/Admin/IssueAnalysis/20788

Kraevin · September 12, 2015

Any updates? I noticed today the errors are lower and upload speeds are a little better, which makes me wonder if this could be due to heavy traffic on Amazon's end, or them limiting it somehow.

Of course I could be totally wrong lol

thnz · September 12, 2015

I'm assuming it's either some kind of throttling, or CloudDrive erroneously aborting something that was still active, as Resource Monitor shows plenty of network activity while CloudDrive shows very slow progress. When it happens, the error seems to show up regularly (~2m intervals) during active uploads. Perhaps it's left over from the measures in place to counter throttling in v1?

That's my random, totally uninformed guesswork anyway!

thnz · September 13, 2015

To add to this, there appears to have been some data loss when this occurred, similar to what would happen after restarting from the I/O deadlock issue. A group of sequentially named files have become corrupted - i.e. they were copied to the disk in order, and I guess the corruption likely happened on the chunks involved. Those chunks would have been uploading at the time of the crash.

*Edit* I can reproduce this on another machine. Simply copy a folder to the cloud disk, say a video folder or something with a good selection of reasonably sized files. Hard restart the computer while the data is uploading. After recovery, there will be a group of sequentially named corrupt files. They stand out as they fail to generate thumbnails, and when checked their hashes differ from the original files.
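
The hash check described above could be scripted roughly like this. It is only a minimal sketch, assuming Python 3.9+ and that the original folder is still available locally; the function names and the example paths are illustrative and have nothing to do with CloudDrive itself.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB blocks so large videos are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_mismatches(source_dir: Path, copy_dir: Path) -> list[str]:
    """Return relative paths whose copy is missing or hashes differently."""
    mismatched = []
    for source_file in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = source_file.relative_to(source_dir)
        copied_file = copy_dir / relative
        if not copied_file.is_file() or sha256_of(source_file) != sha256_of(copied_file):
            mismatched.append(relative.as_posix())
    return mismatched


# Example call (hypothetical paths: the local original and the cloud drive copy):
#   for name in find_mismatches(Path(r"D:\Videos"), Path(r"X:\Videos")):
#       print("corrupt or missing:", name)
```

Because the results are sorted by path, a run of sequentially named corrupt files shows up as a contiguous group in the output.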

Christopher (Drashna) · September 16, 2015

This drive was created with an older version, right?

If so... yeah, Amazon has started autodetecting file types based on the contents (the beginning of the file), despite the fact that we use the API to tell it that it's binary data. This means that the files do get uploaded, but are not properly available to be retrieved.

So the best option is to download the latest build, delete the old drive, and create a new one.

Yes, this is BS, but this isn't anything we can easily fix. The fix for this is encrypting all the data for every disk using the Amazon Cloud Drive provider, and prepending null characters (to cover the chance that a random file will still match a file type). This unfortunately means that the provider is not backwards compatible.

And yes, we're not happy about having to do this (or about how Amazon is handling this, or their response time to our emails). "Disappointing" is the nice way to put how we feel. Alex is going to be writing a nice long post about everything that we've gone through with this provider, though you guys probably have a good idea already (I haven't exactly been very quiet about it).

thnz · September 16, 2015

It happened on a new Amazon Cloud Drive, created using .378. It also happens on other providers though - both encrypted and unencrypted - I've tested on Dropbox with the same result - so it isn't Amazon specific. It's much easier to reproduce by copying across a single large file, and hard restarting once it's finished copying and is well into uploading - when uploading a group of files, the corruption shows up in several places as it affects several files at once. It doesn't always happen though - it might take a couple of hard restarts before the corruption occurs.

https://imgur.com/sPTy1ip

Kraevin · September 16, 2015

I had this happen as well when Windows did an update and restarted. This was with version .376 though; I have not had the computer restart while it was uploading since trying out .378.

I ended up having to destroy the drive and start new, which was no problem since it was being used mainly for testing anyway. That's what beta is for right =)

Christopher (Drashna) · September 16, 2015

Could you enable logging and reproduce?

http://wiki.covecube.com/StableBit_CloudDrive_Log_Collection

thnz · September 16, 2015

It took several attempts, but I finally reproduced it again after lowering the local cache size to 20MB (though that could just be coincidence). 'Drive tracing' seems to have turned itself off after the hard restart, but hopefully it caught enough to be helpful. I've uploaded it via the form on that log collection page.

Quick summary of how I reproduced it:

New 1GB unencrypted drive on Dropbox with 20MB cache
Copied ~700MB file across
Hard reset after the file finished copying (i.e. disk activity in Resource Monitor had finished), but while it was still uploading (was maybe 50MB into the upload)
File is now corrupt (has different hash) after drive recovers
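
To see where in the file the damage sits, rather than only that the hash changed, the original and the recovered copy could be compared block by block. This is a rough sketch, assuming the original file is still available locally. The 1 MiB block size is an arbitrary reporting granularity chosen for the example, not CloudDrive's actual chunk size, and the function names are made up.

```python
from pathlib import Path

# Assumption: 1 MiB comparison blocks. This is NOT CloudDrive's chunk size,
# just a convenient granularity for reporting where the files differ.
BLOCK_SIZE = 1024 * 1024


def differing_blocks(original: Path, copy: Path, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the indexes of the blocks that differ between two files."""
    different = []
    with original.open("rb") as a, copy.open("rb") as b:
        index = 0
        while True:
            block_a = a.read(block_size)
            block_b = b.read(block_size)
            if not block_a and not block_b:
                break  # both files are exhausted
            if block_a != block_b:
                different.append(index)
            index += 1
    return different


def as_ranges(indexes: list[int]) -> list[tuple[int, int]]:
    """Collapse sorted block indexes into inclusive (first, last) runs."""
    ranges: list[tuple[int, int]] = []
    for i in indexes:
        if ranges and i == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], i)
        else:
            ranges.append((i, i))
    return ranges
```

A few long contiguous runs would be consistent with whole regions that were mid-upload at the reset, while scattered single blocks would point elsewhere. Either way it only narrows things down; it does not prove the cause.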

Just want to add that it's times like this (constant restarting) that you really appreciate having an SSD. Reboots so fast!

Christopher (Drashna) · September 17, 2015

To clarify, it still has data it's uploading, when it resets?

thnz · September 17, 2015

Data loss has a chance to occur if you manually hard restart the computer (or something else crashes it, it loses power, etc.) while CloudDrive is uploading data. CloudDrive itself doesn't cause a crash or restart. Data loss doesn't happen every time.

Christopher (Drashna) · September 19, 2015

Sorry, but in your case, is it losing power or hard resetting? Or is it doing a proper, graceful shutdown?

And any time the system hard resets/loses power, data corruption is always a possibility.

thnz · September 19, 2015

Data loss happened when restarting (manual power cycle as the system was unresponsive) after the system crash in post #107 (which hasn't happened again since), although it's very possible that the initial crash wasn't even CloudDrive related. I then found that by manually hard restarting (i.e. restart button on the case - a non-graceful restart) I could reproduce the data loss on data that was previously 'written' to the cloud drive and was currently being uploaded. In this instance, data loss only occurred the once following a system crash and non-graceful restart - the rest was me manually restarting non-gracefully in order to reproduce it.

I understand that data loss can be expected on power loss or a non-graceful shutdown - especially on active writes - and thankfully they happen very rarely. However, seeing as the data had been written a good day or two prior (although it was still uploading), I thought it reasonable for CloudDrive to be able to successfully recover and continue uploading from where it left off.

Also this might well be the same thing seen back in posts #19-#20.

Hopefully that clears things up

triadcool · September 19, 2015

Should we just stop using this product until a new version is released? Do we risk losing our current stablebit drives due to being incompatible with any future updates?

This is version 1.0.0.378

Christopher (Drashna) · September 20, 2015

Thank you. That definitely does clear it up, and I'll let Alex know.

I know that Alex did do a lot of testing when it came to "graceless"? shutdowns during development, specifically to prevent any sort of corruption issues.

But this could be caused by a change in the code (since the deadlock issue was a big one, it wouldn't surprise me).

Well, backwards compatibility is important for us.

As for a new version, that really depends. The newer version of the Amazon Cloud Drive provider should be stable and reliable (at least as reliable as the backend on Amazon's side).

As for the errors, is this with a newly created cloud drive or one from an older version?

I ask because we do have two different ones now (a deprecated one and the newer one).

triadcool · September 20, 2015

This is happening with a freshly created drive with the newest application version.

thnz · September 21, 2015

FWIW AWS had a big outage earlier in the day, though it is now apparently fixed - I'm not sure if this affected Amazon Cloud Drive, though as Amazon Cloud Drive is throwing errors (even in a web browser) I assume it did (and still is).

Christopher (Drashna) · September 22, 2015

Should we just stop using this product until a new version is released? Do we risk losing our current stablebit drives due to being incompatible with any future updates?

This is version 1.0.0.378

Clarification for this.... this is due to Amazon throttling the CloudDrive connection.

They finally have responded to some of our emails, and their response was basically "Thanks, we'll take this under advisement" (that's the TL;DR of it).

FWIW AWS had a big outage earlier in the day, though it is now apparently fixed - I'm not sure if this affected Amazon Cloud Drive, though as Amazon Cloud Drive is throwing errors (even in a web browser) I assume it did (and still is).

More than likely, yeah. Affected Netflix as well..... So, "ouch".

And yeah, they finally responded to some of our emails. Probably because they literally couldn't do anything else, during the outage....

But basically "We'll take the information into consideration", in regards to some of the networking and technical issues. Boilerplate responses.....

This is happening with a freshly created drive with the newest application version.

Okay, I'll see about duplicating this, and will let Alex know.

If it's very reproducible, it's an issue. Especially if it's with other providers as well.

thnz · September 22, 2015

Is this also throttling? It's been sitting like this all night with no progress.

Logfile is full of

CloudDrive.Service.exe	Warning	0	[WholeChunkIoImplementation] Error on read when performing master partial write. HTTP Status BadRequest	2015-09-22 21:01:07Z	216674686527
CloudDrive.Service.exe	Warning	0	[WholeChunkIoImplementation] Error on read when performing shared partial write. HTTP Status BadRequest	2015-09-22 21:01:07Z	216674806938
CloudDrive.Service.exe	Warning	0	[WholeChunkIoImplementation] Error on read when performing shared partial write. HTTP Status BadRequest	2015-09-22 21:01:07Z	216674809296
CloudDrive.Service.exe	Warning	0	[IoManager] Error performing I/O operation on provider. Retrying. HTTP Status BadRequest	2015-09-22 21:01:07Z	216674811931
CloudDrive.Service.exe	Warning	0	[WholeChunkIoImplementation] Error on read when performing shared partial write. HTTP Status BadRequest	2015-09-22 21:01:07Z	216674814877
CloudDrive.Service.exe	Warning	0	[WholeChunkIoImplementation] Error on read when performing shared partial write. HTTP Status BadRequest	2015-09-22 21:01:07Z	216674817146
CloudDrive.Service.exe	Warning	0	[IoManager] Error performing I/O operation on provider. Retrying. HTTP Status BadRequest	2015-09-22 21:01:07Z	216674820551
CloudDrive.Service.exe	Warning	0	[IoManager] Error performing I/O operation on provider. Retrying. HTTP Status BadRequest	2015-09-22 21:01:07Z	216674822976
CloudDrive.Service.exe	Warning	0	[IoManager] Error performing I/O operation on provider. Retrying. HTTP Status BadRequest	2015-09-22 21:01:07Z	216674825401
CloudDrive.Service.exe	Warning	0	[IoManager] Error performing I/O operation on provider. Retrying. HTTP Status BadRequest	2015-09-22 21:01:07Z	216674827225

This is a disk on Amazon Cloud Drive.
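
With a log this repetitive, a quick tally per component and message makes it easier to see what dominates. The following is a minimal sketch, assuming the lines are tab-separated exactly as in the excerpt above and that the fourth field is the message. The field layout is inferred from that excerpt, not from any documented log format, and the function name is made up.

```python
from collections import Counter


def summarize_log(lines):
    """Count log lines per (component, message) pair.

    Assumes tab-separated fields with the message in the fourth field,
    as in the excerpt; the layout is inferred, not documented.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        message = fields[3]
        component = ""
        # Messages in the excerpt start with a bracketed component name.
        if message.startswith("[") and "]" in message:
            component, _, message = message[1:].partition("]")
        counts[(component, message.strip())] += 1
    return counts


# Example use (the file name is hypothetical):
#   with open("CloudDrive.Service.log", encoding="utf-8", errors="replace") as log:
#       for (component, message), n in summarize_log(log).most_common():
#           print(n, component, message)
```

If the timestamps are kept alongside the counts, the same approach would also show whether the errors really recur at regular intervals.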
