Continuing the thread from the old forum: http://forum.covecube.com/discussion/1129/critical-bsod-when-accessing-dp-over-a-network-share
Just to recap, I've received a number of memory dumps over the past month or so showing a system crash in srv2.sys. Srv2.sys is Microsoft's file sharing driver that translates network file I/O requests into local file I/O requests on the server. The implication, of course, is that StableBit DrivePool may somehow be the cause of these crashes. The crashes only occur on Windows 8 / Windows Server 2012 (including Essentials).
Paris has submitted the best dump to date on this issue and I've analyzed it in detail.
I'm going to post a technical description of the crash for anyone who can read this sort of thing.
What we have
A full memory dump of the crash.
Verifier enabled on all drivers at the time of the crash (giving us additional data about the crash).
ETW logging enabled on CoveFS / CoveFSDisk, giving us everything that CoveFS did right before the crash.
The system
3: kd> vertarget
Windows 8 Kernel Version 9200 MP (4 procs) Free x64
Product: Server, suite: TerminalServer DataCenter SingleUserTS
Built by: 9200.16581.amd64fre.win8_gdr.130410-1505
Kernel base = 0xfffff800`65214000 PsLoadedModuleList = 0xfffff800`654e0a20
Debug session time: Fri May 31 04:45:08.610 2013 (UTC - 4:00)
System Uptime: 0 days 0:21:01.550
3: kd> !sysinfo cpuinfo
[CPU Information]
~MHz = REG_DWORD 3100
Component Information = REG_BINARY 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Configuration Data = REG_FULL_RESOURCE_DESCRIPTOR ff,ff,ff,ff,ff,ff,ff,ff,0,0,0,0,0,0,0,0
Identifier = REG_SZ Intel64 Family 6 Model 58 Stepping 9
ProcessorNameString = REG_SZ Intel(R) Core(TM) i7-3770S CPU @ 3.10GHz
Update Status = REG_DWORD 2
VendorIdentifier = REG_SZ GenuineIntel
MSR8B = REG_QWORD ffffffff00000000
The crash
3: kd> !Analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced. This cannot be protected by try-except,
it must be protected by a Probe. Typically the address is just plain bad or it
is pointing at freed memory.
Arguments:
Arg1: fffffa8303aa41ea, memory referenced.
Arg2: 0000000000000000, value 0 = read operation, 1 = write operation.
Arg3: fffff8800556e328, If non-zero, the instruction address which referenced the bad memory
address.
Arg4: 0000000000000000, (reserved)
Debugging Details:
------------------
READ_ADDRESS: fffffa8303aa41ea Nonpaged pool
FAULTING_IP:
srv2!Smb2ContinueUncachedRead+26c28
fffff880`0556e328 410fb644240a movzx eax,byte ptr [r12+0Ah]
MM_INTERNAL_CODE: 0
IMAGE_NAME: srv2.sys
DEBUG_FLR_IMAGE_TIMESTAMP: 51637dde
MODULE_NAME: srv2
FAULTING_MODULE: fffff880054ff000 srv2
DEFAULT_BUCKET_ID: WIN8_DRIVER_FAULT
BUGCHECK_STR: AV
PROCESS_NAME: System
CURRENT_IRQL: 0
TRAP_FRAME: fffff88004d3fb60 -- (.trap 0xfffff88004d3fb60)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=fffffa830396c150 rbx=0000000000000000 rcx=0000000000000000
rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
rip=fffff8800556e328 rsp=fffff88004d3fcf0 rbp=fffff9801e9e0af0
r8=00000000000006e5 r9=fffff88005544680 r10=fffffa83035a6c40
r11=0000000000000001 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei ng nz na po cy
srv2!Smb2ContinueUncachedRead+0x26c28:
fffff880`0556e328 410fb644240a movzx eax,byte ptr [r12+0Ah] ds:00000000`0000000a=??
Resetting default scope
LAST_CONTROL_TRANSFER: from fffff8006532f3f1 to fffff8006526e440
STACK_TEXT:
fffff880`04d3f978 fffff800`6532f3f1 : 00000000`00000050 fffffa83`03aa41ea 00000000`00000000 fffff880`04d3fb60 : nt!KeBugCheckEx
fffff880`04d3f980 fffff800`652a8acb : 00000000`00000000 fffffa83`03aa41ea fffffa83`035be040 fffff880`03cc4419 : nt! ?? ::FNODOBFM::`string'+0x33c2b
fffff880`04d3fa20 fffff800`6526beee : 00000000`00000000 fffff980`254b6950 fffff980`1e9e0b00 fffff880`04d3fb60 : nt!MmAccessFault+0x55b
fffff880`04d3fb60 fffff880`0556e328 : 00000000`00000000 fffff880`00000000 ffff2f2d`390b1a54 fffff980`01dc8f20 : nt!KiPageFault+0x16e
fffff880`04d3fcf0 fffff880`055470de : fffffa83`03c3d1e0 fffff980`01dc8f20 fffff980`254b6950 fffffa83`01f99040 : srv2!Smb2ContinueUncachedRead+0x26c28
fffff880`04d3fd50 fffff880`055455bd : 00000000`00000002 fffffa83`01f99040 fffff980`254b6c60 fffff800`6524acbe : srv2!Smb2ExecuteRead+0x6ce
fffff880`04d3fde0 fffff880`05545a64 : fffffa83`0084cd18 fffff980`254b6950 fffff980`1b44efd0 fffff980`254b6950 : srv2!Smb2ExecuteProviderCallback+0x6d
fffff880`04d3fe50 fffff880`05543180 : fffff980`1b54cd80 fffff980`1b54cd80 00000000`00000001 fffff980`254b6950 : srv2!SrvProcessPacket+0xed
fffff880`04d3ff10 fffff800`65268b27 : fffff880`04d3ff01 00000000`00000000 fffff980`254b6960 fffff880`05546000 : srv2!SrvProcpWorkerThreadProcessWorkItems+0x171
fffff880`04d3ff80 fffff800`65268aed : fffff980`1b54cd01 00000000`0000c000 00000000`00000003 fffff800`652c3ab8 : nt!KxSwitchKernelStackCallout+0x27
fffff880`07f9c9e0 fffff800`652c3ab8 : fffffa83`00000012 fffff980`1b54cd01 00000000`00000006 fffff880`07f97000 : nt!KiSwitchKernelStackContinue
fffff880`07f9ca00 fffff800`652c63f5 : fffff880`05543010 fffff980`1b54cd80 fffff980`1b54cd00 fffff980`00000000 : nt!KeExpandKernelStackAndCalloutInternal+0x218
fffff880`07f9cb00 fffff880`05500da5 : fffff980`1b54cd80 fffff980`1b54cd00 fffff980`1b54cd80 fffff800`65277cc4 : nt!KeExpandKernelStackAndCalloutEx+0x25
fffff880`07f9cb40 fffff800`652ac2b1 : fffffa83`035be040 fffff880`05546000 fffff980`1b54cde0 fffff900`00000000 : srv2!SrvProcWorkerThreadCommon+0x75
fffff880`07f9cb80 fffff800`65241045 : fffffa83`03058660 00000000`00000080 fffff800`652ac170 fffffa83`035be040 : nt!ExpWorkerThread+0x142
fffff880`07f9cc10 fffff800`652f5766 : fffff800`6550c180 fffffa83`035be040 fffff800`65566880 fffffa83`0088d980 : nt!PspSystemThreadStartup+0x59
fffff880`07f9cc60 00000000`00000000 : fffff880`07f9d000 fffff880`07f97000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16
STACK_COMMAND: kb
FOLLOWUP_IP:
srv2!Smb2ContinueUncachedRead+26c28
fffff880`0556e328 410fb644240a movzx eax,byte ptr [r12+0Ah]
SYMBOL_STACK_INDEX: 4
SYMBOL_NAME: srv2!Smb2ContinueUncachedRead+26c28
FOLLOWUP_NAME: MachineOwner
BUCKET_ID_FUNC_OFFSET: 26c28
FAILURE_BUCKET_ID: AV_VRF_srv2!Smb2ContinueUncachedRead
BUCKET_ID: AV_VRF_srv2!Smb2ContinueUncachedRead
Followup: MachineOwner
---------
From the auto analysis we can see that memory address 0xfffffa8303aa41ea was being accessed by code at address 0xfffff8800556e328.
We can also see that the function that crashed the system is srv2!Smb2ContinueUncachedRead.
We can check that memory address and it is indeed invalid.
What was srv2 trying to do?
So the next question to ask is: what was srv2 trying to do, and why did it fail?
I've gone ahead and decompiled the portion of srv2 that is causing the crash. The relevant code works with two MDL variables, which I'll call mdl1 and mdl2.
An MDL is a kernel structure that simply describes some memory (http://msdn.microsoft.com/en-us/library/windows/hardware/ff554414(v=vs.85).aspx).
The crash occurs at the point where the function tries to access the MdlFlags member of mdl1 (mdl1->MdlFlags). Since mdl1 points to an invalid memory address, we can't read the flags in.
The assembly instructions look like this:
srv2!Smb2ContinueUncachedRead+0x26c28:
fffff880`0556e328 410fb644240a movzx eax,byte ptr [r12+0Ah]
fffff880`0556e32e a804 test al,4
fffff880`0556e330 0f853294fdff jne srv2!Smb2ContinueUncachedRead+0x68 (fffff880`05547768)
r12 is mdl1, and we crash when trying to read in the flags.
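For reference, here is a minimal C rendering of what those instructions do, based on the disassembly and the x64 MDL layout (MdlFlags is the 16-bit field at offset 0x0A). This is my reconstruction, not srv2's actual source:
#include <wdm.h>

// Reconstruction of the faulting check, inferred from the disassembly:
//   movzx eax, byte ptr [r12+0Ah]  ; read the low byte of mdl1->MdlFlags
//   test  al, 4                    ; 0x4 == MDL_SOURCE_IS_NONPAGED_POOL
//   jne   srv2!Smb2ContinueUncachedRead+0x68
// r12 holds mdl1; the very first read of MdlFlags faults because mdl1 points
// at freed pool memory.
VOID SketchOfFaultingCheck(PMDL mdl1)
{
    if (mdl1->MdlFlags & MDL_SOURCE_IS_NONPAGED_POOL) {
        // ... take the branch to Smb2ContinueUncachedRead+0x68 ...
    }
}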
The connection to Fast I/O
In every single crash dump that I've seen, the crash always occurs after a successful (non-waiting) Fast I/O read. In fact, the function that calls the crashing function (srv2!Smb2ExecuteRead+0x6ce) has an explicit condition to test for this.
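For context, "non-waiting" refers to the Wait parameter of the file system's Fast I/O read callback: srv2 calls it with Wait = FALSE and, if it returns TRUE, the read was satisfied immediately without an IRP. The callback shape below follows the documented FAST_IO_READ type from wdm.h; the parameter comments are my annotations. Note that the caller passes a plain buffer pointer, so no MDL crosses this interface:
#include <wdm.h>

// Fast I/O read callback, following the documented FAST_IO_READ type:
BOOLEAN
ExampleFastIoRead(
    PFILE_OBJECT FileObject,    // file being read over SMB
    PLARGE_INTEGER FileOffset,  // where to read from
    ULONG Length,               // how much to read
    BOOLEAN Wait,               // FALSE for a non-waiting read
    ULONG LockKey,
    PVOID Buffer,               // plain caller-supplied buffer, no MDL
    PIO_STATUS_BLOCK IoStatus,  // receives the number of bytes read on success
    PDEVICE_OBJECT DeviceObject);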
Where did mdl1 go?
So the question is, why is mdl1 invalid? Did it exist before and was freed, or was there some kind of memory corruption?
Here are my observations on this:
In every dump that I've seen, the addresses look right. What I mean by that is that the seemingly invalid mdl1 address falls roughly into the same address range as mdl2; it always starts with the expected prefix and always ends in 1e0.
If this crash were due to faulty RAM, I would expect this address to fluctuate wildly.
The crash always occurs in the same place (plus or minus a few lines of code).
To me, this indicates that there is a bug somewhere.
Based on these observations I'm assuming that the mdl1 address itself is genuine (not corrupted), and so the memory that it points to must have been previously freed.
But who freed it?
We can answer that with a simple verifier query on the MDL's base address (the faulting address 0xfffffa8303aa41ea minus the 0x0A offset of MdlFlags, i.e. 0xfffffa8303aa41e0):
3: kd> !verifier 0x80 fffffa8303aa41e0
Log of recent kernel pool Allocate and Free operations:
There are up to 0x10000 entries in the log.
Parsing 0x0000000000010000 log entries, searching for address 0xfffffa8303aa41e0.
======================================================================
Pool block fffffa83`03aa3000, Size 00000000000018e0, Thread fffff80065566880
fffff80065864a32 nt!VfFreePoolNotification+0x4a
fffff80065486992 nt!ExFreePool+0x8a0
fffff80065855597 nt!VerifierExFreePoolWithTag+0x47
fffff880013b32bf vmbkmcl!VmbChannelPacketComplete+0x1df
fffff88003f91997 netvsc63!NvscMicroportCompleteMessage+0x67
fffff88003f916a3 netvsc63!ReceivePacketMessage+0x1e3
fffff88003f913ff netvsc63!NvscKmclProcessPacket+0x23f
fffff880013b2844 vmbkmcl!InpProcessQueue+0x164
fffff880013b402f vmbkmcl!InpFillAndProcessQueue+0x6f
fffff880013b7cb6 vmbkmcl! ?? ::FNODOBFM::`string'+0xb16
fffff880014790d7 vmbus!ChildInterruptDpc+0xc7
fffff80065296ca1 nt!KiExecuteAllDpcs+0x191
fffff800652968e0 nt!KiRetireDpcList+0xd0
======================================================================
Pool block fffffa8303aa3000, Size 00000000000018d0, Thread fffff80065566880
fffff80065855a5d nt!VeAllocatePoolWithTagPriority+0x2d1
fffff88001058665 VerifierExt!ExAllocatePoolWithTagPriority_internal_wrapper+0x49
fffff80065855f02 nt!VerifierExAllocatePoolEx+0x2a
fffff880013b2681 vmbkmcl!InpFillQueue+0x641
fffff880013b4004 vmbkmcl!InpFillAndProcessQueue+0x44
fffff880013b7cb6 vmbkmcl! ?? ::FNODOBFM::`string'+0xb16
fffff880014790d7 vmbus!ChildInterruptDpc+0xc7
fffff80065296ca1 nt!KiExecuteAllDpcs+0x191
fffff800652968e0 nt!KiRetireDpcList+0xd0
fffff800652979ba nt!KiIdleLoop+0x5a
======================================================================
The memory was originally allocated by vmbkmcl.sys, and had already been freed by the time of the crash.
Googling, I found that vmbkmcl.sys is a "Hyper-V VMBus KMCL", and netvsc63.sys is the "Virtual NDIS6.3 Miniport".
File times
Here are the file times of the drivers that are involved in this complicated interaction.
Possible sequence of events
In short, it seems to me that the following happened (a rough code illustration follows this list):
1. Some memory was allocated to process a network request.
2. That memory was passed to srv2.sys, which started processing the request.
3. The original driver decided that the memory was no longer needed and freed it.
4. srv2.sys, unaware of the free, is still trying to access the now-freed memory.
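To make the suspected pattern concrete, here is a rough, hypothetical illustration in C. None of this is actual vmbkmcl or srv2 source; the helper name, the pool tag and the flow are invented, and only the allocation size is taken from the verifier log above:
#include <ntddk.h>

// Hypothetical hand-off helper -- stands in for the real vmbkmcl/srv2 interaction.
static VOID HandOffToSrv2(PVOID Packet) { UNREFERENCED_PARAMETER(Packet); }

VOID IllustrateSuspectedUseAfterFree(VOID)
{
    // 1. The networking side allocates a receive buffer (size as seen in the
    //    verifier log; the 'UAFx' tag is invented).
    PVOID packet = ExAllocatePoolWithTagPriority(
                       NonPagedPoolNx, 0x18d0, 'UAFx', NormalPoolPriority);
    if (packet == NULL) {
        return;
    }

    // 2. The buffer (and an MDL describing part of it) is handed off so that
    //    srv2 can satisfy an SMB2 read from it.
    HandOffToSrv2(packet);

    // 3. The networking side decides it is done with the packet and frees it,
    //    even though srv2 still holds a pointer into the block.
    ExFreePoolWithTag(packet, 'UAFx');

    // 4. srv2 later reads mdl1->MdlFlags inside the freed block, which Driver
    //    Verifier turns into PAGE_FAULT_IN_NONPAGED_AREA (0x50).
}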
Workarounds
As a potential workaround, turning off Fast I/O should prevent the code that is causing the problem from running.
DrivePool 2.0 doesn't yet contain a setting for this, but I'll add one in the next build. Turning on Network I/O Boost should also prevent the problem, because when it's enabled we do some extra processing on networked read requests that bypasses Fast I/O.
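Mechanically, disabling Fast I/O in a file system driver just means declining the fast path, which makes callers (including srv2) fall back to a regular IRP_MJ_READ. Here is a minimal sketch of such a switch, assuming a hypothetical gDisableFastIo setting; this is not the actual CoveFS code or setting name:
#include <wdm.h>

BOOLEAN gDisableFastIo = TRUE;   // hypothetical setting, not the real CoveFS name

// Declining a Fast I/O read: returning FALSE tells the caller that the fast
// path is not possible, so it builds a normal IRP_MJ_READ instead.
BOOLEAN
ExampleFastIoReadImpl(
    PFILE_OBJECT FileObject, PLARGE_INTEGER FileOffset, ULONG Length,
    BOOLEAN Wait, ULONG LockKey, PVOID Buffer,
    PIO_STATUS_BLOCK IoStatus, PDEVICE_OBJECT DeviceObject)
{
    if (gDisableFastIo) {
        return FALSE;   // "fast path not possible" -> caller falls back to an IRP
    }

    // ... the real handler would copy the data into Buffer, fill in IoStatus
    // and return TRUE; that part is elided from this sketch ...
    return FALSE;
}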
Connection to DrivePool
I'm still trying to find a connection to DrivePool in all of this, but so far I can't find one. I still can't reproduce this crash on any of the test servers here (4 of them running the Windows 8 kernel), nor can I reproduce it on any of my VMs (running under VirtualBox).
Fast I/O doesn't deal with MDLs at all, so DrivePool never sees any of the variables that I've talked about here. The Fast I/O code in CoveFS is fairly short and easy to check.
Because of the potential Hyper-V connection shown above, I'll try to test for this on a Hyper-V VM.
As far as what DrivePool was doing right before the crash, I can see from the in-memory logs that it has just completed a Fast I/O read request and has successfully read in 262,144 bytes.
Because I don't have a definitive reproducible case, I can't be 100% certain as to what the issue is. I'll keep digging and will let you guys know if I find anything else.