Continuing the thread from the old forum: http://forum.covecube.com/discussion/1129/critical-bsod-when-accessing-dp-over-a-network-share
Just to recap, I've received a number of memory dumps over the past month or so showing a system crash in srv2.sys. Srv2.sys is Microsoft's file sharing driver that translates network file I/O requests into local file I/O requests on the server. The implication, of course, is that StableBit DrivePool may somehow be the cause of these crashes. The crashes only occur on Windows 8 / Windows Server 2012 (including Essentials).
Paris has submitted the best dump to date on this issue and I've analyzed it in detail.
I'm going to post a technical description of the crash for anyone who can read this sort of thing.
What we have
A full memory dump of the crash.
Verifier enabled on all drivers at the time of the crash (giving us additional data about the crash).
ETW logging enabled on CoveFS / CoveFSDisk, giving us everything that CoveFS did right before the crash.
The system
3: kd> vertarget
Windows 8 Kernel Version 9200 MP (4 procs) Free x64
Product: Server, suite: TerminalServer DataCenter SingleUserTS
Built by: 9200.16581.amd64fre.win8_gdr.130410-1505
Kernel base = 0xfffff800`65214000 PsLoadedModuleList = 0xfffff800`654e0a20
Debug session time: Fri May 31 04:45:08.610 2013 (UTC - 4:00)
System Uptime: 0 days 0:21:01.550
3: kd> !sysinfo cpuinfo
[CPU Information]
~MHz = REG_DWORD 3100
Component Information = REG_BINARY 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Configuration Data = REG_FULL_RESOURCE_DESCRIPTOR ff,ff,ff,ff,ff,ff,ff,ff,0,0,0,0,0,0,0,0
Identifier = REG_SZ Intel64 Family 6 Model 58 Stepping 9
ProcessorNameString = REG_SZ Intel(R) Core(TM) i7-3770S CPU @ 3.10GHz
Update Status = REG_DWORD 2
VendorIdentifier = REG_SZ GenuineIntel
MSR8B = REG_QWORD ffffffff00000000
The crash
3: kd> !Analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced. This cannot be protected by try-except,
it must be protected by a Probe. Typically the address is just plain bad or it
is pointing at freed memory.
Arguments:
Arg1: fffffa8303aa41ea, memory referenced.
Arg2: 0000000000000000, value 0 = read operation, 1 = write operation.
Arg3: fffff8800556e328, If non-zero, the instruction address which referenced the bad memory
address.
Arg4: 0000000000000000, (reserved)
Debugging Details:
------------------
READ_ADDRESS: fffffa8303aa41ea Nonpaged pool
FAULTING_IP:
srv2!Smb2ContinueUncachedRead+26c28
fffff880`0556e328 410fb644240a movzx eax,byte ptr [r12+0Ah]
MM_INTERNAL_CODE: 0
IMAGE_NAME: srv2.sys
DEBUG_FLR_IMAGE_TIMESTAMP: 51637dde
MODULE_NAME: srv2
FAULTING_MODULE: fffff880054ff000 srv2
DEFAULT_BUCKET_ID: WIN8_DRIVER_FAULT
BUGCHECK_STR: AV
PROCESS_NAME: System
CURRENT_IRQL: 0
TRAP_FRAME: fffff88004d3fb60 -- (.trap 0xfffff88004d3fb60)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=fffffa830396c150 rbx=0000000000000000 rcx=0000000000000000
rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
rip=fffff8800556e328 rsp=fffff88004d3fcf0 rbp=fffff9801e9e0af0
r8=00000000000006e5 r9=fffff88005544680 r10=fffffa83035a6c40
r11=0000000000000001 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei ng nz na po cy
srv2!Smb2ContinueUncachedRead+0x26c28:
fffff880`0556e328 410fb644240a movzx eax,byte ptr [r12+0Ah] ds:00000000`0000000a=??
Resetting default scope
LAST_CONTROL_TRANSFER: from fffff8006532f3f1 to fffff8006526e440
STACK_TEXT:
fffff880`04d3f978 fffff800`6532f3f1 : 00000000`00000050 fffffa83`03aa41ea 00000000`00000000 fffff880`04d3fb60 : nt!KeBugCheckEx
fffff880`04d3f980 fffff800`652a8acb : 00000000`00000000 fffffa83`03aa41ea fffffa83`035be040 fffff880`03cc4419 : nt! ?? ::FNODOBFM::`string'+0x33c2b
fffff880`04d3fa20 fffff800`6526beee : 00000000`00000000 fffff980`254b6950 fffff980`1e9e0b00 fffff880`04d3fb60 : nt!MmAccessFault+0x55b
fffff880`04d3fb60 fffff880`0556e328 : 00000000`00000000 fffff880`00000000 ffff2f2d`390b1a54 fffff980`01dc8f20 : nt!KiPageFault+0x16e
fffff880`04d3fcf0 fffff880`055470de : fffffa83`03c3d1e0 fffff980`01dc8f20 fffff980`254b6950 fffffa83`01f99040 : srv2!Smb2ContinueUncachedRead+0x26c28
fffff880`04d3fd50 fffff880`055455bd : 00000000`00000002 fffffa83`01f99040 fffff980`254b6c60 fffff800`6524acbe : srv2!Smb2ExecuteRead+0x6ce
fffff880`04d3fde0 fffff880`05545a64 : fffffa83`0084cd18 fffff980`254b6950 fffff980`1b44efd0 fffff980`254b6950 : srv2!Smb2ExecuteProviderCallback+0x6d
fffff880`04d3fe50 fffff880`05543180 : fffff980`1b54cd80 fffff980`1b54cd80 00000000`00000001 fffff980`254b6950 : srv2!SrvProcessPacket+0xed
fffff880`04d3ff10 fffff800`65268b27 : fffff880`04d3ff01 00000000`00000000 fffff980`254b6960 fffff880`05546000 : srv2!SrvProcpWorkerThreadProcessWorkItems+0x171
fffff880`04d3ff80 fffff800`65268aed : fffff980`1b54cd01 00000000`0000c000 00000000`00000003 fffff800`652c3ab8 : nt!KxSwitchKernelStackCallout+0x27
fffff880`07f9c9e0 fffff800`652c3ab8 : fffffa83`00000012 fffff980`1b54cd01 00000000`00000006 fffff880`07f97000 : nt!KiSwitchKernelStackContinue
fffff880`07f9ca00 fffff800`652c63f5 : fffff880`05543010 fffff980`1b54cd80 fffff980`1b54cd00 fffff980`00000000 : nt!KeExpandKernelStackAndCalloutInternal+0x218
fffff880`07f9cb00 fffff880`05500da5 : fffff980`1b54cd80 fffff980`1b54cd00 fffff980`1b54cd80 fffff800`65277cc4 : nt!KeExpandKernelStackAndCalloutEx+0x25
fffff880`07f9cb40 fffff800`652ac2b1 : fffffa83`035be040 fffff880`05546000 fffff980`1b54cde0 fffff900`00000000 : srv2!SrvProcWorkerThreadCommon+0x75
fffff880`07f9cb80 fffff800`65241045 : fffffa83`03058660 00000000`00000080 fffff800`652ac170 fffffa83`035be040 : nt!ExpWorkerThread+0x142
fffff880`07f9cc10 fffff800`652f5766 : fffff800`6550c180 fffffa83`035be040 fffff800`65566880 fffffa83`0088d980 : nt!PspSystemThreadStartup+0x59
fffff880`07f9cc60 00000000`00000000 : fffff880`07f9d000 fffff880`07f97000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16
STACK_COMMAND: kb
FOLLOWUP_IP:
srv2!Smb2ContinueUncachedRead+26c28
fffff880`0556e328 410fb644240a movzx eax,byte ptr [r12+0Ah]
SYMBOL_STACK_INDEX: 4
SYMBOL_NAME: srv2!Smb2ContinueUncachedRead+26c28
FOLLOWUP_NAME: MachineOwner
BUCKET_ID_FUNC_OFFSET: 26c28
FAILURE_BUCKET_ID: AV_VRF_srv2!Smb2ContinueUncachedRead
BUCKET_ID: AV_VRF_srv2!Smb2ContinueUncachedRead
Followup: MachineOwner
---------
From the auto analysis we can see that memory address 0xfffffa8303aa41ea was being accessed by code at address 0xfffff8800556e328.
We can also see that the function that crashed the system is srv2!Smb2ContinueUncachedRead.
We can check that memory address and it is indeed invalid.
What was srv2 trying to do?
So the next question to ask is: what was srv2 trying to do, and why did it fail?
I've gone ahead and decompiled the portion of srv2 that is causing the crash. The relevant code works with two MDL variables, which I'll call mdl1 and mdl2.
An MDL is a kernel structure that simply describes some memory (http://msdn.microsoft.com/en-us/library/windows/hardware/ff554414(v=vs.85).aspx).
The crash occurs at the point where the function tries to access the MdlFlags member of mdl1 (mdl1->MdlFlags). Since mdl1 points to an invalid memory address, we can't read the flags in.
The assembly instructions look like this:
srv2!Smb2ContinueUncachedRead+0x26c28:
fffff880`0556e328 410fb644240a movzx eax,byte ptr [r12+0Ah]
fffff880`0556e32e a804 test al,4
fffff880`0556e330 0f853294fdff jne srv2!Smb2ContinueUncachedRead+0x68 (fffff880`05547768)
r12 is mdl1, and we crash when trying to read in the flags.
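For reference, here is a minimal C rendering of what those instructions do, based on the disassembly and the x64 MDL layout (MdlFlags is the 16-bit field at offset 0x0A). This is my reconstruction, not srv2's actual source:
#include <wdm.h>

// Reconstruction of the faulting check, inferred from the disassembly:
//   movzx eax, byte ptr [r12+0Ah]  ; read the low byte of mdl1->MdlFlags
//   test  al, 4                    ; 0x4 == MDL_SOURCE_IS_NONPAGED_POOL
//   jne   srv2!Smb2ContinueUncachedRead+0x68
// r12 holds mdl1; the very first read of MdlFlags faults because mdl1 points
// at freed pool memory.
VOID SketchOfFaultingCheck(PMDL mdl1)
{
    if (mdl1->MdlFlags & MDL_SOURCE_IS_NONPAGED_POOL) {
        // ... take the branch to Smb2ContinueUncachedRead+0x68 ...
    }
}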
The connection to Fast I/O
In every single crash dump that I've seen, the crash always occurs after a successful (non-waiting) Fast I/O read. In fact, the function that calls the crashing function (srv2!Smb2ExecuteRead+0x6ce) has an explicit condition to test for this.
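For context, "non-waiting" refers to the Wait parameter of the file system's Fast I/O read callback: srv2 calls it with Wait = FALSE and, if it returns TRUE, the read was satisfied immediately without an IRP. The callback shape below follows the documented FAST_IO_READ type from wdm.h; the parameter comments are my annotations. Note that the caller passes a plain buffer pointer, so no MDL crosses this interface:
#include <wdm.h>

// Fast I/O read callback, following the documented FAST_IO_READ type:
BOOLEAN
ExampleFastIoRead(
    PFILE_OBJECT FileObject,    // file being read over SMB
    PLARGE_INTEGER FileOffset,  // where to read from
    ULONG Length,               // how much to read
    BOOLEAN Wait,               // FALSE for a non-waiting read
    ULONG LockKey,
    PVOID Buffer,               // plain caller-supplied buffer, no MDL
    PIO_STATUS_BLOCK IoStatus,  // receives the number of bytes read on success
    PDEVICE_OBJECT DeviceObject);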
Where did mdl1 go?
So the question is, why is mdl1 invalid? Did it exist before and was freed, or was there some kind of memory corruption?
Here are my observations on this:
In every dump that I've seen, the addresses look right. What I mean by that is that the seemingly invalid mdl1 address falls roughly into the same address range as mdl2; it always starts with the expected prefix and always ends in 1e0.
If this crash were due to faulty RAM, I would expect this address to fluctuate wildly.
The crash always occurs in the same place (plus or minus a few lines of code).
To me, this indicates that there is a bug somewhere.
Based on these observations I'm assuming that the mdl1 address itself is genuine (not corrupted), and so the memory that it points to must have been previously freed.
But who freed it?
We can answer that with a simple verifier query on the MDL's base address (the faulting address 0xfffffa8303aa41ea minus the 0x0A offset of MdlFlags, i.e. 0xfffffa8303aa41e0):
3: kd> !verifier 0x80 fffffa8303aa41e0
Log of recent kernel pool Allocate and Free operations:
There are up to 0x10000 entries in the log.
Parsing 0x0000000000010000 log entries, searching for address 0xfffffa8303aa41e0.
======================================================================
Pool block fffffa83`03aa3000, Size 00000000000018e0, Thread fffff80065566880
fffff80065864a32 nt!VfFreePoolNotification+0x4a
fffff80065486992 nt!ExFreePool+0x8a0
fffff80065855597 nt!VerifierExFreePoolWithTag+0x47
fffff880013b32bf vmbkmcl!VmbChannelPacketComplete+0x1df
fffff88003f91997 netvsc63!NvscMicroportCompleteMessage+0x67
fffff88003f916a3 netvsc63!ReceivePacketMessage+0x1e3
fffff88003f913ff netvsc63!NvscKmclProcessPacket+0x23f
fffff880013b2844 vmbkmcl!InpProcessQueue+0x164
fffff880013b402f vmbkmcl!InpFillAndProcessQueue+0x6f
fffff880013b7cb6 vmbkmcl! ?? ::FNODOBFM::`string'+0xb16
fffff880014790d7 vmbus!ChildInterruptDpc+0xc7
fffff80065296ca1 nt!KiExecuteAllDpcs+0x191
fffff800652968e0 nt!KiRetireDpcList+0xd0
======================================================================
Pool block fffffa8303aa3000, Size 00000000000018d0, Thread fffff80065566880
fffff80065855a5d nt!VeAllocatePoolWithTagPriority+0x2d1
fffff88001058665 VerifierExt!ExAllocatePoolWithTagPriority_internal_wrapper+0x49
fffff80065855f02 nt!VerifierExAllocatePoolEx+0x2a
fffff880013b2681 vmbkmcl!InpFillQueue+0x641
fffff880013b4004 vmbkmcl!InpFillAndProcessQueue+0x44
fffff880013b7cb6 vmbkmcl! ?? ::FNODOBFM::`string'+0xb16
fffff880014790d7 vmbus!ChildInterruptDpc+0xc7
fffff80065296ca1 nt!KiExecuteAllDpcs+0x191
fffff800652968e0 nt!KiRetireDpcList+0xd0
fffff800652979ba nt!KiIdleLoop+0x5a
======================================================================
The memory was originally allocated by vmbkmcl.sys, and had already been freed by the time of the crash.
Googling, I found that vmbkmcl.sys is a "Hyper-V VMBus KMCL", and netvsc63.sys is the "Virtual NDIS6.3 Miniport".
File times
Here are the file times of the drivers that are involved in this complicated interaction.
Possible sequence of events
In short, it seems to me that the following happened (a rough code illustration follows this list):
1. Some memory was allocated to process a network request.
2. That memory was passed to srv2.sys, which started processing the request.
3. The original driver decided that the memory was no longer needed and freed it.
4. srv2.sys, unaware of the free, is still trying to access the now-freed memory.
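To make the suspected pattern concrete, here is a rough, hypothetical illustration in C. None of this is actual vmbkmcl or srv2 source; the helper name, the pool tag and the flow are invented, and only the allocation size is taken from the verifier log above:
#include <ntddk.h>

// Hypothetical hand-off helper -- stands in for the real vmbkmcl/srv2 interaction.
static VOID HandOffToSrv2(PVOID Packet) { UNREFERENCED_PARAMETER(Packet); }

VOID IllustrateSuspectedUseAfterFree(VOID)
{
    // 1. The networking side allocates a receive buffer (size as seen in the
    //    verifier log; the 'UAFx' tag is invented).
    PVOID packet = ExAllocatePoolWithTagPriority(
                       NonPagedPoolNx, 0x18d0, 'UAFx', NormalPoolPriority);
    if (packet == NULL) {
        return;
    }

    // 2. The buffer (and an MDL describing part of it) is handed off so that
    //    srv2 can satisfy an SMB2 read from it.
    HandOffToSrv2(packet);

    // 3. The networking side decides it is done with the packet and frees it,
    //    even though srv2 still holds a pointer into the block.
    ExFreePoolWithTag(packet, 'UAFx');

    // 4. srv2 later reads mdl1->MdlFlags inside the freed block, which Driver
    //    Verifier turns into PAGE_FAULT_IN_NONPAGED_AREA (0x50).
}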
Workarounds
As a potential workaround, turning off Fast I/O should prevent the code that is causing the problem from running.
DrivePool 2.0 doesn't yet contain a setting for this, but I'll add one in the next build. Turning on Network I/O Boost should also prevent the problem, because when it's enabled we do some extra processing on networked read requests that bypasses Fast I/O.
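Mechanically, disabling Fast I/O in a file system driver just means declining the fast path, which makes callers (including srv2) fall back to a regular IRP_MJ_READ. Here is a minimal sketch of such a switch, assuming a hypothetical gDisableFastIo setting; this is not the actual CoveFS code or setting name:
#include <wdm.h>

BOOLEAN gDisableFastIo = TRUE;   // hypothetical setting, not the real CoveFS name

// Declining a Fast I/O read: returning FALSE tells the caller that the fast
// path is not possible, so it builds a normal IRP_MJ_READ instead.
BOOLEAN
ExampleFastIoReadImpl(
    PFILE_OBJECT FileObject, PLARGE_INTEGER FileOffset, ULONG Length,
    BOOLEAN Wait, ULONG LockKey, PVOID Buffer,
    PIO_STATUS_BLOCK IoStatus, PDEVICE_OBJECT DeviceObject)
{
    if (gDisableFastIo) {
        return FALSE;   // "fast path not possible" -> caller falls back to an IRP
    }

    // ... the real handler would copy the data into Buffer, fill in IoStatus
    // and return TRUE; that part is elided from this sketch ...
    return FALSE;
}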
Connection to DrivePool
I'm still trying to find a connection to DrivePool in all of this, but so far I can't find one. I still can't reproduce this crash on any of the test servers here (4 of them running the Windows 8 kernel), nor can I reproduce it on any of my VMs (running under VirtualBox).
Fast I/O doesn't deal with MDLs at all, so DrivePool never sees any of the variables that I've talked about here. The Fast I/O code in CoveFS is fairly short and easy to check.
Because of the potential Hyper-V connection shown above, I'll try to test for this on a Hyper-V VM.
As far as what DrivePool was doing right before the crash, I can see from the in-memory logs that it has just completed a Fast I/O read request and has successfully read in 262,144 bytes.
Because I don't have a definitive reproducible case, I can't be 100% certain as to what the issue is. I'll keep digging and will let you guys know if I find anything else.