
Trying to Identify "Bad Drive" when there are no SMART warnings


pofo14

Question

First just wanted to start by thanking the community, great support and a lot of useful info in the forums.

 

I recently posted this topic because I was having some issues getting my "snapraid sync" command to complete and was asking about other options for my setup.  I got some good advice from you guys, as well as in the snapraid forum here.  It turns out one of my parity drives was bad.  Scanner wasn't reporting any issues with it, and even the "snapraid smart" command didn't show any errors.  It did say that the parity drive had a 100% chance of failing in the next year, which gave me enough of a clue.  I confirmed this drive was the culprit after running the sync command again: this specific disk (x:\) was stuck at 100% usage in Task Manager and the snapraid process was "hung".  I also checked the event log and found the following errors:

Event 129, storahci - Reset to Device \Device\RaidPort2, was issued. It appears every 30 seconds, from about the time I started the sync command (or it "hangs") until I restarted.


Event 140: NTFS - The system failed to flush the data to the transaction log. Corruption may occur in VolumeID: X:, Devicename: \Device\HarddiskVolume8


Event 153, disk - The IO Operation at logical block address 0x1 for Disk9 (PDO Name: \Device\00000036) was retried
Event 153, disk - The IO Operation at logical block address 0x11 for Disk9 (PDO Name: \Device\00000036) was retried
Event 153, disk - The IO Operation at logical block address 0x21 for Disk9 (PDO Name: \Device\00000036) was retried

I understand that drives can go bad without Scanner reporting an issue, as described in this forum here.  The errors from that post seem identical to one of the errors I have listed above.  Actually, I have been struggling with the 129 and 153 warnings for a while, as described in this post.  My main question is: how can I determine the actual physical drive that the 129 and 153 errors refer to?  Based on this experience I believe I have now identified it, but it was really Event 140 that made it obvious, as it listed the drive letter.  Without that I may not have known which drive was causing these problems.  Essentially, how can I determine the drive for \Device\RaidPort2 and \Device\00000036?  I am hoping that they are one and the same, and that they also point to the drive from Event 140 (x:\).
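
As a starting point for matching these warnings to a disk, here is a minimal Python sketch that pulls the disk number, LBA and PDO name out of Event 153 messages and groups them per disk. The regular expression is an assumption based only on the message wording quoted in this post, and the function names are made up for illustration. It does not resolve \Device\RaidPort2 to a physical drive; it only makes the disk numbers easier to compare against Disk Management.

```python
import re
from collections import defaultdict

# Pattern based on the Event 153 wording quoted in this thread (an assumption,
# not an official schema). It ignores whatever follows the PDO name.
EVENT_153 = re.compile(
    r"logical block address (?P<lba>0x[0-9A-Fa-f]+) for Disk\s*(?P<disk>\d+) "
    r"\(PDO Name: (?P<pdo>\\Device\\[0-9A-Fa-f]+)\)",
    re.IGNORECASE,
)


def parse_event_153(message):
    """Return disk number, LBA and PDO name from one message, or None."""
    match = EVENT_153.search(message)
    if match is None:
        return None
    return {
        "disk": int(match.group("disk")),
        "lba": int(match.group("lba"), 16),
        "pdo": match.group("pdo"),
    }


def group_by_disk(messages):
    """Collect the LBAs seen for each disk number, skipping unparsable lines."""
    grouped = defaultdict(list)
    for message in messages:
        parsed = parse_event_153(message)
        if parsed is not None:
            grouped[parsed["disk"]].append(parsed["lba"])
    return dict(grouped)


if __name__ == "__main__":
    sample = [
        r"The IO Operation at logical block address 0x1 for Disk9 (PDO Name: \Device\00000036)",
        r"The IO Operation at logical block address 0x11 for Disk9 (PDO Name: \Device\00000036)",
    ]
    print(group_by_disk(sample))
```

The disk number printed here is the one to look up in Disk Management; the PDO name is kept only so that repeated warnings can be tied to the same device object.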

 

For those interested, I got around the snapraid sync issue by removing the bad drive from my configuration and reverting to a single parity drive.  After doing that I ran a full sync, which seems to have completed successfully.  I have a new drive en route to replace the bad parity drive.

 

I am really asking because I want to verify that the three errors above all pertain to the drive I am replacing and not to another one currently in my PC.

 

Any help is much appreciated.  

 

 

EDIT-----

 

I got the brand new drive.  Put it in.  Configured this new drive as a parity drive, tried to run a sync command, and it froze.  In the Event Log I again see "Event 129, storahci - Reset to Device \Device\RaidPort2, was issued."  So this seems to be an issue when I connect a drive to that SATA port, or at least when a drive connected to that port is read from or written to.

 

Does this mean the problem may not be the drive, but perhaps the SATA port / controller on the motherboard?

 

Additionally, I still get the error below (although the disk changed from 9 to 12).  If I look in the Disk Management tool, Disk 12 is actually one of my pooled drives in DrivePool, meaning it is one of the virtual drives DrivePool creates.

Event 153, disk - The IO Operation at logical block address 0x1 for Disk12 (PDO Name: \Device\00000036) was retried

I also notice now:

Event 153, disk - The IO Operation at logical block address 0x1 for Disk13 (PDO Name: \Device\00000037) was retried.

Disk 13 in Disk Management is another virtual drive DrivePool creates.

The 153 warnings may not be related to the other issue, as they have nothing to do with snapraid, but I am not sure if those errors are masking an issue with an underlying disk.

 

Thanks in advance,

Ken


  • 0

To simplify this post:

 

What is the cause of these errors, which are happening on my virtual DrivePool Drives?  Do I need to worry about them?

 

Event 153, disk - The IO Operation at logical block address 0x1 for Disk12 (PDO Name: \Device\00000036) was retried

Event 153, disk - The IO Operation at logical block address 0x1 for Disk13 (PDO Name: \Device\00000037) was retried.

Is the error below, which I get when I run a snapraid sync as well as when Scanner checks the drive, a problem with the drive or with the connection?  It seems to happen regardless of what drive I attach to that specific cable / port.

Event 129, storahci - Reset to Device \Device\RaidPort2, was issued.


  • 0

I think it must be a DrivePool or possibly an LSI thing, as I am getting the same thing, i.e. 129 and 153 events.

For me the 153s are on the pool drive, same as you I think, and Chris indicated when I raised it a while back that it's not something to worry about. Some days I get lots of warnings, some days none. I updated the drivers, BIOS and firmware on my LSI card yesterday and only have one so far today, but 40 in the last 7 days.

As for 129: none so far today, so possibly that's cured(?). The only difference from you was iaStorA rather than storahci, possibly a SAS vs SATA difference. I have only had 5 in the last 7 days.

I used to get a lot of 153s when I had RocketRaid cards, which I have since replaced, but they had loads of other issues so I never bothered with these warnings.


  • 0

Good to hear the 153s are nothing to worry about.  Interestingly, I do have a RocketRaid card installed, although I don't think the errors are happening for the drives attached to that.  If someone could instruct me how to figure out the physical volume for \Device\RaidPort2, it would be helpful.

 

Regarding the 129 error, I haven't had it happen again today, but I think if I try to run anything against the disk I am suspecting I will see it happen.


  • 0

That's the thing. The drives on the RocketRaid (2680) are just general drives for me, and are just passed through, meaning I am not using any RAID on the card. The drives on the RocketRaid card are not in the snapraid array. They are, however, pooled together in DrivePool as one big drive for general / temporary data.


  • 0

To be blunt:

 

LBA retries and RAID port resets are entirely common.  These happen all the time, and may be perfectly normal.
Though the "flush logs" error isn't as common, it can still happen in certain circumstances.

 

However, if you're seeing a large number of these in short succession, ESPECIALLY when accompanied by disk and/or NTFS errors, then it may indicate an issue.
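
To put a rough number on "a large number in short succession", a minimal Python sketch could count how many events fall inside any one time window. The five-minute window and the sample timestamps are assumptions for illustration only; nothing here defines what count is actually worrying.

```python
from datetime import datetime, timedelta


def max_events_in_window(timestamps, window=timedelta(minutes=5)):
    """Return the largest number of events that fall within any one window."""
    times = sorted(timestamps)
    best = 0
    start = 0
    for end, current in enumerate(times):
        # Advance the left edge until the span fits inside the window.
        while current - times[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


if __name__ == "__main__":
    # Hypothetical data: one reset every 30 seconds, as described for Event 129.
    first = datetime(2024, 1, 1, 20, 0, 0)
    resets = [first + timedelta(seconds=30 * i) for i in range(20)]
    print(max_events_in_window(resets))
```

A handful of events spread over a week would give a count of 1 or 2 here, while a reset every 30 seconds fills the window, which is the pattern worth investigating.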

 

As for which drive this is, run "msinfo32" and check the "Components" section; this may help you to identify the disk in question.

Though, IIRC, "Disk9" would be the same "Disk 9" shown in Disk Management.

 

Also, running a "Burst test" in StableBit Scanner may reveal issues, as well. 

 

 

That said, does SnapRAID not indicate the specific drive, even in its logs?


  • 0

Christopher, I recently had to track down a drive with problems that was reported in the \Device\HarddiskVolume format. It took a lot of messing around in diskpart, and eventually I had to add drive letters to find the drive.  Can you add a column in Scanner that shows the label, so you don't have to allocate a drive letter to every drive, or show the \Device\HarddiskVolume info in a Scanner column? Currently it is not intuitive, and it is difficult to match a problem drive in the event log to a drive in Scanner.

 

I had messages like

 

Volume \\?\Volume{2f02bdf1-0fa7-4adf-9bfd-2b630bfc0e39} (\Device\HarddiskVolume6) requires an Online Scan.  An Online Scan will automatically run as part of the next scheduled maintenance task.  Alternatively you may run "CHKDSK /SCAN" locally via the command line, or run "REPAIR-VOLUME <drive:> -SCAN" locally or remotely via PowerShell.

 

It wasn't Disk 6: the HarddiskVolume number won't match the disk number, because there is often more than one volume per drive, as on your C:\ drive.
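
For volumes that do have a drive letter, a minimal Python sketch of the lookup could use the Win32 `QueryDosDeviceW` call through `ctypes` to list which \Device\HarddiskVolumeN each letter points at. The helper names are made up for illustration, the Windows-only part is untested here, and it will not find volumes without a drive letter, which is exactly the case that caused the trouble above.

```python
import string
import sys


def drive_letter_targets():
    """Map drive letters to NT device paths (Windows only, else empty)."""
    if sys.platform != "win32":
        return {}
    import ctypes

    kernel32 = ctypes.windll.kernel32
    buffer = ctypes.create_unicode_buffer(1024)
    targets = {}
    for letter in string.ascii_uppercase:
        # QueryDosDeviceW returns 0 when the letter is not in use.
        if kernel32.QueryDosDeviceW(f"{letter}:", buffer, len(buffer)):
            targets[f"{letter}:"] = buffer.value
    return targets


def letters_for_device(device_path, targets):
    """Return the drive letters whose target equals the given device path."""
    wanted = device_path.lower()
    return sorted(
        letter for letter, target in targets.items() if target.lower() == wanted
    )


if __name__ == "__main__":
    print(letters_for_device(r"\Device\HarddiskVolume6", drive_letter_targets()))
```

Keeping the lookup table separate from the matching function means the matching can be checked against a hand-written table on any machine.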

 

I have found it is best for drives in a pool not to have drive letters, to avoid inadvertently messing with DrivePool by mistake.

 

Maybe I am misunderstanding how to interpret the information and it is in Scanner. 


  • 0

Well, there is a "disk" ID. This is the same number that is used in DISKPART and disk management. 
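
Since that number is the one DISKPART uses, a minimal Python sketch could generate a small DISKPART script that selects each suspect disk and prints its details. The function name and the file name mentioned below are hypothetical; `select disk` and `detail disk` are standard DISKPART commands.

```python
def diskpart_detail_script(disk_numbers):
    """Build DISKPART script text that shows details for each disk number."""
    lines = []
    for number in disk_numbers:
        # int() rejects anything that is not a plain disk number.
        lines.append(f"select disk {int(number)}")
        lines.append("detail disk")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Disk numbers taken from the Event 153 messages in this thread.
    print(diskpart_detail_script([9, 12, 13]), end="")
```

Saving the output to a file such as details.txt and running `diskpart /s details.txt` from an elevated prompt should list each disk along with the volumes on it, which helps tie a disk number to a label or letter.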

 

As for adding the volume info... it could be done, I'm sure, but StableBit Scanner primarily deals with disks, not volumes.

 

 

That said, I highly recommend NOT using the online scan option for CHKDSK or the powershell commands.

The reason for this is that they actually skip a number of checks that would normally occur when performing an "offline scan".

 

For instance, they do not check the MFT, so a damaged/corrupt MFT will not be addressed by the online scan.

(FYI, the MFT is the file allocation table for NTFS, and stands for "Master File Table"). 

 

 

Also, it's worth noting that Windows has 4-5 different ways it identifies disks/volumes, and it's not entirely consistent.  


  • 0

Thanks for getting back.  All these steps were helpful, and I believe my issue was a bad SATA cable.  I ran the burst test on the new drive, connected to the old port with the old cable, and it reported errors.  After changing the cable, the burst test was successful.  I was also able to successfully complete a sync command in snapraid.

 

This has also removed most of the errors from the event log, with the exception of the following.  But as you stated, this may not be an issue.  Interestingly, though, they consistently appear at 12:30, so I am not sure whether something firing off in Scanner or DrivePool causes them, or whether another job/process is responsible.

 

Event 153, disk - The IO Operation at logical block address 0x1 for Disk12 (PDO Name: \Device\00000036) was retried

Event 153, disk - The IO Operation at logical block address 0x1 for Disk13 (PDO Name: \Device\00000037) was retried.
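
One way to check a hunch like the 12:30 pattern is to tally event timestamps by time of day. Here is a minimal Python sketch; the function name and the sample dates are made up for illustration, and only the 12:30 time comes from what I saw in the event log.

```python
from collections import Counter
from datetime import datetime


def count_by_time_of_day(timestamps):
    """Count events per HH:MM so a daily schedule stands out."""
    return Counter(stamp.strftime("%H:%M") for stamp in timestamps)


if __name__ == "__main__":
    # Hypothetical timestamps: three days of 12:30 warnings plus one stray.
    events = [
        datetime(2024, 1, 1, 12, 30),
        datetime(2024, 1, 2, 12, 30),
        datetime(2024, 1, 3, 12, 30),
        datetime(2024, 1, 3, 3, 15),
    ]
    for time_of_day, count in count_by_time_of_day(events).most_common():
        print(time_of_day, count)
```

If one time of day dominates the tally, the next step is to look for a scheduled task, backup or wake timer set for that time.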

 

Presumably the old drive may have been fine, and I probably could / should hook it back up to validate it was the cable, but since everything is working I am not going to mess with it.  I'll just say this all was a good opportunity to get a new drive.  

 

Thanks again for the help.  


  • 0

Well, I'm glad to hear that StableBit Scanner helped you to identify the bad SATA cable (I've experienced this and weirder, as well).

 

As for the LBA errors, I'm guessing that the drives were asleep/idle and being woken up (LBA sector 1... the first sector on the disk...).   But I'm not 100% sure.

 

And I'm glad that everything appears to be working fine now!


  • 0

Your guess seems correct, as usual.

 

I knew something was running at that time, and realized that I have an Acronis daily backup that runs at 12:30, exactly the time the errors occur.  While Acronis doesn't back up the DrivePool drives or any of the "data" drives (it only backs up the system drive), it does wake the computer, which I assume is what triggers the error.

 

At this point it's no harm no foul.  
