Covecube Inc.
  • 0
IMuijtjens

High Interface CRC Error Count

Question

Hello,

 

Today I noticed that the S.M.A.R.T. values of the hard drives in my storage server are reporting a huge count of Interface CRC Errors. All disks are showing these errors. I'm using ESXi on my home server with 2 VMs (1 web server and 1 storage server). 7 hard disks (8TB) are connected to my LSI SAS 9211-8i controller, which is flashed in IT mode (JBOD). I configured pass-through and linked the controller to my storage VM. The error count continues to increase, but the performance of the disks stays OK. I did a copy from my data pool to my backup pool: 100GB of data copied in 13 minutes. After the copy completed, the error count had increased on all the disks.
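As a rough sanity check on the "performance stays OK" remark, the copy figures quoted above (100GB in 13 minutes) can be turned into an average throughput. This is only a sketch: it assumes decimal units (1 GB = 1000 MB), and the function name is made up for illustration.

```python
def throughput_mb_per_s(gigabytes: float, minutes: float) -> float:
    """Average throughput in MB/s, assuming decimal units (1 GB = 1000 MB)."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return gigabytes * 1000 / (minutes * 60)


# 100 GB copied in 13 minutes, as reported in the post.
print(f"{throughput_mb_per_s(100, 13):.1f} MB/s")
```

That works out to roughly 128 MB/s averaged over the whole copy.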

 

I'm wondering what is causing these errors. All the disks in my storage VM are showing them. The hard drive that is connected to my web server (RDM) is not showing these errors. That disk is connected to a SATA port on my motherboard. Could it be the pass-through configuration that is causing the issue?

HD Tune Pro: ATA     WDC WD80EFZX-68U Health

ID                              Current  Worst    Threshold  Data          Status
(01) Raw Read Error Rate        100      100      16         0             ok
(02) Throughput Performance     132      132      54         112           ok
(03) Spin Up Time               145      145      24         38684393924   ok
(04) Start/Stop Count           100      100      0          225           ok
(05) Reallocated Sector Count   100      100      5          0             ok
(07) Seek Error Rate            100      100      67         0             ok
(08) Seek Time Performance      128      128      20         18            ok
(09) Power On Hours Count       100      100      0          1270          ok
(0A) Spin Retry Count           100      100      60         0             ok
(0C) Power Cycle Count          100      100      0          220           ok
(16) Unknown Attribute          100      100      25         100           ok
(C0) Unsafe Shutdown Count      100      100      0          275           ok
(C1) Load Cycle Count           100      100      0          275           ok
(C2) Temperature                176      176      0          214749478946  ok
(C4) Reallocated Event Count    100      100      0          0             ok
(C5) Current Pending Sector     100      100      0          0             ok
(C6) Offline Uncorrectable      100      100      0          0             ok
(C7) Interface CRC Error Count  200      200      0          9163          attention

Health Status         : ok
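For anyone who wants to track the C7 value over time rather than eyeballing it, here is a minimal sketch that pulls the raw Interface CRC Error Count out of a text report laid out like the HD Tune Pro output above. The row layout (ID in parentheses, name, four numeric columns, status) is assumed from this one report only, and the names `SmartRow`, `parse_row` and `crc_error_count` are hypothetical helpers, not part of any tool.

```python
import re
from typing import NamedTuple, Optional


class SmartRow(NamedTuple):
    attr_id: str
    name: str
    current: int
    worst: int
    threshold: int
    raw: int
    status: str


# Assumed layout: "(C7) Name  current  worst  threshold  raw  status"
ROW_RE = re.compile(
    r"^\((?P<id>[0-9A-Fa-f]{2})\)\s+(?P<name>.+?)\s+"
    r"(?P<current>\d+)\s+(?P<worst>\d+)\s+(?P<threshold>\d+)\s+"
    r"(?P<raw>\d+)\s+(?P<status>\S+)\s*$"
)


def parse_row(line: str) -> Optional[SmartRow]:
    """Parse one attribute row; return None for headers and other lines."""
    m = ROW_RE.match(line.strip())
    if m is None:
        return None
    return SmartRow(
        m["id"].upper(), m["name"], int(m["current"]), int(m["worst"]),
        int(m["threshold"]), int(m["raw"]), m["status"],
    )


def crc_error_count(report: str) -> Optional[int]:
    """Return the raw value of attribute C7, or None if it is not listed."""
    for line in report.splitlines():
        row = parse_row(line)
        if row is not None and row.attr_id == "C7":
            return row.raw
    return None


sample = """\
ID                              Current  Worst    Threshold Data          Status
(C5) Current Pending Sector     100      100      0        0             ok
(C7) Interface CRC Error Count  200      200      0        9163          attention
"""
print(crc_error_count(sample))
```

The sample rows are copied from the report in this post; other tools format their output differently, so the regular expression would need adjusting for them.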

Setup:

 

post-7337-0-19945200-1502637380_thumb.png


19 answers to this question


  • 0

In a non-VM environment, those errors mean something is up with the cable or controller, or with the hard disk's own controller or SATA connector.

As it's all your disks, it could be the controller. That would also explain why your other drive is not reporting errors: it's on a different controller.

I haven't seen this with VMs before.

If you connect one of the affected disks to the motherboard SATA controller, do the errors stop?

  • 0

After a lot of research and sweating I finally figured out the issue. It was driving me nuts.

It looks like it was the firmware version on the LSI SAS 9211-8i controller. The controller was flashed with version P20. This firmware version can cause a lot of trouble, such as drives falling out of the RAID, CRC errors, etc. I flashed the controller back to firmware version P19, reconnected my drives, and tried some file copies on the drive pool. None of the drives increased its error count. It's a shame the errors will always stay visible in the S.M.A.R.T. details, but I'm glad this fixed the issue.
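Since the C7 raw value never resets, the useful check after a firmware change is whether it still goes up during a copy, not what it currently reads. A minimal sketch of that before/after comparison follows. The disk labels and every count except 9163 are made-up illustration values, and `crc_increases` is a hypothetical helper.

```python
from typing import Dict


def crc_increases(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return {disk: increase} for disks whose CRC raw count went up."""
    return {
        disk: after[disk] - count
        for disk, count in before.items()
        if disk in after and after[disk] > count
    }


# Snapshots taken before and after a test copy (illustrative values only).
before = {"disk1": 9163, "disk2": 120}
after = {"disk1": 9163, "disk2": 135}

print(crc_increases(before, after))
```

An empty result after a large copy is the "none of the drives increased" outcome described above.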

  • 0

Interesting. I will have to check which firmware my 9211 is on. I thought it was P20 but might be wrong, although fortunately I do not have any of the CRC errors.

 

Are you using 4TB disks? I saw someone who was using 4TB Deskstar disks with no issues running the P20 firmware, but he had issues with a 6TB Deskstar NAS disk. Perhaps it depends on the brand or size of the disk. I'm using 5x WD Gold 8TB and 2x WD Red 8TB, and both models are failing under firmware P20.

  • 0

Sorry for the thread necro, but can anyone confirm if this issue was fixed in later revisions of the P20 firmware?

I've just bought a 9211-8i off the bay, and the first thing I did was flash it to IT mode using the latest firmware on Broadcom's site (P20 rev. 20.00.07.00), but after reading around the interwebs it seems that CRC errors are a common theme at least with the P20 initial release.

I haven't put the card into service yet, and I'd rather not learn the hard way that the problem is ongoing... I suppose I could just play it safe and drop back to P19 or earlier, but not having the newest and shiniest firmware would make me sad. More importantly, I hate flashing firmware at any time and I'd rather avoid doing it again if at all possible.

Thanks in advance for any advice.

 

  • 0

That's good to know, thanks for the comment Christopher.  I haven't put my new adapter into service yet, and had it sitting at rev 20.  I'll grab a copy of 19 and see about re-flashing.

  • 0
15 minutes ago, Jaga said:

That's good to know, thanks for the comment Christopher.  I haven't put my new adapter into service yet, and had it sitting at rev 20.  I'll grab a copy of 19 and see about re-flashing.

If that's the 16i card, then it may be different, as it's using a slightly different chip, IIRC.

  • 0

Nope, mine's the 9201-16e.  I'm doing a little internet research to see if it's actually necessary to down-flash from 20 to 19 before I do it.  The MegaRAID software under Windows can be a little hard to work with.  But I'd much rather know before I put it into service if it's going to be spitting out CRC errors left and right, so I'm treating it with deserved attention.  If you have any recommendations, I'm totally open.  It won't go into service until tomorrow night (fingers crossed).

  • 0

16i, 16e, "close enough". :D  
The big part being the number of ports. 

And checking the page: https://www.broadcom.com/products/storage/host-bus-adapters/sas-9202-16e#specifications

Yeah, it's the same controller (LSI SAS 2008).

So it may be a good idea to use the P19 firmware then.

  • 0

Roger.  Already grabbed it, just have to find the least troublesome way to downgrade the firmware.  Thanks for the feedback and the heads up.  Would have driven me nuts!  :D

Edit:   done.   You wouldn't believe what I had to go through to revert back to P19 from P20.  Motherboard's built-in UEFI shell, hard-to-find sas2flash.efi (Broadcom doesn't offer it on their product support pages anymore), flash the firmware AND the BIOS back to a paired state.

All good now however, feel better about it.  I'm a little surprised honestly that with CRC problems for certain cards, Broadcom is even allowing download of P20 firmware.   

  • 0

Thank you Christopher, P19 it is then.

As above, I'm a bit surprised that the obviously buggy P20 is still available, especially given that these cards and their variants will be used in enterprise settings.

  • 0

In my quest to get passthrough SMART data, I found out some new information on the P20 CRC errors issue:

Turns out the original release of P20 (v20.00.00.00) was the buggy one.  The latest version available on their support website (v20.00.07.00) doesn't have the CRC error problem anymore.  I've verified this by re-checking SMART data after 8+ TB of writes and reads during migration.  I can also confirm that there is no bit corruption going on, as I was using FastCopy with verification turned on, and not once did it report errors.
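To summarise what this thread reports per firmware version, here is a small sketch that maps a dotted version string to the verdict given here. It encodes only the reports in this thread (20.00.00.00 bad, 20.00.07.00 fine, P19 fine). Treating any version whose first field is 19 as "P19" is an assumption, and revisions in between are deliberately left as unknown. The function names are hypothetical.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted string such as '20.00.07.00' into (20, 0, 7, 0)."""
    return tuple(int(part) for part in text.strip().lstrip("vV").split("."))


def thread_verdict(version: str) -> str:
    """Verdict according to the reports in this thread only."""
    v = parse_version(version)
    if v == (20, 0, 0, 0):
        return "reported buggy (CRC errors)"
    if v == (20, 0, 7, 0):
        return "reported fine"
    if v[0] == 19:  # assumption: first field 19 corresponds to "P19"
        return "reported fine"
    return "not covered in this thread"


for fw in ("20.00.00.00", "v20.00.07.00", "20.00.04.00"):
    print(fw, "->", thread_verdict(fw))
```

The exact-match checks are intentional: nobody in the thread tested the revisions between the initial P20 release and 20.00.07.00, so the sketch does not guess at them.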

