Trouble finding bad block on WDC WD1600SB-01KBA0

Trouble finding bad block on WDC WD1600SB-01KBA0

mathog
Hi,

One system has a WDC WD1600SB-01KBA0 which shows

197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       1

Sadly, it does not list the block number in any of the test results
(or the system log files).

Tried these steps to find the pending sector...

   reboot (to clear cache)
   # log in once it came back up
   dd if=/dev/sda of=/dev/null bs=512

which completed without error.  Then tried

    smartctl -t long /dev/sda

and that also completed without error.

However "smartctl -a " still shows a pending sector.

Is there some other trick to find the thing?
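
For scripting a check like this, a minimal sketch of pulling one
attribute's raw value out of "smartctl -A" output follows.  The helper
name attr_raw is made up for illustration (it is not part of
smartmontools), and it assumes the usual ten-column attribute table with
the raw value in the tenth column.

```shell
# attr_raw ID: read "smartctl -A" output on stdin and print the raw value
# (10th column) of the attribute whose ID is given as the first argument.
# Hypothetical helper, not part of smartmontools.
attr_raw() {
    awk -v id="$1" '$1 == id && NF >= 10 { print $10; exit }'
}
```

Possible usage, as root:

    smartctl -A /dev/sda | attr_raw 197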

Thanks,

David Mathog
[hidden email]
Manager, Sequence Analysis Facility, Biology Division, Caltech

------------------------------------------------------------------------------
Developer Access Program for Intel Xeon Phi Processors
Access to Intel Xeon Phi processor-based developer platforms.
With one year of Intel Parallel Studio XE.
Training and support from Colfax.
Order your platform today. http://sdm.link/xeonphi
_______________________________________________
Smartmontools-support mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/smartmontools-support

Re: Trouble finding bad block on WDC WD1600SB-01KBA0

Carlos E. R.
On 2017-01-10 02:12, mathog wrote:

> Hi,
>
> One system has a WDC WD1600SB-01KBA0 which shows
>
> 197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       1
> 198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
> 199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       1
>
> Sadly, it does not list the block number in any of the test results
> (or the system log files).
>
> Tried these steps to find the pending sector...
>
>    reboot (to clear cache)
There is a way to clear it without a reboot. Let me see... [...]

> To free pagecache: echo 1 > /proc/sys/vm/drop_caches
> To free dentries and inodes: echo 2 > /proc/sys/vm/drop_caches
> To free pagecache, dentries and inodes: echo 3 > /proc/sys/vm/drop_caches

Issue "sync" first, though: drop_caches does not free dirty pages.

> /sbin/sysctl -q -w vm.drop_caches=3
> using /sbin/sysctl is equivalent to the "echo >/proc/sys/..." line
> above
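
Putting those pieces together, here is a sketch of a cache flush
followed by a whole-disk read that bypasses the page cache.  The
function name flush_and_read and the RUN variable are invented for this
example, and iflag=direct assumes GNU dd.  By default it only prints the
commands it would run.

```shell
# flush_and_read DEVICE: print (dry run) or execute the steps that flush the
# page cache and then read the whole device while bypassing the cache.
# Set RUN=1 to execute for real; that needs root.
flush_and_read() {
    dev=$1
    # run: execute the command when RUN=1, otherwise just print it.
    run() { if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi; }
    run sync                              # write dirty pages out first
    run sysctl -q -w vm.drop_caches=3     # drop pagecache, dentries, inodes
    run dd if="$dev" of=/dev/null bs=1M iflag=direct   # O_DIRECT read
}
```

Possible usage:

    flush_and_read /dev/sda          # show the commands
    RUN=1 flush_and_read /dev/sda    # run them, as root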


>    # log in once it came back up
>    dd if=/dev/sda of=/dev/null bs=512
>
> which completed without error.  Then tried
>
>     smartctl -t long /dev/sda
>
> and that also completed without error.
>
> However "smartctl -a " still shows a pending sector.
The same thing happened to me recently.

> Is there some other trick to find the thing?


I ran "badblocks" with the intention of locating them, and they
disappeared...


--
Cheers / Saludos,

                Carlos E. R.
                (from 42.2 x86_64 "Malachite" at Telcontar)



Re: Trouble finding bad block on WDC WD1600SB-01KBA0

mathog
On 10-Jan-2017 02:07, Carlos E. R. wrote:

> On 2017-01-10 02:12, mathog wrote:
>> Hi,
>>
>> One system has a WDC WD1600SB-01KBA0 which shows
>>
>> 197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       1
>> 198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
>> 199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       1
>
>> Is there some other trick to find the thing?
>
> I ran "badblocks" with the intention of locating them, and they
> disappeared...

Rebooted that node into PLD Rescue CD over the network, ssh'd into it
and ran

    badblocks -nvs /dev/sda >/tmp/bb.log 2>&1 &

Somewhere along the line the pending sector cleared, but there was no
message giving the block number, and it said there were no errors.  The
UDMA_CRC_ERROR_COUNT is still 1.

So that worked.

Now, for the next time, is there a command one can use
while the OS is running and the disk mounted that can do something
similar?  badblocks -n isn't happy running on mounted disks, and that
badblocks command took ~4.5 hours.
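
One read-only option that is safe on a mounted disk is to read the
device in chunks, so that any read failure can at least be narrowed to a
region.  A minimal sketch, assuming a dd that accepts bs=1M; scan_chunks
and DD_FLAGS are names invented here.  The full-disk dd read above
completed without error, so a scan like this may well find nothing in a
case like this one, and unlike badblocks -n it never rewrites anything.

```shell
# scan_chunks DEVICE CHUNK_MIB TOTAL_CHUNKS: read DEVICE one chunk at a time
# and print the index of every chunk that dd fails to read. Read-only, so it
# can run against a mounted disk. Set DD_FLAGS=iflag=direct (GNU dd) to
# bypass the page cache on a real block device.
scan_chunks() {
    dev=$1 mib=$2 total=$3 i=0
    while [ "$i" -lt "$total" ]; do
        # DD_FLAGS is intentionally unquoted so an empty value vanishes.
        if ! dd if="$dev" of=/dev/null bs=1M skip=$((i * mib)) count="$mib" \
                ${DD_FLAGS:-} 2>/dev/null; then
            echo "read error in chunk $i"
        fi
        i=$((i + 1))
    done
}
```

Possible usage, scanning the first 100 GiB in 1 GiB chunks:

    DD_FLAGS=iflag=direct scan_chunks /dev/sda 1024 100

If the drive supports selective self-tests, "smartctl -t select,START-END
/dev/sda" can likewise test an LBA range while the disk stays mounted.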

Thanks,

David Mathog
[hidden email]
Manager, Sequence Analysis Facility, Biology Division, Caltech


Re: Trouble finding bad block on WDC WD1600SB-01KBA0

Carlos E. R.
On 2017-01-11 02:15, mathog wrote:
> On 10-Jan-2017 02:07, Carlos E. R. wrote:

>> I ran "badblocks" with the intention of locating them, and they
>> disappeared...
>
> Rebooted that node into PLD Rescue CD over the network, ssh'd into it
> and ran
>
>     badblocks -nvs /dev/sda >/tmp/bb.log 2>&1 &
>
> Somewhere along the line the pending sector cleared, but there was no
> message giving the block number, and it said there were no errors.  The
> UDMA_CRC_ERROR_COUNT is still 1.
Yes, same thing here. I don't remember what value that parameter had.


> So that worked.
>
> Now, for the next time, is there a command one can use
> while the OS is running and the disk mounted that can do something
> similar?

Previously I figured it out from the point at which the long test stopped.
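
When a self-test does stop on a bad sector, the position shows up in the
LBA_of_first_error column of the self-test log.  A small sketch for
extracting it, assuming the usual "smartctl -l selftest" table layout;
first_error_lbas is a made-up helper name.  For a test that completed
without error, as in this thread, the column holds "-" and nothing is
printed.

```shell
# first_error_lbas: read "smartctl -l selftest" output on stdin and print the
# LBA_of_first_error column for every logged test that ended in a read
# failure. Hypothetical helper, not part of smartmontools.
first_error_lbas() {
    awk '/^# *[0-9]+ / && /read failure/ && $NF != "-" { print $NF }'
}
```

Possible usage, as root:

    smartctl -l selftest /dev/sda | first_error_lbas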

> badblocks -n isn't happy running on mounted disks, and that badblocks
> command took ~4.5 hours.

Yes, it runs for a very long time. I'm unsure whether my disk was
mounted or not.

--
Cheers / Saludos,

                Carlos E. R.
                (from 42.2 x86_64 "Malachite" at Telcontar)



Re: Trouble finding bad block on WDC WD1600SB-01KBA0

Bruce Allen-2
In reply to this post by mathog
David, I think UDMA_CRC_ERROR_COUNT refers to a CRC error on the data bus.  If that is right then it can be safely ignored; if it recurs, I would try cleaning and replugging the data connections to the drive.  Cheers, Bruce



--------------------------------------------------------------------
Bruce Allen, Adjunct Professor of Physics
Leonard E. Parker Center for Gravitation, Cosmology and Astrophysics
Physics Department
University of Wisconsin - Milwaukee
3135 N Maryland Ave
Milwaukee, 53211 USA
Tel: +1 414-229-4474
Fax: +1 414-229-5589
[hidden email]




Re: Trouble finding bad block on WDC WD1600SB-01KBA0

Robert Spotswood
I've also seen the UDMA_CRC_ERROR_COUNT caused by a bad SATA cable. With
only 1, I wouldn't sweat it. The case I remember, the count was in the
hundreds. It stopped climbing after I switched out the cable. The existing
one was noticeably frayed.
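
To tell a one-off from a climbing count, two saved "smartctl -A" outputs
can be compared.  A rough sketch, where crc_delta and the file names are
invented for this example and the raw value is assumed to be in the
tenth column:

```shell
# crc_delta OLD_FILE NEW_FILE: compare two saved "smartctl -A" outputs and
# print how much the raw value of attribute 199 (UDMA_CRC_Error_Count) grew.
# Hypothetical helper; a missing attribute is treated as 0.
crc_delta() {
    old=$(awk '$1 == 199 { print $10; exit }' "$1")
    new=$(awk '$1 == 199 { print $10; exit }' "$2")
    echo $(( ${new:-0} - ${old:-0} ))
}
```

Possible usage:

    smartctl -A /dev/sda > /tmp/smart.old
    # ... some days later ...
    smartctl -A /dev/sda > /tmp/smart.new
    crc_delta /tmp/smart.old /tmp/smart.new

A result that stays at 0 suggests a single event; a growing number points
back at the cable or connectors.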

> David, I think the UDMA_CRC_ERROR_COUNT is referring to a crc error on the
> data bus.  If that is right then it can be safely ignored; if it is
> recurring I would try and clean and replug the data connections to the
> drive.  Cheers, Bruce

Re: Trouble finding bad block on WDC WD1600SB-01KBA0

mathog
On 11-Jan-2017 07:25, [hidden email] wrote:
> I've also seen the UDMA_CRC_ERROR_COUNT caused by a bad SATA cable.

OK, I will make a note to look at the cable if this unit ever has
problems like this again.  (It could have been a gamma ray or something
hitting a gate, right?)

Now I'm trying to understand what happened here.  My best guess is that
it went something like this (leaving out a few steps):

1.  Some issue with the cable, connectors, radiation, etc. arose.
2.  A write to a specific block, presumably with new data, ran
     into (1) and failed.
3.  ??? The disk shuffled that data off to a temporary location
     (spare physical block, flash, or ?) and set the pending sector
     count and UDMA_CRC_ERROR_COUNT.
4.  A read of the entire disk found no errors because the disk
     retrieved either the [??? old or new] contents without problems.
5.  The system was powered down for several minutes and started back
     up.  The pending sector count and UDMA_CRC_ERROR_COUNT were still
     set.  Presumably this means the pending data was stored in a
     nonvolatile location.
6.  badblocks -nvs read the bad block's [??? old or new] data and then
     wrote it back to disk.  It saw no errors while doing so because
     (1) was no longer a problem.  This time the write succeeded
     and the pending sector count was reset to 0.  The reallocated
     sector count stayed 0.  Either the disk didn't reallocate the
     block, or it did and didn't increment the counter.

So the question is - _which_ data is in that iffy block now?  Is it the
data which caused the failed write in the first place, or whatever was
there before the write?  Hopefully it is the former!

Thanks,

David Mathog
[hidden email]
Manager, Sequence Analysis Facility, Biology Division, Caltech


Re: Trouble finding bad block on WDC WD1600SB-01KBA0

L.A. Walsh
In reply to this post by mathog
mathog wrote:

> Hi,
>
> One system has a WDC WD1600SB-01KBA0 which shows
>
> 197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always
> 198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always
> 199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always
>
> Sadly, it does not list the block number in any of the test results
> (or the system log files).
>
> Is there some other trick to find the thing?
>  
----
    Most of the time you can't find the exact sector, but it's an
indication that the disk may have had to move the data to a backup sector.

    Modern hard disks usually have 'tracks' of spare sectors that they can
reallocate (up to and including reallocating entire tracks) when they start
to detect weak and/or unreliable signals on _READ_.  Such reallocations are
a sign that the disk is nearing the end of its useful life.

    The SMART diagnostics are not intended to be exact diagnostics but an
_Early_Warning_ system -- meaning that you had better move that data off to
a safer location.

    Before the advent of the SMART diags, you could often hear a disk going
bad, as what were supposed to be sequential, linear reads weren't.  You
could hear the excess seeking as the disk had to seek over to the
replacement sectors and back again.

    Really -- you should be ready to replace this disk "soon" (as soon
as you can) and use any remaining life in it to make sure everything on
it is backed up.







Re: Trouble finding bad block on WDC WD1600SB-01KBA0

Carlos E. R.
In reply to this post by mathog
On 2017-01-11 18:42, mathog wrote:

> On 11-Jan-2017 07:25, robert@ wrote:
>> I've also seen the UDMA_CRC_ERROR_COUNT caused by a bad SATA cable.
>
> OK, I will make a note to look at the cable if this unit ever has
> problems like this again.  (It could have been a gamma ray or something
> hitting a gate, right?)
>
> Now I'm trying to understand what happened here.  My best guess is that
> it went something like this (leaving out a few steps):
>
> 1.  some issue with cable, connectors, radiation etc. arose.
> 2.  a write to a specific block, presumably with new data, ran
>      into (1) and failed.
If this happens during a write, the sector is reallocated. If it happens
during a read, reallocation is postponed and the sector is noted. I don't
know whether the disk keeps a list, or how one would read it.

Reallocation happens during an attempted write, to another permanent
location.

I don't know what happened during the badblocks run, because it is a read
operation.


--
Cheers / Saludos,

                Carlos E. R.
                (from 42.2 x86_64 "Malachite" at Telcontar)



Re: Trouble finding bad block on WDC WD1600SB-01KBA0

mathog
On 11-Jan-2017 12:58, Carlos E. R. wrote:
> If this happens during a write, the sector is reallocated. If it
> happens during a read, reallocation is postponed and the sector noted.
> I don't know if it creates a list and how to read that list.

Subsequent reads of the whole disk did not log any errors, nor did
they clear the "current pending sector" count.  That's the odd part -
the disk had somewhere stored "there is some problem with block N" and
incremented the pending sector count, but it seems that several reads
from block N (wherever that was) which completed without error were not
enough to change its mind.

It seems like a major shortcoming in the SMART protocol that there is no
"list the pending sectors" command.  The disk must have this
information, otherwise we cannot explain the way it behaved in this
case.

>
> Reallocation happens during an attempted write, to another permanent
> location.

Agreed.

>
> I don't know what happened during the badblock run, because it is a
> read operation.

With -nvs there is also a write after the read.  It presumably read the
iffy block successfully (for about the 6th time) and when it wrote it
back the flag finally cleared.  It may or may not have been reallocated,
but if it was, the counter did not increment.  It seems about as likely
that the disk just cleared the flag.  Perhaps the firmware at that point
did a couple of read/write tests on its own and decided all was now OK.  
We can't really know what the disk does "underneath" the level we
interact with.
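
The read-then-write-back step can be sketched for a single sector as
follows.  This is only an approximation of what badblocks -n does per
block (badblocks also writes and verifies test patterns before restoring
the original data); rewrite_sector is a made-up name, conv=fsync assumes
GNU dd, and because it writes to the device it should not be pointed at
a mounted disk.

```shell
# rewrite_sector DEVICE LBA [SECTOR_BYTES]: read one sector and write the same
# bytes back to the same offset. WRITES to DEVICE: not for mounted disks.
# Hypothetical helper; SECTOR_BYTES defaults to 512.
rewrite_sector() {
    dev=$1 lba=$2 bs=${3:-512}
    tmp=$(mktemp) || return 1
    # Read the sector into a temp file, then write it back in place.
    # conv=notrunc keeps dd from truncating the target; fsync flushes it.
    dd if="$dev" of="$tmp" bs="$bs" skip="$lba" count=1 2>/dev/null &&
    dd if="$tmp" of="$dev" bs="$bs" seek="$lba" count=1 \
        conv=notrunc,fsync 2>/dev/null
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Whether such a rewrite clears the pending count, or triggers a
reallocation, is up to the firmware.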

Regards,

David Mathog
[hidden email]
Manager, Sequence Analysis Facility, Biology Division, Caltech


Re: Trouble finding bad block on WDC WD1600SB-01KBA0

Dan Lukes
In reply to this post by Carlos E. R.
On 11.1.2017 21:58, Carlos E. R. wrote:
> If this happens during a write, the sector is reallocated.

> Reallocation happens during an attempted write, to another permanent
> location.

Note that the reallocation may not occur if the write request doesn't cover
the entire physical sector (this can happen on an "advanced format" disk).
An error may simply be returned instead.

This behavior has been observed on a WDC disk (but I don't remember the
exact model and firmware version).

So the physical sector size and location need to be taken into consideration.
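
As a worked example of that arithmetic: with 512-byte logical sectors
and 4096-byte physical sectors, a write has to cover 8 consecutive
logical sectors starting on a multiple of 8.  The sketch below computes
that span; aligned_span is an invented name, and it assumes the drive
has no alignment offset.

```shell
# aligned_span LBA LOGICAL_BYTES PHYSICAL_BYTES: print the first logical LBA
# and the number of logical sectors of the physical sector containing LBA.
# Assumes an alignment offset of 0. Hypothetical helper.
aligned_span() {
    lba=$1 per=$(( $3 / $2 ))
    echo "$(( lba - lba % per )) $per"
}
```

On Linux the two sizes can be read from
/sys/block/sda/queue/logical_block_size and
/sys/block/sda/queue/physical_block_size.  For instance,
"aligned_span 1003 512 4096" prints "1000 8".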

Dan



Re: Trouble finding bad block on WDC WD1600SB-01KBA0

Bruce Allen
In reply to this post by mathog
Hi David,

> It seems like a major shortcoming in the SMART protocol that there is no "list the pending sectors" command.  The disk must have this
> information, otherwise we cannot explain the way it behaved in this  case.

I agree.

The truth is that the entire SMART protocol is something of a hack.  It was first implemented by a couple of vendors, then turned into an SFF "specification" which was subsequently actively withdrawn (meaning: the industry did its best to destroy every copy of the document in existence).  Then VERY limited parts of that were included in the ATA specification, which were then gradually morphed into something with a different intent (on and off-line testing, rather than monitoring and failure prediction).  All in all, SMART is useful, but it's also very flawed.

My personal hope is that over the coming ten years, the SSD will replace the HDD, and the devices and algorithms that underlie the SSD will become reliable enough that almost all of the SMART protocol and features become irrelevant and fade away.  Time will tell.

Cheers,
        Bruce

--------------------------------------------------------------------------
Prof. Dr. Bruce Allen, Director
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Callinstrasse 38
D-30167 Hannover,  Germany
Tel +49-511-762-17145
Fax +49-511-762-17182
Email: [hidden email]


