SAS preventive disk replacement


SAS preventive disk replacement

Gandalf Corvotempesta

I'm asking this again because nobody replied when I posted it a month ago.

On SAS disks, which values should I look at when deciding to replace a disk preventively?

Should I look at "Elements in grown defect list"?
Should I look at the uncorrected errors in the error counter table that reports reads/writes/verifies?

Should I look for something else in the "-x" output?

Can someone explain this to me?
The docs on the smartmontools page are not detailed about SAS.


------------------------------------------------------------------------------

_______________________________________________
Smartmontools-support mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/smartmontools-support

Re: SAS preventive disk replacement

L.A. Walsh
Gandalf Corvotempesta wrote:
>
> I'm still here asking the following question because a month ago nobody
> replied
>
> On SAS disks, which values should I look for when preventively
> replacing a disk?
>
----
    Do SAS disks have the same or similar fields for 'smartmon'
as ATA/SATA disks?  FWIW -- nobody may have responded because,
like me, no one really knew an authoritative answer.  That said,
I'll just spout off "unauthoritatively"... (YMMV)...

> Should I look for "elements in grown defect list"?
> Should I look for the uncorrected errors in the below table reporting
> writes/reads/verifies?
>
> Should I look for something else in the "-x" output?
>
----
    Dang... that's one thing about smartmon: for better or worse,
it makes the "call" based on its recorded data.  If SAS doesn't
have similar fields, someone would have to know how the various
parameters collected affect failure rate.

    I think I read a report by Google that said the single biggest
correlating factor in failed disks was temperature -- though I don't
know if it was 'max temperature' or 'daily-max-averaged' or what...

> Can someone explain this to me?
> Docs on smartmon page is not detailed about sas
>
---
    smartmon was originally invented for consumer ATA disks, which
eventually became SATA disks.  I don't know whether the regulating
committees for SAS disks did the same, or adopted the same numbers
and failure guidelines as for SATA disks.

    That's all from 10-15+ year-old memory, as I don't do a lot of
smartmon monitoring: my disks are behind a RAID controller that will
kick a disk out as "bad" if it gets out of sync with the rest of the
disks due to sector remapping.  I.e., pretty much the first time a
sector gets remapped and causes a slowdown, if it exceeds some
threshold in the LSI controller, the controller will just mark the
disk as bad.

    Before that, any time I noticed or "heard" a disk (before SSDs)
doing a "retry", I scheduled it for replacement.

    I had an interesting data point when I accidentally got a
load of consumer-grade disks instead of enterprise ones and decided to
try them anyway.  Out of 24 disks, only 3 were marked good.  It wasn't
because of remapping, but because of the disks' RPMs: they varied by
10-15% from slowest to fastest.

    When I got the replacements, they were all right on 7200 RPM.

    Good luck in finding your answers.  Did you google?  ;-)

-l





Re: SAS preventive disk replacement

Gandalf Corvotempesta
2016-11-26 22:44 GMT+01:00 L.A. Walsh <[hidden email]>:
>    Do SAS disks have the same or similar fields for 'smartmon'
> as ATA/SATA disks?  FWIW -- nobody may have responded because,
> like me, no one really knew an authoritative answer.  That said,
> I'll just spout off "unauthoritatively"... (YMMV)...

No, totally different output:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.10.0+2] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               SEAGATE
Product:              ST300MP0005
Revision:             VT31
User Capacity:        300,000,000,000 bytes [300 GB]
Logical block size:   512 bytes
Logical Unit id:      0x5000c500962d6707
Serial number:        S7K0XLW9
Device type:          disk
Transport protocol:   SAS
Local Time is:        Sun Nov 27 10:34:14 2016 CET
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     35 C
Drive Trip Temperature:        60 C
Manufactured in week 03 of year 2016
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  13
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  242
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 2780454742
  Blocks received from initiator = 93770794
  Blocks read from cache and sent to initiator = 1077886413
  Number of read and write commands whose size <= segment size = 386420548
  Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 5519,43
  number of minutes until next internal SMART test = 20

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   2030615210        1         0  2030615211          1     266092,169           0
write:         0        0         0         0          0      11228,247           0
verify: 4273860022        0         0  4273860022          0      10059,776           0

Non-medium error count:        3
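
For anyone who wants to track these fields over time rather than read them by eye, here is a minimal Python sketch that pulls the grown defect count, the non-medium error count and the "Total uncorrected errors" column out of text like the output above. The function name `parse_sas_smart` and the sample string are made up for illustration, and the sketch assumes each read/write/verify row sits on a single line with seven numeric columns. It does not say which values justify a replacement; the thread does not settle that.

```python
import re

def parse_sas_smart(text):
    """Extract a few counters from smartctl text output for a SAS disk.

    Hypothetical helper: it only understands the field labels that
    appear in the output quoted in this thread.
    """
    result = {}
    m = re.search(r"Elements in grown defect list:\s*(\d+)", text)
    if m:
        result["grown_defects"] = int(m.group(1))
    m = re.search(r"Non-medium error count:\s*(\d+)", text)
    if m:
        result["non_medium_errors"] = int(m.group(1))
    for line in text.splitlines():
        m = re.match(r"\s*(read|write|verify):\s+(.*)", line)
        if not m:
            continue
        fields = m.group(2).split()
        # Only trust rows that have all seven columns; a row wrapped
        # across two lines would otherwise yield the wrong column.
        if len(fields) == 7:
            result[m.group(1) + "_uncorrected"] = int(fields[-1])
    return result

# Sample lines copied from the output above.
SAMPLE = """\
Elements in grown defect list: 0
read:   2030615210        1         0  2030615211          1     266092,169           0
write:         0        0         0         0          0      11228,247           0
verify: 4273860022        0         0  4273860022          0      10059,776           0
Non-medium error count:        3
"""

print(parse_sas_smart(SAMPLE))
```

Logging the returned dictionary on a schedule and watching for a counter that starts to climb is one possible use; what counts as "too high" is left open here.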



>    Good luck in finding your answers.  Did you google?  ;-)

Googled a lot, without success.


Re: SAS preventive disk replacement

Håkon Alstadheim
In reply to this post by L.A. Walsh


On 26 Nov 2016 22:44, L.A. Walsh wrote:

> Gandalf Corvotempesta wrote:
>> Should I look for "elements in grown defect list"?
>> Should I look for the uncorrected errors in the below table reporting
>> writes/reads/verifies?
>>
>> Should I look for something else in the "-x" output?
>>
> ----
>     Dang... that's one thing about smartmon, is that for better
> or worse, it makes the "call" based on its recorded data.  If SAS
> doesn't have similar, someone would have to know how the various
> parameters collected affect failure rate.
>
>     I think I read a report by google that said the single biggest
> correlating factor in failed disks was temperature -- though I don't
> know if it was 'max temperature' or 'daily-max-averaged' or what...
>
If you run your drives within the temperature tolerance, then what
matters most is temperature /variability/. I have some Seagate SAS
drives that have a maximum temperature gradient (10 deg./hour?) specified.
You should be able to keep temperature changes well below that.
Other than that, the total max-min span in temperature could also be meaningful.

In addition to Google, reading the specs for your drives could give some
insight.
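
As a rough illustration of the two temperature measures mentioned above, the Python sketch below computes the largest change per hour between consecutive readings and the total max-min span. The function names and the sample readings are invented for the example; the 10 deg./hour figure is only the uncertain value quoted above, so check the drive's own specification for the real limit.

```python
from datetime import datetime

def max_gradient_per_hour(samples):
    """Largest absolute temperature change, in deg C per hour, between
    consecutive (datetime, temp_c) samples sorted by time."""
    worst = 0.0
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        hours = (t1 - t0).total_seconds() / 3600.0
        if hours <= 0:
            continue  # skip duplicate or out-of-order timestamps
        worst = max(worst, abs(c1 - c0) / hours)
    return worst

def temperature_span(samples):
    """Difference between the highest and lowest recorded temperature."""
    temps = [c for _, c in samples]
    return max(temps) - min(temps)

# Invented readings, e.g. collected periodically from the
# "Current Drive Temperature" line of smartctl output.
samples = [
    (datetime(2016, 11, 27, 10, 0), 35),
    (datetime(2016, 11, 27, 10, 30), 37),
    (datetime(2016, 11, 27, 11, 30), 36),
]

print(max_gradient_per_hour(samples))  # 4.0 (2 deg C in half an hour)
print(temperature_span(samples))       # 2
```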




Re: SAS preventive disk replacement

Gandalf Corvotempesta

On 28 Nov 2016 22:36, "Håkon Alstadheim" <[hidden email]> wrote:
> If you run your drives within the temperature tolerance, then what
> matters most is temperature /variability/ . I have some seagate SAS
> drives that have a max temperature gradient (10 deg./ hour ?) specified.
> You should be able to keep temperature changes way lower than that.
> Other than that, total max-min span in temperature could also be meaningful.
>
> In addition to google, reading the spec.s on your drives could give some
> insight.
>

Temperature is OK.
So, are the grown defect list and the uncorrected errors reported by SMART not useful for proactive replacement?

I also see some strange values in the extended output (-x) that I don't know how to interpret.

If someone can give me some advice...

