General questions: self-tests / ATA attributes / SCSI sense / smart return status


Michael Woon
Hi Smartmontools devs,


I'm writing for a bit of clarity about self-tests and what records they produce, and the smart health call.


As I understand from the documentation, calling "smartctl -H":

-returns the result of the SMART RETURN STATUS command -or- checks if any ATA attributes exceed thresholds in ATA drives
-checks for any error codes in the SCSI sense buffer.


As I understand from the documentation, fore- and back-ground checks update the self-test error log and certain ATA attributes, when they run. 

(does a self-test also update the SCSI sense buffer or any kind of stored values for SCSI devices?)
 


My main questions are:

-what does SMART RETURN STATUS evaluate?

alternatively stated:
-does the command -only- look at ATA attributes stored in the table and error codes in the SCSI sense buffer? or is the content of the self-test error log also a factor?

bottom line:
-If I want to be sure of the health of a disk, can I trust the smart health status (to include the result of the self-tests) or do I have to look at -both- the health status and the self-test error log?

or do I have the wrong angle on this:
-simply watch for a '0' exit code for an "all okay"?

minor questions about the exit codes:
-is it possible to have a set bit 3 (device failing) without a set bit 4 (attributes over threshold), and vice versa?
-at what point does a SCSI drive set the 6th bit in the error code? I have drives (SAS) that have some errors in their smartctl output, but don't set this bit when smartctl is run on them.
-does bit 7 really only work for SATA drives? (SCSI drives have a self test log too)



I'm monitoring on the order of 100 devices, ranging from a pair of 200G SATA SSDs behind a RAID controller to a shelf full of 4TB SAS platters.
I've been looking at a few different nagios plugins and notice that about half of them check only attributes and health status, the other half also look at the error log, and only one watches the exit code.

CentOS 7, 3.10 kernel, smartmontools 6.2.4



Thanks!
Michael

------------------------------------------------------------------------------

_______________________________________________
Smartmontools-support mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/smartmontools-support
Re: General questions: self-tests / ATA attributes / SCSI sense / smart return status

Christian Franke
Michael Woon wrote:

> Hi Smartmontools devs,
>
>
> I'm writing for a bit of clarity about self-tests and what records
> they produce, and the smart health call.
>
>
> As I understand from the documentation, calling "smartctl -H":
>
> -returns the result of the SMART RETURN STATUS command -or- checks
> if any ATA attributes exceed thresholds in ATA drives

Yes.


> -checks for any error codes in the SCSI sense buffer.
>

It checks the ASC/ASCQ in the SCSI IE (Informational Exceptions) log
page (if supported) or in the result of the REQUEST SENSE command.
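
A minimal sketch of how such an ASC/ASCQ pair could be classified, in Python. The two ASC values come from the SCSI additional sense code tables (0x5D is the FAILURE PREDICTION THRESHOLD EXCEEDED family, 0x0B the WARNING family). The function name `classify_ie` and the returned labels are hypothetical, and this is not taken from smartctl's source.

```python
# ASC values from the SCSI additional sense code tables. How smartctl
# itself maps them is an assumption here, not something stated in this thread.
ASC_FAILURE_PREDICTION = 0x5D  # e.g. 5D/00: failure prediction threshold exceeded
ASC_WARNING = 0x0B             # e.g. 0B/01: warning, specified temperature exceeded


def classify_ie(asc, ascq):
    """Classify an ASC/ASCQ pair read from the IE log page or REQUEST SENSE."""
    if asc == ASC_FAILURE_PREDICTION:
        return "failing"
    if asc == ASC_WARNING:
        return "warning"
    if asc == 0 and ascq == 0:
        return "ok"
    return "other"


print(classify_ie(0x5D, 0x00))
print(classify_ie(0x00, 0x00))
```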

Most remaining answers are for ATA/SATA only. SCSI/SAS differs
considerably. Some SCSI expert on this list might want to answer.


> As I understand from the documentation, fore- and back-ground checks
> update the self-test error log and certain ATA attributes, when they run.

There is no ATA "self-test error log". On completion of a self-test, a
new entry is usually added to the ATA self-test log(s). The ATA error
log(s) are typically not updated on read errors found during a self-test.


> My main questions are:
>
> -what does SMART RETURN STATUS evaluate?

Anything the author of the drive firmware decided to evaluate :-)

Recent versions of ATA ACS standards say:
"The SMART RETURN STATUS command causes the device to communicate the
reliability status of the device to the host."
If the command returns failure (0x2c, 0xf4): "The device has detected a
threshold exceeded condition."
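
As a rough illustration of that check, the sketch below decodes the two register values a host reads back after SMART RETURN STATUS. It assumes the pair quoted above is (LBA High, LBA Mid) and that an unchanged 0xC2/0x4F signature means no threshold was exceeded. The function name `decode_return_status` is hypothetical.

```python
def decode_return_status(lba_mid, lba_high):
    """Interpret the LBA Mid / LBA High registers after SMART RETURN STATUS."""
    if (lba_high, lba_mid) == (0xC2, 0x4F):
        # Signature left unchanged: no threshold exceeded condition.
        return "ok"
    if (lba_high, lba_mid) == (0x2C, 0xF4):
        # Device reports a threshold exceeded condition.
        return "threshold exceeded"
    # Anything else is outside the two documented outcomes.
    return "unknown"


print(decode_return_status(lba_mid=0xF4, lba_high=0x2C))
```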

Note that ATA SMART Attributes are not part of the standard. The SMART
READ THRESHOLDS command was declared obsolete in ATA-4 (1998).


>
> alternatively stated:
> -does the command -only- look at ATA attributes stored in the table
> and error codes in the SCSI sense buffer? or is the content of the
> self-test error log also a factor?
>

SMART RETURN STATUS does not return failure just because a Read/Write
error occurred. It usually will return failure if the number of spare
blocks for reallocation is below some threshold.


> bottom line:
> -If I want to be sure of the health of a disk, can I trust the smart
> health status (to include the result of the self-tests) or do I have
> to look at -both- the health status and the self-test error log?

If you want to proactively replace drives, I would recommend watching
the number of reallocated sectors (e.g. use smartd with '-R 5! -r 5!'
directive). A failing SMART STATUS may occur (too?) late.
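
For a monitoring plugin that polls smartctl rather than running smartd, the same idea can be sketched in Python: pull the raw value of attribute 5 out of `smartctl -A` text output. The `SAMPLE` table is made up for illustration and only mimics the usual column layout. The function name `reallocated_sectors` is hypothetical, and the parser assumes the raw value is a plain integer in the tenth column.

```python
def reallocated_sectors(smartctl_a_output):
    """Return the raw value of attribute 5, or None if it is not listed."""
    for line in smartctl_a_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; RAW_VALUE is column 10.
        if len(fields) >= 10 and fields[0] == "5":
            return int(fields[9])
    return None


# Illustrative text only, shaped like the attribute table of `smartctl -A`.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21342
"""

print(reallocated_sectors(SAMPLE))
```

Comparing the returned count with the one stored from the previous poll gives roughly the "raw value changed" signal that the smartd directive is meant to report.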


>
> or do I have the wrong angle on this:
> -simply watch for a '0' exit code for an "all okay"?
>

It depends, see above.


> minor questions about the exit codes:
> -is it possible to have a set bit 3 (device failing) without a set bit
> 4 (attributes over threshold), and vice versa?

Yes: for example, if SMART RETURN STATUS returned failure but no
attribute is <= threshold in the SMART DATA block, or if the SMART READ
THRESHOLD command did not work, etc.


> -at what point does a SCSI drive set the 6th bit in the error code? I
> have drives (SAS) that have some errors in their smartctl output, but
> don't set this bit when smartctl is run on them.

For some historic reason, bit 6 was never implemented for SCSI.


> -does bit 7 really only work for SATA drives? (SCSI drives have a self
> test log too)

Yes, it works "better" for ATA because a newer long test that completes
without error clears the bit.
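
Since the exit status is a bit mask, a plugin that watches it has to test the bits individually rather than compare against single values. A minimal Python sketch follows; the descriptions paraphrase the smartctl man page, and the names `EXIT_BITS` and `decode_exit_status` are hypothetical.

```python
# Bit meanings paraphrased from the smartctl man page (EXIT STATUS section).
EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed",
    2: "a SMART or other ATA command failed",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes <= threshold",
    5: "attributes were <= threshold at some time in the past",
    6: "device error log contains records of errors",
    7: "self-test log contains records of errors",
}


def decode_exit_status(status):
    """Return the descriptions of all bits set in a smartctl exit status."""
    return [text for bit, text in EXIT_BITS.items() if status & (1 << bit)]


# Bit 3 set without bit 4, the combination asked about above.
print(decode_exit_status(0b00001000))
```

Because bit 6 is not implemented for SCSI, a zero exit status from a SAS drive does not by itself rule out entries in its error counters.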

Thanks,
Christian


Re: General questions: self-tests / ATA attributes / SCSI sense / smart return status

Michael Woon
I realised I didn't reply to the list with this question.

I just have a tiny clarification to ask for on this point:

- If SMART RETURN STATUS returns "OK" and attributes are <= threshold, does smartctl still report health as "OK / PASSED"? (and in this case, return with bit 3 not set and bit 4 set?)

I've put the rest of the questions in another, more accurately named thread.

Thanks!
Michael

2016-11-26 16:15 GMT+01:00 Michael Woon <[hidden email]>:
Hi Christian,

Thanks for the quick and excellent answer, that really cleared up a lot of things for me.


I just have a couple of quick follow-ups:

- If SMART RETURN STATUS returns "OK" and attributes are <= threshold, does smartctl still report health as "OK / PASSED"? (and in this case, return with bit 3 not set and bit 4 set?)


For any SCSI experts out there:

- How do I proactively watch drive health for replacement? With SATA, I watch reallocated sectors, amongst other things, and there's generally a lot of documentation and discussion about this, but with SCSI, I ______? (really couldn't find anything at all)

- I've been asking all these questions with the assumption that smartctl is the tool for this job. Could I be wrong on this?



Thanks again!
Michael


