HD dying?

David Niklas
Hello,
First of all, I *have* been backing up my data.
I'm going to post LOTS of details here, feel free to skim.

My problem is that once upon a time my drive failed less than 6 months after I
bought my laptop. Sending it to a professional did not help, nor did
replacing the PCB; it was dead.
The symptoms leading up to the event were a sudden freeze of the OS. I was
not too bright about Linux at the time, so I thought that perhaps X froze.
Now I'm getting the identical thing, a sudden freeze. I can ping the
kernel, but I cannot restore the frame buffer, sync, or umount the
file systems. My syslog daemon, metalog, records no messages during this
period, although it is set to sync the dmesg messages. I cannot ssh, but I
can use SysRq to reboot. I'm using OpenRC.

This has happened two or three times.
I just ran a self test and it says PASSED, I'm not seeing anything that
stands out.

smartmontools-6.4
Gentoo Linux 4.9.x

Below is my S.M.A.R.T. data. BTW: it is unwrapped.
What do you think?
Thanks, David


=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue Mobile
Device Model:     WDC WD7500BPVX-22JC3T0
Serial Number:    WD-WXC1A14E1823
LU WWN Device Id: 5 0014ee 209f3d675
Firmware Version: 01.01A01
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Apr  4 14:47:54 2017 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection: (13920) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time: (   2) minutes.
Extended self-test routine
recommended polling time: ( 157) minutes.
Conveyance self-test routine
recommended polling time: (   5) minutes.
SCT capabilities:       (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   196   179   021    Pre-fail  Always       -       1166
  4 Start_Stop_Count        0x0032   058   058   000    Old_age   Always       -       42367
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       13532
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1695
191 G-Sense_Error_Rate      0x0032   001   001   000    Old_age   Always       -       124
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       158
193 Load_Cycle_Count        0x0032   183   183   000    Old_age   Always       -       51774
194 Temperature_Celsius     0x0022   107   091   000    Old_age   Always       -       40
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     13529         -
# 2  Extended offline    Completed without error       00%     12288         -
# 3  Extended offline    Completed without error       00%      9247         -
# 4  Extended offline    Completed without error       00%      7609         -
# 5  Extended offline    Completed without error       00%      5469         -
# 6  Short offline       Completed without error       00%         0         -
# 7  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Smartmontools-support mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/smartmontools-support

Re: HD dying?

Carlos E. R.
On 2017-04-06 02:26, David Niklas wrote:

> What do you think?
No evidence of problem here, that I can see.

If it were the disk, you typically would see messages of the kernel
complaining in "dmesg".

--
Cheers / Saludos,

                Carlos E. R.

  (from 42.2 x86_64 "Malachite" (Minas Tirith))


Re: HD dying?

Robert Spotswood
In reply to this post by David Niklas
I'll second "Carlos E. R."'s verdict. I see nothing wrong either. However,
that does not guarantee there isn't something wrong. Somewhere I read a
study that said SMART only predicts about 60% of hard drive failures. The
other 40% give no warning.

Backups are always a good idea. They protect not only against hard drive
failures, but also accidental or malicious data loss. Now have you tested
those backups? I remember when I was free-lance going to a brand new
client (first visit). They needed me to do a restore. OK, got their backup
media (it was back in the zip disk days). Every disk was write-protected,
and blank. Needless to say, that day didn't go well.

Re: HD dying?

Robin H. Johnson
In reply to this post by David Niklas
On Wed, Apr 05, 2017 at 08:26:12PM -0400, David Niklas wrote:
> The symptoms leading up to the event were a sudden freeze of the OS. I was
> not too bright about Linux at the time, so I thought that perhaps X froze.
> Now I'm getting the identical thing, a sudden freeze. I can ping the
> kernel, I cannot restore the frame buffer, sync, or umount the
> file systems. My syslog metalog records no messages during this period,
> it is set to sync the dmesg messages. I cannot ssh, but I can use SysRq
> to reboot. I'm using OpenRC.
Metalog would only be useful if writes to disk were succeeding. It's
certainly possible for the kernel to hang in such a state that there is a
kernel panic, and writes to disk are not happening (this includes
sending the sysrq-sync command).

That you can ping the kernel simply says that there's enough left
running for the kernel to handle ICMP without going to userspace.

That you can't SSH says something in userspace failed (which could be a
myriad of reasons).

Just because the system seems to freeze does not mean that the drive is
faulty. Also entirely possible there is a logged drive event in dmesg
that you can't see.

If you can repeat it, consider some of the following to get a better
insight as to what's going on.
- set up serial kernel console or network kernel console logging.
- set up kdump or similar.

That's not to say that the drive isn't the source of the problem, just
that it's not likely based on the output you've shown.

You say this is a laptop, and the drive by power-on hours has racked up
~1.5 years of usage, so it possibly hasn't been opened in at least that
long. How much dust has built up inside it? Overheating of the graphics
CAN cause the symptoms you've described.
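
As a rough check of the ~1.5 year figure, the raw counters from the
attribute table in the original post can be converted by hand. The sketch
below is illustrative only: it assumes the Power_On_Hours raw value really
is in hours (drives differ on this), and the helper names are made up for
the example.

```python
# Raw values copied from the smartctl attribute table in the original post.
POWER_ON_HOURS = 13532     # attribute 9
POWER_CYCLE_COUNT = 1695   # attribute 12
LOAD_CYCLE_COUNT = 51774   # attribute 193

HOURS_PER_YEAR = 24 * 365.25


def powered_years(hours: int) -> float:
    """Convert powered-on hours to years of continuous running."""
    return hours / HOURS_PER_YEAR


def per_hour(count: int, hours: int) -> float:
    """Average number of events per powered-on hour."""
    return count / hours


if __name__ == "__main__":
    print(f"powered on: {powered_years(POWER_ON_HOURS):.2f} years")
    print(f"load cycles per hour: {per_hour(LOAD_CYCLE_COUNT, POWER_ON_HOURS):.1f}")
    print(f"hours per power cycle: {POWER_ON_HOURS / POWER_CYCLE_COUNT:.1f}")
```

This is time with the drive powered, not calendar time since purchase, so
the laptop itself may well be older than that.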

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Trustee & Treasurer
E-Mail   : [hidden email]
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136

Re: HD dying?

David Niklas
On Thu, 6 Apr 2017 20:51:51 +0000
"Robin H. Johnson" <[hidden email]> wrote:

> If you can repeat it, consider some of the following to get a better
> insight as to what's going on.
> - set up serial kernel console or network kernel console logging.
> - set up kdump or similar.
No, it's random so far.

> That's not to say that the drive isn't the source of the problem, just
> that it's not likely based on the output you've shown.
Why not?
What else causes all writes to the drive to stop except a problem with
the drive or MB (my laptop has no cabling)?

> You say this is a laptop, and the drive by power hours has racked up
> ~1.5 years of usage, so it possibly hasn't been opened in at least that
> long. How much dust has built up inside it? Overheating of the graphics
> CAN cause the symptoms you've described.
The laptop is my primary way to get online; it's not been left off for more
than 2 days unless its HW failed (the original drive died).


So, I'm not misreading the S.M.A.R.T. data? No values that ought to be
interpreted in HEX, OCTAL or something?


Thanks,
David

Re: HD dying?

Robin H. Johnson
On Fri, Apr 07, 2017 at 04:28:43PM -0400, David Niklas wrote:
...
> > If you can repeat it, consider some of the following to get a better
> > insight as to what's going on.
> > - set up serial kernel console or network kernel console logging.
> > - set up kdump or similar.
> No, it's random so far.
Ok, get yourself network console logging, since networking was still
working, and you can just let the kernel send a copy of all klog entries
over the network.

In the kernel sources, see Documentation/networking/netconsole.txt,
or the examples in the Ubuntu & Arch wikis.
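
On the receiving machine, anything that listens on the chosen UDP port will
do, since netconsole just sends the kernel log text as UDP datagrams. Here
is a minimal sketch of such a receiver, assuming the default target port
6666 from the netconsole documentation. The function names and the log file
name are hypothetical, chosen only for this example.

```python
import socket
from typing import Tuple


def make_listener(host: str = "0.0.0.0", port: int = 6666) -> socket.socket:
    """Open a UDP socket on the port the sending kernel is pointed at."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    return sock


def receive_once(sock: socket.socket, bufsize: int = 4096) -> Tuple[str, str]:
    """Block until one datagram arrives; return (sender address, text)."""
    data, (addr, _port) = sock.recvfrom(bufsize)
    return addr, data.decode("utf-8", errors="replace")


def log_forever(sock: socket.socket, path: str = "netconsole.log") -> None:
    """Append every received message to a file, flushing after each one."""
    with open(path, "a", encoding="utf-8") as log:
        while True:
            sender, text = receive_once(sock)
            if not text.endswith("\n"):
                text += "\n"
            log.write(f"{sender} {text}")
            log.flush()  # keep the file current in case the sender hangs
```

Running `log_forever(make_listener())` on a second machine and leaving it
running until the next freeze should capture whatever the kernel printed
last, even if the laptop's own disk had stopped accepting writes.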

> > That's not to say that the drive isn't the source of the problem, just
> > that it's not likely based on the output you've shown.
> Why not?
> What else causes all writes to the drive to stop except a problem with
> the drive or MB (my laptop has no cabling)?
Most failure modes of a spinning drive would cause various error
counters to be incremented. The few I can think of that wouldn't
all involve specific component failures on the drive PCB.

Drive PCB-originating failures should NOT cause your video to lock up,
but may stop the logging to disk of any errors.

I can start up a linux system, running off a sata drive, open a
terminal, suddenly disconnect the drive, and still be able to run dmesg
and/or see live kernel log entries (Provided that dmesg itself is at
least already cached and running doesn't need anything to be read off
disk).

So what we're looking for as root cause is some manner of error that
causes both video & drive to become unresponsive, but the kernel to
still respond to ICMP ping (ergo network stack is operational).

That root cause COULD have other effects (like a power spike that then
damages the drive PCB), but it's the root cause we care about.

One candidate is overheating causing a component fault (like causing a
capacitor to go out of tolerance or fail) on one of the PCI/PCIe buses,
thereby affecting the drive & graphics. The networking might be on a
different bus, and so continue to function.

> So, I'm not misreading the S.M.A.R.T. data? No values that ought to be
> interpreted in HEX, OCTAL or something?
No, the drive data seems good, and representative of a healthy &
well-used drive. No reallocated sectors, no other issues, not that many
power cycles even for a laptop drive w/ aggressive power saving.
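
To make that reading of the table concrete, here is a small sketch that
applies the usual smartctl rule (an attribute counts as failed when its
normalized VALUE is at or below THRESH) to a few rows copied from the
original post. The `Attr` tuple, `parse_attributes`, `findings`, and the
choice of raw counters to watch are assumptions for illustration, not part
of smartmontools. The zero-padded columns are read as plain decimal, not
octal or hex.

```python
from typing import List, NamedTuple


class Attr(NamedTuple):
    id: int
    name: str
    value: int
    worst: int
    thresh: int
    raw: str


# A few rows copied from the attribute table in the original post.
SAMPLE = """\
  4 Start_Stop_Count        0x0032   058   058   000    Old_age   Always       -       42367
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0032   001   001   000    Old_age   Always       -       124
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
"""

# Assumption: raw counters where any non-zero value deserves a closer look.
RAW_COUNTERS = {5, 196, 197, 198}


def parse_attributes(text: str) -> List[Attr]:
    """Parse unwrapped attribute rows; lines that are not rows are skipped."""
    attrs = []
    for line in text.splitlines():
        f = line.split(None, 9)  # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(f) == 10 and f[0].isdigit() and f[2].startswith("0x"):
            attrs.append(Attr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), f[9].strip()))
    return attrs


def findings(attrs: List[Attr]) -> List[str]:
    """Return human-readable warnings; an empty list means nothing stood out."""
    out = []
    for a in attrs:
        if a.value <= a.thresh:
            out.append(f"{a.name}: VALUE {a.value} is at or below THRESH {a.thresh}")
        if a.id in RAW_COUNTERS and a.raw.split()[0] != "0":
            out.append(f"{a.name}: raw count is {a.raw}")
    return out
```

For the rows above, `findings(parse_attributes(SAMPLE))` comes back empty,
which matches the conclusion that nothing in the table points at the drive.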

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Trustee & Treasurer
E-Mail   : [hidden email]
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
