SMART - Seek Error Rate
I have read that the seek error rate is an incrementing count of track seeks, and that the count resets to zero after a fixed number (some thousands) of seek commands. This is evident in some of the Backblaze hard drives (see Figure 1 below).
In Figure 1 the seek rate for the hard drive increases up to and including day 234. The count is then reset on day 235.
Is this incremental count just the total time that the drive has taken to locate a specific piece of stored data?
Does anyone know why this count is reset and whether it means anything? That is, does resetting just reset the count, or does it perhaps mean that the disk's seek rate is restored to as good as new at day 235?
I am wondering if I can visualise the seek error rate as in Figure 2. Figure 2 (if my understanding is correct) is the total time that the drive has taken to locate a specific piece of stored data, without the count reset at day 235. If the count reset does not improve the health of the disk, and does not affect the seek rate afterwards, then I guess this is fine.
The counters are reset like an odometer rolling over after running out of digits. Different device controllers roll over at different thresholds, but a count of 0 does not mean that the drive is without errors, just as a vehicle with 1,000,010 km on the odometer is not "fresh off the assembly line".
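To make the odometer idea concrete, here is a minimal sketch of how a resetting counter could be turned into a running total like the one in Figure 2. It assumes that any drop between two consecutive samples is a reset to zero, and that the counter resets at most once between samples. The function name `unwrap_counter` and the sample readings are made up for illustration. How a particular drive encodes the raw Seek Error Rate value is vendor-specific, so treat the result as a rough picture rather than an exact figure.

```python
from typing import Iterable, List


def unwrap_counter(samples: Iterable[int]) -> List[int]:
    """Turn a counter that periodically resets into a running total.

    Assumption: a value lower than the previous one means the counter
    was reset to zero once since the previous sample.
    """
    totals: List[int] = []
    offset = 0       # everything counted before the most recent reset
    previous = None  # last raw reading seen
    for value in samples:
        if previous is not None and value < previous:
            # The counter went backwards: treat it as a rollover and
            # carry the last reading before the reset into the offset.
            offset += previous
        totals.append(offset + value)
        previous = value
    return totals


# Hypothetical daily readings with one reset in the middle
daily = [10, 25, 40, 5, 12]
print(unwrap_counter(daily))  # [10, 25, 40, 45, 52]
```

Anything counted between the last sample before a reset and the reset itself is lost, so the running total is a lower bound. Sampling more often narrows that gap.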
If you would like to build a graph as seen in Figure 2, you could write a little data collection utility that reads the SMART information off your storage device and records it in a database (or anywhere you see fit, really). The smartmontools package is the one I usually reach for to display storage device info.
You can install it like this:
- Open Terminal (if it's not already open)
- Install the smartmontools package:
  sudo apt install smartmontools
- Query a storage medium, for example, an NVMe device:
  sudo smartctl --all /dev/nvme0n1
This will give you a lot of information:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.11.0-17-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLW512HMJP-000L7
Serial Number:                      S359NX0K103156
Firmware Version:                   7L7QCXY7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Utilization:            81,254,830,080 [81.2 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 b181b5c4a3
Local Time is:                      Thu May 27 21:57:29 2021 JST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Warning  Comp. Temp. Threshold:     69 Celsius
Critical Comp. Temp. Threshold:     72 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.60W       -        -    0  0  0  0        0       0
 1 +     6.00W       -        -    1  1  1  1        0       0
 2 +     5.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        33 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    20,937,566 [10.7 TB]
Data Units Written:                 26,780,407 [13.7 TB]
Host Read Commands:                 359,002,242
Host Write Commands:                683,010,154
Controller Busy Time:               5,130
Power Cycles:                       1,027
Power On Hours:                     3,812
Unsafe Shutdowns:                   85
Media and Data Integrity Errors:    0
Error Information Log Entries:      719
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               33 Celsius
Temperature Sensor 2:               39 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        719     0  0x0008  0x4004      -            0     0     -
  1        718     0  0x0008  0x4004      -            0     0     -
  2        717     0  0x0008  0x4004      -            0     0     -
  3        716     0  0x0008  0x4004      -            0     0     -
  4        715     0  0x0008  0x4004      -            0     0     -
  5        714     0  0x0008  0x4004      -            0     0     -
  6        713     0  0x0008  0x4004      -            0     0     -
  7        712     0  0x0008  0x4004      -            0     0     -
  8        711     0  0x0008  0x4004      -            0     0     -
  9        710     0  0x0008  0x4004      -            0     0     -
 10        709     0  0x0008  0x4004      -            0     0     -
 11        708     0  0x0008  0x4004      -            0     0     -
 12        707     0  0x0008  0x4004      -            0     0     -
 13        706     0  0x0008  0x4004      -            0     0     -
 14        705     0  0x0008  0x4004      -            0     0     -
 15        704     0  0x0008  0x4004      -            0     0     -
... (48 entries not read)
This is probably a bit too much information, so you can get just the error log like this:
sudo smartctl -l error /dev/nvme0n1
The above command returns the same output as seen in the "Error Information" section from the previous command. Note that NVMe devices will return at most 16 entries by default. If you are querying an NVMe device that has more, you can specify the number of entries to return like this:
sudo smartctl -l error,64 /dev/nvme0n1
For my device, the error log holds 64 entries in total (the output above reports "16 of 64 entries"), so I would add ,64 to the command above. The maximum number of entries is device-specific, up to 256.
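As for the little data collection utility suggested earlier, here is a rough sketch of what it could look like. It assumes smartctl 7.0 or newer (for the --json option), a SATA/ATA hard drive that reports attribute 7 (Seek_Error_Rate), and my understanding of the JSON layout, where attributes are listed under ata_smart_attributes.table. The function names, the table name seek_errors and the idea of storing readings in SQLite are my own choices, not anything smartmontools prescribes.

```python
import json
import sqlite3
import subprocess
import time
from typing import Optional


def read_smart_json(device: str) -> dict:
    """Run smartctl and parse its JSON report (needs root privileges)."""
    # smartctl uses its exit status as a bitmask, so a non-zero status
    # does not necessarily mean the command failed: do not use check=True.
    result = subprocess.run(
        ["smartctl", "--all", "--json", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(result.stdout)


def seek_error_raw(report: dict) -> Optional[int]:
    """Return the raw value of ATA attribute 7, or None if it is absent."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    for attribute in table:
        if attribute.get("id") == 7:
            return attribute["raw"]["value"]
    return None  # e.g. NVMe devices, which have no seek attribute


def record(conn: sqlite3.Connection, device: str, value: int,
           timestamp: Optional[float] = None) -> None:
    """Append one reading to the (hypothetical) seek_errors table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seek_errors "
        "(ts REAL, device TEXT, raw_value INTEGER)"
    )
    ts = time.time() if timestamp is None else timestamp
    conn.execute("INSERT INTO seek_errors VALUES (?, ?, ?)",
                 (ts, device, value))
    conn.commit()


def collect_once(device: str, db_path: str) -> None:
    """Take one reading from the device and store it, if available."""
    raw = seek_error_raw(read_smart_json(device))
    if raw is not None:
        conn = sqlite3.connect(db_path)
        record(conn, device, raw)
        conn.close()
```

Calling collect_once("/dev/sda", "smart_history.db") as root from a daily cron job (both the device path and the file name are placeholders) would build up a history that you can later plot.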
Hope this gives you a wealth of information to play with and track.