smartd error [ATA error count increased]
I got a mail to root (the same error is also in syslog):
The following warning/error was logged by the smartd daemon:
Device: /dev/bus/0 [megaraid_disk_12] [SAT], ATA error count increased from 0 to 9
Device info:
ST13000NM0005-2A1201, S/N:ZVJ2GZSP, WWN:5-000c50-0b3bg13c3, FW:SN02, 12.0 TB
I then ran a short and an extended self-test on it; both completed without error:
$ smartctl -l selftest /dev/bus/0 -d megaraid,12
/dev/bus/0 [megaraid_disk_12] [SAT]: Device open changed type from 'megaraid,12' to 'sat+megaraid,12'
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 11445 -
# 2 Short offline Completed without error 00% 11427 -
In storcli:
$ sudo storcli /call show
Generating detailed summary of the adapter, it may take a while to complete.
Controller = 0
Status = Success
Description = None
Virtual Drives = 3
VD LIST :
=======
------------------------------------------------------------
DG/VD TYPE State Access Consist Cache sCC Size Name
------------------------------------------------------------
0/0 RAID1 Optl RW Yes RWBD - 223.062 GB
1/1 RAID1 Optl RW Yes RWTD - 3.492 TB
2/2 RAID10 Optl RW Yes RWBD - 32.740 TB
------------------------------------------------------------
Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|dgrd=Degraded
Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|B=Blocked|Consist=Consistent|
R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency
Physical Drives = 12
PD LIST :
=======
-----------------------------------------------------------------------------------
EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp
-----------------------------------------------------------------------------------
0:0 1 Onln 0 223.062 GB SATA SSD N N 512B INTEL SSDSC2******* U
0:1 2 Onln 0 223.062 GB SATA SSD N N 512B INTEL SSDSC2******* U
0:2 6 Onln 1 3.492 TB SATA SSD N N 512B Samsung SSD 883 DCT 3.84TB U
0:3 10 Onln 1 3.492 TB SATA SSD N N 512B Samsung SSD 883 DCT 3.84TB U
0:4 8 Onln 2 10.913 TB SATA HDD N N 512B ST13000NM0005-2A1201 U
0:5 11 Onln 2 10.913 TB SATA HDD N N 512B ST13000NM0005-2A1201 U
0:6 3 Onln 2 10.913 TB SATA HDD N N 512B ST13000NM0005-2A1201 U
0:7 4 Onln 2 10.913 TB SATA HDD N N 512B ST13000NM0005-2A1201 U
0:8 7 Onln 2 10.913 TB SATA HDD N N 512B ST13000NM0005-2A1201 U
0:9 12 Onln 2 10.913 TB SATA HDD N N 512B ST13000NM0005-2A1201 U
0:10 9 GHS - 10.913 TB SATA HDD N N 512B ST13000NM0005-2A1201 D
0:11 5 GHS - 10.913 TB SATA HDD N N 512B ST13000NM0005-2A1201 D
-----------------------------------------------------------------------------------
EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
SeSz-Sector Size|Sp-Spun|U-Up|D-Down|T-Transition|F-Foreign
UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded
=== START OF READ SMART DATA SECTION ===
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 575) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: (1125) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 082 064 044 Pre-fail Always - 157163520
3 Spin_Up_Time 0x0003 090 090 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 11
5 Reallocated_Sector_Ct 0x0033 098 098 010 Pre-fail Always - 9208
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 920073154
9 Power_On_Hours 0x0032 087 087 000 Old_age Always - 11564
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 11
187 Reported_Uncorrect 0x0032 091 091 000 Old_age Always - 9
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 071 048 040 Old_age Always - 29 (Min/Max 24/52)
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 25
193 Load_Cycle_Count 0x0032 093 093 000 Old_age Always - 15951
194 Temperature_Celsius 0x0022 029 052 000 Old_age Always - 29 (0 22 0 0 0)
195 Hardware_ECC_Recovered 0x001a 082 064 000 Old_age Always - 157163520
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0023 100 100 001 Pre-fail Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 2824 (96 42 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 3015256
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 3036621419213
SMART Error Log Version: 1
ATA Error Count: 9 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 9 occurred at disk power-on lifetime: 11377 hours (474 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
42 00 00 ff ff ff 4f 00 39d+08:08:45.869 READ VERIFY SECTOR(S) EXT
42 00 01 ff ff ff 4f 00 39d+08:08:45.176 READ VERIFY SECTOR(S) EXT
35 00 01 ff ff ff 4f 00 39d+08:08:45.175 WRITE DMA EXT
42 00 00 ff ff ff 4f 00 39d+08:08:29.406 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 39d+08:08:29.190 READ VERIFY SECTOR(S) EXT
Error 8 occurred at disk power-on lifetime: 11377 hours (474 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
42 00 00 ff ff ff 4f 00 39d+08:08:29.406 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 39d+08:08:29.190 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 39d+08:08:29.119 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 39d+08:08:28.814 READ VERIFY SECTOR(S) EXT
42 00 01 ff ff ff 4f 00 39d+08:08:28.143 READ VERIFY SECTOR(S) EXT
Error 7 occurred at disk power-on lifetime: 11377 hours (474 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
42 00 00 ff ff ff 4f 00 39d+08:08:18.534 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 39d+08:08:10.326 READ VERIFY SECTOR(S) EXT
42 00 01 ff ff ff 4f 00 39d+08:08:09.645 READ VERIFY SECTOR(S) EXT
35 00 01 ff ff ff 4f 00 39d+08:08:09.645 WRITE DMA EXT
42 00 00 ff ff ff 4f 00 39d+08:08:00.404 READ VERIFY SECTOR(S) EXT
Error 6 occurred at disk power-on lifetime: 11377 hours (474 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
42 00 00 ff ff ff 4f 00 39d+08:08:00.404 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 39d+08:08:00.226 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 39d+08:08:00.083 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 39d+08:07:57.845 READ VERIFY SECTOR(S) EXT
42 00 01 ff ff ff 4f 00 39d+08:07:57.215 READ VERIFY SECTOR(S) EXT
Error 5 occurred at disk power-on lifetime: 11377 hours (474 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
42 00 00 ff ff ff 4f 00 39d+08:07:37.571 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 39d+08:07:37.554 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 39d+08:07:37.536 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 39d+08:07:37.519 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 39d+08:07:32.546 READ VERIFY SECTOR(S) EXT
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 11445 -
# 2 Short offline Completed without error 00% 11427 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Can you please help me understand this and identify the failed disk, if any? Thanks.
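Yes, this disk, serial no. ZVJ2GZSP, is dying. storcli lists it as 10.913 TB while the smartd report says 12.0 TB; that is the same capacity, counted in binary and in decimal units respectively.
It has already used over 9000 sectors from its reserved area: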
5 Reallocated_Sector_Ct 0x0033 098 098 010 Pre-fail Always - 9208
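To keep an eye on just the counters that matter here, something like the following could be used. It is only a sketch: the function name `smart_summary` is made up for this example, the choice of attributes 5, 187, 197 and 198 is my own shortlist, and the sample lines are copied from the output in the question.

```shell
#!/bin/sh
# smart_summary (hypothetical helper): read `smartctl -A` output on stdin
# and print NAME=RAW_VALUE for attributes 5, 187, 197 and 198.
smart_summary() {
    awk '$1 ~ /^(5|187|197|198)$/ { printf "%s=%s\n", $2, $10 }'
}

# Sample attribute lines copied from the question
sample='  5 Reallocated_Sector_Ct 0x0033 098 098 010 Pre-fail Always - 9208
187 Reported_Uncorrect 0x0032 091 091 000 Old_age Always - 9
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0'

printf '%s\n' "$sample" | smart_summary

# On the real system (as root) the input would come from smartctl instead:
#   smartctl -A -d megaraid,12 /dev/bus/0 | smart_summary
```

Note that Reported_Uncorrect is 9, the same number as the ATA error count in the smartd mail.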
I think you can remove it safely, because every array on this RAID controller is redundant (RAID1 or RAID10) and all of them are Optimal. First set this disk to Offline. Once the disk is offline, one of the GHS (global hot spare) drives should kick in and replace it in its array.
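This disk is in enclosure bay no. 9: smartctl's megaraid,12 is the drive with DID 12, which the PD LIST shows at EID:Slt 0:9 (remember, slots are 0-based, so it is the 10th one). However, to be certain, use the locate function of your enclosure before pulling anything.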
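The DID-to-slot lookup above can be done mechanically. A minimal sketch, assuming the PD LIST layout shown in the question (EID:Slt in the first column, DID in the second); `did_to_slot` is a name invented for this example and the sample rows are copied from the question.

```shell
#!/bin/sh
# did_to_slot DID (hypothetical helper): read storcli PD LIST rows on stdin
# and print the EID:Slt of the drive whose DID matches.
did_to_slot() {
    awk -v did="$1" '$1 ~ /^[0-9]+:[0-9]+$/ && $2 == did { print $1 }'
}

# Three rows copied from the PD LIST in the question
pd_rows='0:8 7 Onln 2 10.913 TB SATA HDD N N 512B ST13000NM0005-2A1201 U
0:9 12 Onln 2 10.913 TB SATA HDD N N 512B ST13000NM0005-2A1201 U
0:10 9 GHS - 10.913 TB SATA HDD N N 512B ST13000NM0005-2A1201 D'

printf '%s\n' "$pd_rows" | did_to_slot 12    # prints 0:9

# On the real system:
#   storcli /c0 show | did_to_slot 12
# and then blink the bay LED to double-check before touching the drive:
#   storcli /c0/e0/s9 start locate
#   storcli /c0/e0/s9 stop locate
```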
For details, see the controller manual.
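As a rough outline of the offline-and-pull procedure with storcli, the sketch below only prints the commands so they can be reviewed first; it does not run them. Controller 0, enclosure 0 and slot 9 are taken from the output in the question, `replace_plan` is a name made up here, and the exact sequence (in particular whether `set missing` is needed) is my assumption and may differ by controller and firmware, so check it against the manual.

```shell
#!/bin/sh
# replace_plan CTRL EID SLOT (hypothetical helper): print, but do not execute,
# the storcli commands for taking one drive out of service.
replace_plan() {
    pd="/c$1/e$2/s$3"
    echo "storcli $pd start locate"                 # blink the bay LED
    echo "storcli $pd set offline"                  # hot spare should take over
    echo "storcli $pd set missing"                  # drop it from the drive group
    echo "storcli $pd spindown"                     # prepare for physical removal
    echo "storcli /c$1/eall/sall show rebuild"      # watch the spare rebuild
}

replace_plan 0 0 9
```

Once the printed commands look right, they can be run one at a time with sudo, checking `storcli /c0 show` between steps to confirm the virtual drive state.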