Did I just fail at hot plugging a drive?

I have a small home server built around a Lian Li PC-Q25 case with a SATA backplane that is advertised as hot-pluggable. The motherboard is an Asus P8H77-I. I have 4 SATA drives attached to the backplane: two pairs, each built into a RAID1 array. The system is running CentOS 6.3 x86_64.

One of the drives broke down, so I did the recommended procedure: synced, removed it from the array, shut it down properly and pulled it out. No disaster here, I could hear the drive spin down and no errors appeared in the dmesg log.
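
For readers following along, here is a rough dry-run sketch of what that removal sequence can look like. The array, partition, and disk names (`/dev/md0`, `/dev/sdc1`, `sdc`) are placeholders, not the ones from this server, and using the sysfs `delete` file as the "shut it down" step is an assumption about the procedure. The script only prints the commands; it does not run them.

```python
import shlex


def removal_commands(array, member, disk):
    """Build (but do not run) the commands for pulling one RAID member.

    All three arguments are hypothetical example names.
    """
    return [
        ["sync"],                               # flush pending writes
        ["mdadm", array, "--fail", member],     # mark the member as faulty
        ["mdadm", array, "--remove", member],   # detach it from the array
        # Writing 1 to this sysfs file asks the kernel to detach the disk.
        ["sh", "-c", f"echo 1 > /sys/block/{disk}/device/delete"],
    ]


if __name__ == "__main__":
    for cmd in removal_commands("/dev/md0", "/dev/sdc1", "sdc"):
        print(shlex.join(cmd))
```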

Now, I assumed that, per the SATA standard, the staggered pins on the drive would ensure a safe plug-in without any sudden power surge. As I pushed the drive in, I could hear the other drives slow down and click their heads for a very brief moment.

Checking the dmesg log revealed the following:

ata1: exception Emask 0x10 SAct 0x0 SErr 0x4090000 action 0xe frozen
ata3.00: exception Emask 0x10 SAct 0x3ff007ff SErr 0x4890000 action 0xe frozen
ata3.00: irq_stat 0x08400040, interface fatal error, connection status changed
ata3: SError: { PHYRdyChg 10B8B LinkSeq DevExch }
ata3.00: failed command: WRITE FPDMA QUEUED
ata3.00: cmd 61/80:00:3f:81:ca/00:00:00:00:00/40 tag 0 ncq 65536 out
         res 40/00:54:bf:81:ca/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }

(The last ata3.00 messages are repeated about 20 times with different numbers but the same text)

The last lines are:

ata3.00: status: { DRDY }
ata3: hard resetting link
ata2: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
ata2: irq_stat 0x00400040, connection status changed
ata2: SError: { HostInt PHYRdyChg 10B8B DevExch }
ata2: hard resetting link
ata1: irq_stat 0x00400040, connection status changed
ata1: SError: { PHYRdyChg 10B8B DevExch }
ata1: hard resetting link
ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: configured for UDMA/133
ata2: EH complete
ata3.00: configured for UDMA/133
ata3: EH complete
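
A minimal sketch for pulling the negotiated link speed per port out of a log like the one above. The function name `link_speeds` and the sample string are illustrative only. It shows that ata3 came back at 3.0 Gbps while ata1 and ata2 came back at 6.0 Gbps, but it cannot say whether that is a downgrade or simply that drive's normal speed.

```python
import re

# Matches lines such as "ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
# with or without a leading dmesg timestamp.
LINK_UP = re.compile(r"(ata\d+): SATA link up ([\d.]+) Gbps")


def link_speeds(log_text):
    """Return {port: Gbps} using the last 'link up' line seen for each port."""
    speeds = {}
    for line in log_text.splitlines():
        match = LINK_UP.search(line)
        if match:
            speeds[match.group(1)] = float(match.group(2))
    return speeds


sample = """\
ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata1.00: configured for UDMA/133
ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
"""

if __name__ == "__main__":
    print(link_speeds(sample))  # {'ata1': 6.0, 'ata2': 6.0, 'ata3': 3.0}
```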

Also my logwatch reported the following changes in SMART data:

/dev/disk/by-path/pci-0000:00:1f.2-scsi-2:0:0:0 [SAT] :
    Prefailure: Raw_Read_Error_Rate (1) changed to
          100,
    Prefailure: Reallocated_Sector_Ct (5) changed to
      200,
    Prefailure: Spin_Up_Time (3) changed to
      100,
    Usage: Seek_Error_Rate (7) changed to
      200,

 /dev/disk/by-path/pci-0000:00:1f.2-scsi-3:0:0:0 [SAT] :
    Usage: Calibration_Retry_Count (11) changed to
      100,
    Usage: Load_Retry_Count (223) changed to
      100,

Device: /dev/disk/by-path/pci-0000:00:1f.2-scsi-2:0:0:0 [SAT], Self-Test Log error count increased from 0 to 1

On the following day the SMART log still had suspicious entries in it:

 /dev/disk/by-path/pci-0000:00:1f.2-scsi-0:0:0:0 [SAT] :
    Usage: Seek_Error_Rate (7) changed to
      200,

 /dev/disk/by-path/pci-0000:00:1f.2-scsi-1:0:0:0 [SAT] :
    Usage: Seek_Error_Rate (7) changed to
      200,

 /dev/disk/by-path/pci-0000:00:1f.2-scsi-2:0:0:0 [SAT] :
    Usage: Multi_Zone_Error_Rate (200) changed to
      200,

 /dev/disk/by-path/pci-0000:00:1f.2-scsi-3:0:0:0 [SAT] :
    Usage: Throughput_Performance (2) changed to
      56,

 /dev/disk/by-path/pci-0000:00:1f.2-scsi-4:0:0:0 [SAT] :
    Prefailure: Raw_Read_Error_Rate (1) changed to
      116, 117,
    Usage: ECC_Uncorr_Error_Count (195) changed to
      116, 117,

So, apparently the SATA backplane just brutally powered the drive on immediately, possibly causing the voltage to drop for a moment.

My mistake was perhaps to plug all four drives into the same PSU rail and expect the PSU (albeit an 800W Seasonic with good specifications) to cope with the sudden power draw.

The SATA backplane has two Molex connectors on the back for power. I'll connect them to separate PSU rails to ensure a steadier power supply.

Is there a way to prevent the drive from spinning up immediately as I stick it back into the drive pack?

Also, did I possibly just damage the drives (can it be seen from these log messages)?

Thank you!

A hard drive draws something around 11 watts, so with an 800W PSU you should have no problem.
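
As a back-of-the-envelope check of that claim, here is a small sketch using the figures from this thread (4 drives, roughly 11 W each, 800 W PSU). The function name is made up for illustration, and the comparison is against total PSU capacity only: it says nothing about per-rail limits or the brief peak while a drive spins up, which is why the per-drive wattage is left as a parameter.

```python
def drive_load_fraction(n_drives, watts_per_drive, psu_watts):
    """Fraction of the PSU's total rated power used by the drives."""
    return n_drives * watts_per_drive / psu_watts


if __name__ == "__main__":
    n, per_drive, psu = 4, 11.0, 800.0  # figures quoted in this thread
    frac = drive_load_fraction(n, per_drive, psu)
    print(f"{n * per_drive:.0f} W of {psu:.0f} W = {frac:.1%}")
```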

Some large drive arrays can power up their drives in sequence (staggered spin-up) to avoid potential electrical problems, but that is up to the controller.

Did you try a cold reboot of the server, and is everything fine afterwards? As you said, you heard the other drives spin down and click their heads. That is of course not normal. Maybe the hot-plug backplane is badly manufactured and a short circuit occurred during the hot plug.
