How can I tell if a disk is failing on ESXi / what do these errors mean?

I have a server running VMware ESXi v4.1.0 348481. It has a hardware RAID10 and a SATA backup drive. I have a VM running which has it's primary boot vmdk on the RAID10 datastore, and a 600 GB vmdk on the SATA backup drive's datastore. The VM runs Debian linux with the FreeBSD kernel, and uses ZFS for the backup drive.

EDIT: The drive is not directly attached to the VM. It is used as a VMware Datastore, and the VM has a vmdk on the SATA drive's datastore. The datastore is not full (only 65% full)

I logged in to the server using SSH and found that last night backup was hung, and zfs list or zpool list both hung. So I opened the virtual console in ESXi and was sad to see:

This screenshot makes me sad

(da1:mpt0:0:1:0): READ(10). CDC: 28 0 19 97 3a 50 0 0 2d 0
(da1:mpt0:0:1:0): CAM status: SCSI Status Error
(da1:mpt0:0:1:0): SCSI status: Check Condition
(da1:mpt0:0:1:0): SCSI sense: MEDIUM ERROR info:4862ec asc:11,4 (Unrecovered read error - auto reallocate failed)
(da1:mpt0:0:1:0): READ(10). CDC: 28 0 19 97 3a 50 0 0 2d 0
(da1:mpt0:0:1:0): CAM status: SCSI Status Error
(da1:mpt0:0:1:0): SCSI status: Check Condition
(da1:mpt0:0:1:0): SCSI sense: MEDIUM ERROR info:4862ec asc:11,4 (Unrecovered read error - auto reallocate failed)

I tried to reboot the VM and I received a message that the system was going down for reboot, and then that hung. (^C appears but does not kill shutdown). I cannot interrupt or kill -9 the zpool list zfs list or rsync processes -- Nothing happens when I try.

  1. Does this ndicate the backup SATA drive is failing? Or could this just be an ESXi error?
  2. How in the vSphere client could I tell if the drive is failing? I didn't see any indication, everything under Hardware Health Status looks good, and I saw nothing under the Storage config.
  3. How should I proceed from here? Should I just hard reboot the VM?

UPDATE: I just hard rebooted the VM. After it came back online, the backup zpool was online, however:

root@timestandstill:/home/jnet# zpool status -v
  pool: backup
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        backup      ONLINE       0     0     0
          da1       ONLINE       0     0     0


errors: Permanent errors have been detected in the following files:

        /backups/someserver/home/someuser/public_html/somedir/calendar/someuser/calendars/somefile.ics

I am leaning heavily towards replacing the drive...


Those errors do indeed indicate one of the SATA drives has encountered enough bad blocks to run out of block-reallocation space. That drive is bad, and needs to be replaced. I'm not sure where in the vSphere client that information is displayed, but the log entries are pretty clear.

If your hardware allows it, a hot-swap should be doable. Otherwise, you're going to have to shut everything down to perform the change-out. If that VM hasn't shut itself down after 30 minutes, it's time to just hard-terminate it. It's risky, but if it is truly hung there isn't much for it.