ZFS - Impact of L2ARC cache device failure (Nexenta)

Solution 1:

ZFS does not perform disk I/O itself; the device drivers below ZFS do. If a device does not respond in a timely manner or, as in this case, disrupts all other devices on the expander, ZFS does not see that as a failure. All ZFS sees is slow I/O.

There is a bug in the Intel X25-M firmware that affects the drives' behaviour under heavy load and can cause reset storms. This problem affects all OSes and cannot be solved at the OS layer. Please contact your hardware supplier for fixes or remediation.

If a read is expected to be satisfied by the L2ARC, then the read will be attempted there. ZFS then relies on the lower-layer drivers to report an error. In this case, the drive continues to reset and retry for as long as 5 minutes before the I/O is declared failed, depending on the driver, device, and default timeout settings. Only after the lower-layer drivers declare the I/O failed will ZFS retry the read on the pool.
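To make the cost of that fallback concrete, here is a rough Python model of the read path described above. It is only a sketch, not ZFS code: SimulatedDevice, cached_read, and the 60-second timeout with 5 retries are hypothetical values chosen so the total matches the "5 minutes" figure. Real driver timeouts and retry counts vary by driver and device.

```python
from dataclasses import dataclass


class DeviceIOError(Exception):
    """Raised by the simulated driver once it gives up on an I/O."""

    def __init__(self, message, elapsed):
        super().__init__(message)
        self.elapsed = elapsed  # seconds spent before the driver gave up


@dataclass
class SimulatedDevice:
    name: str
    service_time_s: float            # time a healthy I/O takes
    healthy: bool = True
    driver_timeout_s: float = 60.0   # assumed per-attempt timeout
    driver_retries: int = 5          # assumed number of attempts

    def read(self, block):
        """Return (data, elapsed_seconds), or raise DeviceIOError."""
        if self.healthy:
            return f"{self.name}:{block}", self.service_time_s
        # A misbehaving device is never reported as failed right away:
        # the driver resets and retries until its attempts are used up.
        elapsed = self.driver_timeout_s * self.driver_retries
        raise DeviceIOError(f"{self.name} gave up after {elapsed:.0f}s", elapsed)


def cached_read(block, l2arc, pool, in_l2arc):
    """Try the L2ARC first; fall back to the pool only after the driver errors."""
    waited = 0.0
    if in_l2arc:
        try:
            return l2arc.read(block)
        except DeviceIOError as err:
            waited = err.elapsed  # the caller was blocked for this whole time
    data, t = pool.read(block)
    return data, waited + t


if __name__ == "__main__":
    bad_ssd = SimulatedDevice("l2arc", 0.0001, healthy=False)
    pool = SimulatedDevice("pool", 0.01)
    data, seconds = cached_read(7, bad_ssd, pool, in_l2arc=True)
    print(data, f"{seconds:.2f}s")
```

In this model the data is still returned correctly from the pool, but only after the full driver timeout has elapsed, which is why the problem shows up as extremely slow I/O rather than as a reported failure.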

NexentaStor's volume-check and disk-check runners look for additional error messages and alert you via email and fault logging. The disk-check runner was improved in the 3.1 release to alert you specifically to the conditions exhibited by SSDs with broken firmware.

Bottom line: your hardware is faulty and will need to be fixed or replaced.

Solution 2:

Are you connecting the X25-M SSD to the backplane? There's a known issue with Nexenta when accessing the L2ARC over a backplane. Your best bet is to connect the SSD directly to a SATA port on the motherboard. Make sure the port is configured to use AHCI as well.

If you're running anything mission critical on this server, I would switch to an SLC SSD (like the X25-E or a STEC SSD). That being said, you'll probably be OK with the X25-M if the server is not mission critical.