FreeNAS with ZFS and TLER/ERC/CCTL

I am currently in the process of building a new storage server, to be used for virtual machines, files, and backups. The OS is FreeNAS, which uses ZFS for software RAID.

My problem is that I need to choose hard drives. I have looked at both consumer and enterprise hard drives, but I am faced with a question which I have not been able to find a clear answer to.

Can I use consumer hard drives that do not support TLER/ERC/CCTL with ZFS (software RAID) without getting into trouble later on, or do I need to go with enterprise hard drives that support TLER/ERC/CCTL?

There are a lot of different opinions about this: some say you should use it and some say you should not. I know of a couple of operating systems that rely on software RAID, have their own error-recovery timeout, and therefore do not care whether TLER/ERC/CCTL is present. I am aware that you need TLER/ERC/CCTL when dealing with hardware RAID.

I really hope that someone can shed some light on this.

Thanks.


Can I use consumer hard drives that do not support TLER/ERC/CCTL with ZFS (software RAID) without getting into trouble later on, or do I need to go with enterprise hard drives that support TLER/ERC/CCTL?

Imagine each of your drives as a black box with certain features, a certain lifetime, and so on. All of them work independently inside your storage array, so you have to look at each of them independently to see what would happen in different cases.

Example

For this question, let's say you have a pool consisting of two mirrored drives, A and B (a minimal sketch of creating such a pool follows the list below). The possible combinations are:

  1. A and B have TLER
  2. A has it, B does not
  3. A does not have it, but B does
  4. A and B do not have it

If everything works fine on all disks, there is no problem.

One error:

If disk A experiences an error when trying to read a block, this is the situation for each possible case:

  1. The system asks the disk for the block. The disk tries to find it again and again, and after about 7 to 9 seconds (whatever the TLER timeout is; see the sketch after this list for how to query it) the controller drops the disk from the array. The system notices that one disk is missing and does whatever you have provisioned (raise an email alert, begin resilvering of a hotspare, do nothing and continue degraded, ...)
  2. Same as 1.
  3. The system asks the disk for the block. The disk tries to find it again and again, until its own timeout is reached or until the block is found. This timeout may be several minutes and may differ between models and manufacturers. If the block can be retrieved, it is read normally; if it fails, an error message for the block itself is returned.
  4. Same as 3.
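Whether a given drive actually supports TLER/ERC/CCTL, and what its current timeout is, can be checked with smartmontools (which ships with FreeNAS). A minimal sketch, assuming the drive shows up as /dev/ada0:

```sh
# Query SCT Error Recovery Control, the ATA feature behind TLER/ERC/CCTL.
# Supporting drives report their read/write recovery limits in tenths of a
# second (e.g. "Read: 70 (7.0 seconds)"); others report it as unsupported
# or disabled.
smartctl -l scterc /dev/ada0
```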

Two errors (one on each disk):

This case is very rare, but theoretically possible. Both disks experience an error on exactly the same block:

  1. Both disks will be dropped and your pool becomes unavailable.
  2. Same as 3 from above.
  3. Same as 3 from above.
  4. Same as 3 from above.

Performance vs reliability

As you have seen above, you have to make a choice depending on your goal and pool layout.

  • Use TLER disks if you always need minimal response time and cannot accept long blocking lag. The downside is that you may have to provision additional disks (Z2 instead of Z1, Z3 instead of Z2, 3-way mirrors instead of 2-way mirrors) or hotspares to achieve the same average pool health in the end. You also have to plan for potentially long rebuild times, which may affect your overall performance negatively.
  • Use non-TLER disks if budget or space is constrained and occasional lag is acceptable in exchange for keeping disks in the pool. As ZFS already gives you self-healing for affected blocks, you do not need what TLER was originally envisioned for (the controller dropping a disk so that recovery could start).
  • Set the disk timeout to a value that your application/architecture can comfortably handle, as sketched below. This way you can have enterprise disks without dropping, or consumer disks with dropping, as you prefer. Not all disks allow the timeout to be changed, so check online before buying.
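As a sketch of that last point, on drives that allow it the ERC timeout can be changed at runtime with smartctl. The device name and values below are only examples, and the setting is usually lost on power cycle, so it has to be reapplied at boot:

```sh
# Set the read and write recovery limits to 7 seconds
# (values are given in tenths of a second).
smartctl -l scterc,70,70 /dev/ada0

# Or disable ERC entirely so the drive retries for as long as it wants
# (consumer-style behaviour).
smartctl -l scterc,0,0 /dev/ada0
```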

Also worth considering are the figures Backblaze published about enterprise drive lifetimes and consumer drive reliability.

Although they use a custom Reed-Solomon implementation, their figures (and business model) suggest that consumer drives are definitely able to provide good reliability and, with a suitable error detection/recovery algorithm, can provide good protection for your data. Certainly, their entire setup seems to have fared pretty well without any of these enterprise features.

So as @user121391 says (I may be paraphrasing a little), ZFS on consumer disks should be fine, unless you have specific needs that would require enterprise features.