ZFS and cache devices

http://web.archive.org/web/20100911224754/http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

(The Solaris Internals website is no longer up, but WebArchive has a copy.)


Basically there are three types of ZFS cache, all used for both data and metadata.

  • ARC (Adaptive Replacement Cache) - Main memory DRAM cache for reads and writes.
  • L2ARC (Level 2 ARC) - safe read cache: no data loss/service interruption from device failure. Usually SSD based.
  • ZIL (ZFS Intent Log) - safely holds writes on permanent storage which are also waiting in ARC to be flushed to disk. Data should rarely live in this cache for longer than 30 seconds, and it is never read except after a crash, to replay any uncommitted pool writes. On any recent ZFS version, ZIL device failure won't cause data loss (all data is still in ARC), but device failure plus a crash or power outage may cause some writes to be lost.
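
The 30-second figure above suggests a rough way to size a ZIL device: it only has to hold the synchronous writes that can arrive before they are flushed to the pool. The sketch below is back-of-envelope arithmetic, not anything ZFS computes itself. The function name, the headroom multiplier and the 100 MiB/s workload are all assumptions made for illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def zil_capacity_bytes(sync_write_bytes_per_sec, hold_seconds=30.0, headroom=2.0):
    """Rough upper bound on the log space needed for a given sync write rate.

    hold_seconds: how long a write may sit in the ZIL before it is flushed
                  (the ~30 seconds mentioned above).
    headroom:     hypothetical safety multiplier, not a ZFS setting.
    """
    if sync_write_bytes_per_sec < 0 or hold_seconds < 0 or headroom < 1:
        raise ValueError("rate and hold time must be >= 0, headroom >= 1")
    return int(sync_write_bytes_per_sec * hold_seconds * headroom)


if __name__ == "__main__":
    # Assumed workload: 100 MiB/s of sustained synchronous writes.
    needed = zil_capacity_bytes(100 * MIB)
    print(f"{needed / GIB:.2f} GiB")  # prints "5.86 GiB"
```

Even with a generous multiplier the result stays in the single-digit GiB range for this workload.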

Upgrade your ARC first: buy oodles of main memory. Note that L2ARC and ZIL both have overhead allocated out of the ARC too.
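
To get a feel for the L2ARC part of that overhead, here is a minimal sketch that estimates how much ARC is spent on headers for blocks cached in L2ARC. It assumes one in-memory header per cached block. The per-block header size differs between ZFS versions, so the 200 bytes used here is only a placeholder, as are the 8 KiB average block size and the function name.

```python
KIB = 1024
GIB = 1024 ** 3


def l2arc_header_overhead_bytes(l2arc_bytes, avg_block_bytes, header_bytes_per_block):
    """Estimate ARC memory consumed by headers for blocks held in L2ARC.

    Assumes one header per cached block; header_bytes_per_block must be
    looked up for your ZFS version (it is a parameter, not a constant).
    """
    if avg_block_bytes <= 0:
        raise ValueError("avg_block_bytes must be positive")
    if l2arc_bytes < 0 or header_bytes_per_block < 0:
        raise ValueError("sizes must be non-negative")
    blocks = l2arc_bytes // avg_block_bytes
    return blocks * header_bytes_per_block


if __name__ == "__main__":
    # Assumed: 400 GiB of L2ARC, 8 KiB average block, 200-byte header (placeholder).
    overhead = l2arc_header_overhead_bytes(400 * GIB, 8 * KIB, 200)
    print(f"{overhead / GIB:.2f} GiB of ARC")  # prints "9.77 GiB of ARC"
```

Small blocks make the overhead grow quickly, which is one more reason to add main memory before adding a large L2ARC.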

L2ARC is populated by read-cached blocks as they are evicted from ARC. By default ZFS only caches random IO (small reads) into L2ARC; it is not used for streaming workloads unless instructed to. You can basically use any device for this (including a fast 15k HD), but it works best with an SSD that handles many random read IOPS with ease.

ZIL accelerates workloads which require synchronous writes (processes wait for confirmation that writes have actually been committed to disk before continuing execution). ZIL performs a similar role to battery-backed cache on high-end RAID controllers. Although write latency and streaming write IOPS are what define a good ZIL SSD, a ZIL above all else mustn't ever lose any data in the event of power loss. Many suitable devices have a super-capacitor to finalize any pending operations without system power. SLC SSDs with high write endurance (Intel X25-E) used to be recommended, but newer devices use RAM with a battery/supercap to write back to NAND in the event of a power failure. ZILs need not be large, but by using only a small fraction of a large device (e.g. 8GB out of a 300GB Intel 320 MLC SSD) you can yield much higher effective write endurance. 'Enterprise' vendors always recommend mirrored ZILs; my workloads have never been that important.
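
The endurance gain from using only a small slice of a large SSD can be put into rough numbers. The sketch below assumes the drive's controller wear-levels across all of its flash and ignores write amplification, so treat the result as an optimistic ceiling. The rated cycle count is a hypothetical figure, not a specification of the Intel 320 or any other drive, and the function name is made up for this example.

```python
GB = 10 ** 9


def effective_overwrites(device_bytes, used_bytes, rated_pe_cycles):
    """How many times the used slice can be fully rewritten.

    Assumes wear is spread evenly over the whole device, so each full
    rewrite of the slice costs only used/device of one erase cycle.
    """
    if used_bytes <= 0 or device_bytes < used_bytes:
        raise ValueError("need 0 < used_bytes <= device_bytes")
    return rated_pe_cycles * device_bytes / used_bytes


if __name__ == "__main__":
    # 8GB slice of a 300GB device, as in the example above.
    gain = effective_overwrites(300 * GB, 8 * GB, 1)
    print(f"{gain}x")  # prints "37.5x"

    # Hypothetical rating of 5000 program/erase cycles.
    print(effective_overwrites(300 * GB, 8 * GB, 5000))  # prints "187500.0"
```

Using the whole device as the log would give a ratio of 1, i.e. no gain over the rated cycle count.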

As for specific products, STEC made the first SSDs for Sun's Fishworks project (both Logzilla & Readzilla) and has current devices for both ZIL (ZeusRAM $2500/8GB) and L2ARC (Zeus IOPS $3k/400GB), both of which come highly recommended. PCIe-based SSDs are also worth considering, like the ZIL-specific DDRdrive x1 ($2k/4GB) or any big PCIe SSD for L2ARC. Other less performant (read: cheaper) 2.5-inch SSD devices can also offer significant performance gains, especially when used in aggregate for L2ARC.