Better performance when HDD write cache is disabled? (HGST Ultrastar 7K6000 and Media Cache behavior)

Fair warning: this is a long read.
During initial performance tests of the Hitachi Ultrastar 7K6000 drives that I'm planning to use in my Ceph setup, I've noticed a strange thing: write performance is better when the disk write cache is disabled.


I use fio:

fio --filename=/dev/sda --direct=1 --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=4krandw
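
A quick note on the flags, since they matter for interpreting the results (descriptions per fio's man page):

# --direct=1   open the device with O_DIRECT, bypassing the Linux page cache
# --sync=1     open with O_SYNC, so each write returns only once the kernel
#              considers the data stable on the device
# --iodepth=1  a single outstanding I/O, i.e. a pure commit-latency test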

When write cache is disabled:

hdparm -W 0 /dev/sda 

4krandw: (groupid=0, jobs=1): err= 0: pid=6368: Thu Jun 22 07:36:44 2017
  write: io=63548KB, bw=1059.9KB/s, iops=264, runt= 60003msec
    clat (usec): min=473, max=101906, avg=3768.57, stdev=11923.0

When write cache is enabled:

hdparm -W 1 /dev/sda

4krandw: (groupid=0, jobs=1): err= 0: pid=6396: Thu Jun 22 07:39:14 2017
  write: io=23264KB, bw=397005B/s, iops=96, runt= 60005msec
    clat (msec): min=1, max=48, avg=10.30, stdev= 4.12

Relevant hardware details:

  • Server: Supermicro 5018D8-AR12L
  • Storage controller: onboard LSI 2116 in IT mode (plain pass-through HBA, no caching or logical volume management)
  • Disks: Hitachi Ultrastar 7K6000 4 TB (HUS726040ALE614)
  • OS: Ubuntu 16.04.2, kernel 4.4.0-81-generic

Unfortunately, I cannot think of any reasonable explanation for this behaviour. Quick summary:

  • Write cache disabled: 264 IOPS, 3.77 ms average commit latency (high standard deviation, though)
  • Write cache enabled: 96 IOPS, 10.3 ms average commit latency

UPD: I have tested the disk connected directly to a SATA port on the motherboard (a separate SATA controller, not the LSI 2116) and nothing changed; the results are the same. So I presume it is not the LSI 2116 controller causing these strange results.

UPD2: Interestingly, the performance gain with cache disabled is smaller for sequential operations, but still consistent. Here's an example:

fio --filename=/dev/sdl --direct=1 --sync=1 --rw=write --bs=16M --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=16M-wr 


Write cache enabled:

16M-wr: (groupid=0, jobs=1): err= 0: pid=2309: Fri Jun 23 11:52:37 2017
  write: io=9024.0MB, bw=153879KB/s, iops=9, runt= 60051msec
    clat (msec): min=86, max=173, avg=105.37, stdev= 9.64

Write cache disabled:

16M-wr: (groupid=0, jobs=1): err= 0: pid=2275: Fri Jun 23 11:45:22 2017  
  write: io=10864MB, bw=185159KB/s, iops=11, runt= 60082msec
    clat (msec): min=80, max=132, avg=87.42, stdev= 6.84

And this is where it gets interesting, because the difference between the cache-enabled and cache-disabled results is exactly what HGST claims in their datasheet:
https://www.hgst.com/sites/default/files/resources/Ultrastar-7K6000-DS.pdf

Compared to prior generation 7K4000:
...
  • Up to 3X faster random write performance using media cache technology
  • 25% faster sequential read/write performance

It still does not explain why performance is better with the write cache disabled; however, it does look like with the write cache enabled I get performance comparable to the previous-generation 7K4000. Without the write cache, random write performance is ~2.75x faster and sequential is ~1.2x faster.
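
Checking that claim against my own numbers above:

# Random 4k,  QD=1:  264 IOPS    (-W0)  vs   96 IOPS    (-W1)  ->  264/96        = 2.75x
# Seq. 16M,   QD=1:  185159 KB/s (-W0)  vs  153879 KB/s (-W1)  ->  185159/153879 = 1.20x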

UPD3 (hypothesis): Newer Hitachi Ultrastar drives have a feature called Media Cache. It is an advanced non-volatile caching technique, and here's how it works (as I understand it, of course):

  • First, data is written into the DRAM cache.
  • Next, the drive has many reserved areas on each platter, physically located where access is fastest. These areas are essentially the Media Cache storage and serve as a non-volatile second-stage cache: data from the DRAM buffer is accumulated and flushed into the Media Cache at high queue depth, which minimizes head movement and provides both additional reliability and a speed gain.
  • Only after that is the data written to its final location on the platter (data path sketched below).
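
The write path as I picture it (my own sketch, not taken from vendor documentation):

  host write -> DRAM buffer        (volatile staging)
             -> Media Cache zones  (non-volatile, written sequentially)
             -> final LBA          (actual location on the platter)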

So, Media Cache is a two-stage writeback cache, and I think a write is considered complete only after the flush to the Media Cache is done.
Interesting technique, I must admit. My hypothesis is that when we disable write caching with hdparm -W0, only the Media Cache is disabled.
Data is then cached only in DRAM and flushed directly to the platters. Media Cache surely provides a great advantage, but during synchronous writes we have to wait for the write to the Media Cache area to complete, whereas with the Media Cache disabled a write is considered complete as soon as the data is in the drive's DRAM buffer. Much faster. At low queue depths the DRAM cache provides enough space to absorb writes without speed degradation; at larger queue depths, when MANY flushes to the platter have to happen constantly, the situation is different. I have performed two tests with QD=256.

fio --filename=/dev/sda --direct=1 --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=256 --runtime=180 --time_based --group_reporting --name=4krandwrite

hdparm -W0 /dev/sda (write cache disabled)
4krandwrite: (groupid=0, jobs=1): err= 0: pid=3176: Wed Jun 28 10:11:15 2017
  write: io=62772KB, bw=357093B/s, iops=87, runt=180005msec
    clat (msec): min=1, max=72, avg=11.46, stdev= 4.95

hdparm -W1 (write cache enabled)
4krandwrite: (groupid=0, jobs=1): err= 0: pid=3210: Wed Jun 28 10:14:37 2017
  write: io=70016KB, bw=398304B/s, iops=97, runt=180004msec
    clat (msec): min=1, max=52, avg=10.27, stdev= 3.99

So we clearly see that at QD=256, enabling the write cache gives an ~11.5% advantage in IOPS and in commit latency. It looks like my hypothesis is correct and hdparm controls only the Media Cache, not the DRAM buffer. And at higher queue depths the Media Cache really pays for itself.
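
As a sanity check of what the drive itself reports, the cache setting can be queried from both the ATA and SCSI sides (as far as I can tell, neither interface exposes the Media Cache state directly):

hdparm -W /dev/sda         # with no value, -W queries the current write-cache setting
sdparm --get=WCE /dev/sda  # reads the Write Cache Enable bit from the SCSI caching mode page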

This is not the case for sequential operations, though.

fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=16M --numjobs=1 --iodepth=256 --runtime=180 --time_based --group_reporting --name=16Mseq

hdparm -W0 /dev/sda (write cache disabled)
16Mseq: (groupid=0, jobs=1): err= 0: pid=3018: Wed Jun 28 09:38:52 2017
  write: io=32608MB, bw=185502KB/s, iops=11, runt=180001msec
    clat (msec): min=75, max=144, avg=87.27, stdev= 6.58

hdparm -W1 /dev/sda (write cache enabled)
16Mseq: (groupid=0, jobs=1): err= 0: pid=2986: Wed Jun 28 09:34:00 2017
  write: io=27312MB, bw=155308KB/s, iops=9, runt=180078msec
    clat (msec): min=83, max=165, avg=104.44, stdev=10.72

So, I guess, Media Cache provides a speed advantage for random write loads; for sequential writes it may serve mainly as an additional reliability mechanism.



UPD4 (looks like I've got an answer):
I have contacted HGST support, and they clarified that on the 7K6000 the Media Cache is active only when the (DRAM) write cache is disabled. So it looks like at low queue depths the Media Cache is actually faster than the DRAM cache. I guess this is because the Media Cache lets the drive write data sequentially into its cache areas irrespective of the I/O pattern, which greatly reduces head movement and leads to better performance. I would still like to know more about Media Cache, so I am not answering my own question yet. Instead, I've asked support for more technical details on Media Cache and will update this question if I get any.
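
For anyone wanting to reproduce this, both runs can be scripted back-to-back (same fio flags as above; /dev/sdX is a placeholder, and remember the test writes destructively to the raw device):

for wc in 0 1; do
    hdparm -W$wc /dev/sdX
    fio --filename=/dev/sdX --direct=1 --sync=1 --rw=randwrite --bs=4k \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based \
        --group_reporting --name=4krandw-W$wc
done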


I will still greatly appreciate any suggestions, comments, or alternative explanations. Thanks in advance!


Solution 1:

It seems that recent HGST drives behave differently, with hdparm -W0|1 controlling both the DRAM cache and MediaCache. Moreover, MediaCache seems to be active on WCE/-W1 (cache enabled) rather than on WCD/-W0 (cache disabled).

Let's see how this HGST HUS722T2TALA604 disk behaves in some fio runs.

disabled caches (hdparm -W0) and direct writes

[root@singularity ~]# fio --name=test --filename=/dev/sda --io_size=4M --direct=1 --rw=randwrite
...
write: IOPS=73, BW=295KiB/s (302kB/s)(4096KiB/13908msec)
...

disabled caches (hdparm -W0), direct + sync writes

[root@singularity ~]# fio --name=test --filename=/dev/sda --io_size=4M --direct=1 --sync=1 --rw=randwrite
...
write: IOPS=73, BW=295KiB/s (302kB/s)(4096KiB/13873msec)
...

enabled caches (hdparm -W1), direct + sync writes

[root@singularity ~]# fio --name=test --filename=/dev/sda --io_size=4M --direct=1 --sync=1 --rw=randwrite
...
write: IOPS=127, BW=510KiB/s (523kB/s)(4096KiB/8027msec)
...

Considerations:

  1. From the direct vs direct+sync runs with caches disabled, we can see that hdparm -W0 disables both the DRAM buffer and MediaCache; otherwise, the direct results would be significantly higher than the direct+sync ones. These results are perfectly in line with a seek-constrained 7200 RPM disk, at ~70 IOPS.

  2. With caches enabled, performance is much better, with IOPS almost doubling. Since --sync=1 rules out the DRAM buffer alone as the source of that speedup, it must be MediaCache at work here.

So, while some other NVRAM technologies operate with the WCD/-W0 (write cache disabled) setting, it seems that MediaCache requires WCE/-W1 to work.
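
Side note: whichever setting you pick, -W is not guaranteed to persist across power cycles on every drive. On Debian/Ubuntu it can be re-applied at boot via /etc/hdparm.conf, along these lines (the by-id path and serial are placeholders for your own disk):

/dev/disk/by-id/ata-HGST_HUS722T2TALA604_XXXXXXXX {
    write_cache = on
}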