Ext4 usage and performance

I've got a cluster of machines running Carbon and Graphite that I need to scale for more storage, but I'm not sure if I need to scale up or out.

The cluster currently consists of:

  • 1 Relay Node: Receives all metrics and forwards them to the relevant storage node (see the relay config sketch below)
  • 6 Storage Nodes: House all the Whisper DB files
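
For reference, the relay fans metrics out to the storage nodes via carbon-relay; a minimal sketch of what the relevant carbon.conf section can look like with consistent hashing (hostnames, port, and instance names here are illustrative, not the literal production config):

[relay]
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 1
DESTINATIONS = storage-01:2004:a, storage-02:2004:a, storage-03:2004:a, storage-04:2004:a, storage-05:2004:a, storage-06:2004:a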

The problem is that once the disks got into the neighbourhood of 80% usage, performance fell off a cliff. Cluster write IOPS dropped from a near-constant 13k to a more chaotic average of around 7k, and IOwait time now averages 54%.

I've had a look through our config repo and there are no changes since early April, so this isn't the result of a config change.

Question: Will increasing the disk size bring IO performance back under control, or do I need to add more storage nodes?

Note: No SSDs here, just lots and lots of spindles.

Relevant Graphs:

[Graphs: disk usage, disk IOPS, CPU, carbon cache, metrics per second]

Stats and Stuff:

e2freefrag:

[root@graphite-storage-01 ~]# e2freefrag /dev/vda3
Device: /dev/vda3
Blocksize: 4096 bytes
Total blocks: 9961176
Free blocks: 4781849 (48.0%)

Min. free extent: 4 KB
Max. free extent: 81308 KB
Avg. free extent: 284 KB
Num. free extent: 19071

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    4K...    8K-  :          4008          4008    0.08%
    8K...   16K-  :          1723          3992    0.08%
   16K...   32K-  :           703          3495    0.07%
   32K...   64K-  :           637          7400    0.15%
   64K...  128K-  :          1590         29273    0.61%
  128K...  256K-  :          4711        236839    4.95%
  256K...  512K-  :          2664        265691    5.56%
  512K... 1024K-  :          2359        434427    9.08%
    1M...    2M-  :           595        213173    4.46%
    2M...    4M-  :            75         49182    1.03%
   64M...  128M-  :             6        118890    2.49%

e4defrag:

[root@graphite-storage-01 ~]# e4defrag -c /dev/vda3
<Fragmented files>                             now/best       size/ext
1. /opt/graphite/storage/graphite.db            17/1              4 KB
2. /var/log/cron                                13/1              4 KB
3. /var/log/wtmp                                16/1              4 KB
4. /root/.bash_history                           4/1              4 KB
5. /var/lib/rpm/Sha1header                      10/1              4 KB

 Total/best extents                             182256/159981
 Average size per extent                        183 KB
 Fragmentation score                            2
 [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
 This device (/dev/vda3) does not need defragmentation.
 Done.

iostat:

[root@graphite-storage-01 ~]# iostat -k -x 60 3
Linux 3.10.0-229.7.2.el7.x86_64 (graphite-storage-01)     07/05/2016      _x86_64_        (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.99    0.00    2.54   29.66    0.35   59.46

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00   100.34  177.48 1808.94  2715.66  7659.19    10.45     0.26    0.13    0.65    0.08   0.23  46.14

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.17    0.00    7.00   73.21    0.58   13.04

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00    23.87  672.40  656.47  8729.87  2752.27    17.28     7.36    5.50    2.72    8.35   0.73  96.83

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.06    0.00    7.31   73.03    0.59   12.01

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00    42.68  677.67  614.88  8634.93  2647.53    17.46     6.66    5.15    2.72    7.83   0.74  96.08

df:

[root@graphite-storage-01 ~]# df
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/vda3       39153856 33689468   3822852  90% /
devtmpfs         1933092        0   1933092   0% /dev
tmpfs            1941380        0   1941380   0% /dev/shm
tmpfs            1941380   188700   1752680  10% /run
tmpfs            1941380        0   1941380   0% /sys/fs/cgroup
/dev/vda2         999320     2584    980352   1% /tmp
[root@graphite-storage-01 ~]# df -i
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/vda3      2490368 239389 2250979   10% /
devtmpfs        483273    304  482969    1% /dev
tmpfs           485345      1  485344    1% /dev/shm
tmpfs           485345    322  485023    1% /run
tmpfs           485345     13  485332    1% /sys/fs/cgroup
/dev/vda2        65536     22   65514    1% /tmp

Edit: I've resized one of the storage nodes, but it hasn't had an effect. I've also found the cachestat utility in Brendan Gregg's perf-tools collection (https://github.com/brendangregg/perf-tools), which has given me a look inside the VFS cache. It looks like I've reached the limit of the IO throughput my storage can provide.

At this point I think I'm either going to have to continue to scale out to more cluster members, or see about finding a more write-efficient time-series storage solution.
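
For reference, cachestat was run roughly like this (a sketch; the script lives under fs/ in the perf-tools checkout, paths may vary by version, and it needs root for ftrace):

git clone https://github.com/brendangregg/perf-tools.git
cd perf-tools/fs
./cachestat 60    # one line of page-cache hit/miss/dirty counters per interval (60 s here is illustrative)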

Example output from cachestat:

storage-01 [resized disk]
    HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
    9691    14566     7821    40.0%          160       2628
   36181    14689     7802    71.1%          160       2631
    8649    13617     7003    38.8%          159       2628
   15567    13399     6857    53.7%          160       2627
    9045    14002     7049    39.2%          160       2627
    7533    12503     6153    37.6%          159       2620

storage-02 [not resized]
    HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
    5097    11629     4740    30.5%          143       2365
    5977    11045     4843    35.1%          142       2344
    4356    10479     4199    29.4%          143       2364
    6611    11188     4946    37.1%          143       2348
   33734    14511     5930    69.9%          143       2347
    7885    16353     7090    32.5%          143       2358

Super Late Edit: We've since migrated to another platform where SSDs are available and, while things were good for some time, we eventually saw the same sharp decline in performance as we added more and more metrics. While I don't have any definitive proof, I believe this is a corner case arising from the interaction between how Carbon/Whisper storage works and the sheer number of metrics we store.

Basically, as long as the system has enough RAM to comfortably cache the Whisper files for reads, the IO is almost pure write and everything is happy. However, once FS cache starvation sets in and Whisper files have to be continually read back in off disk, those reads eat into your IO bandwidth and everything starts going to pot.
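
A rough way to check whether a node has crossed into that regime is to compare the on-disk size of the Whisper tree against the RAM available for page cache (the path below is the Graphite default and may differ):

du -sh /opt/graphite/storage/whisper    # total size of the Whisper files on this node
free -m                                 # RAM available for the page cache
# If the Whisper tree is several times larger than what can be cached, reads keep
# hitting the disks and show up as a climbing r/s column in iostat -x.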


Solution 1:

Sounds like you're running SSDs, which can have some funky performance characteristics as they get full. The fact that performance didn't go back to normal when usage dropped around 6/1 reinforces that theory.

The reason behind it is all rather complicated, but it basically comes down to the need to blank out written-but-currently-unused chunks of flash before they can be written again. It looks like you're writing pretty hard, so the blanking process running in the drive doesn't have a chance to keep up a sufficient supply of blanked chunks once they've all been written to once.

Different models of drive have different controllers and different amounts of "spare" flash to use, and bigger drives obviously have more chunks to write before they run out of fresh bits, so it's almost certain that upgrading to larger drives would "solve" the problem for you, at least temporarily. "Enterprise"-grade drives tend to do better in this regard, but so do newer flash controllers, so it's a bit of a crapshoot in the absence of reliable third-party testing of a particular drive model in a usage pattern similar to your own.

You might also be able to get away with using the drives you have now for a while longer if you wave something like fstrim over them to tell the drive "you can definitely wipe all of these chunks right now", although running it on a system that needs to be doing other things at the same time might not go down so well (you'll want to take note of the performance warnings in the fstrim manpage).
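
A sketch of what that looks like in practice, checking first whether the device advertises discard support at all (run as root, and mind the manpage caveats mentioned above):

lsblk --discard     # non-zero DISC-GRAN / DISC-MAX suggests the device supports TRIM
fstrim -v /         # trim free space on the root filesystem, reporting how much was discarded
# fstrim -av trims every mounted filesystem that supports it; schedule it for a quiet period.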

As to whether you need more nodes, I can't say for sure but I don't think so. CPU doesn't look out of control, and I doubt you'd be saturating the I/O system elsewhere.

Solution 2:

Ext3/4 are well known to suffer, from a performance standpoint, at utilization above 80-85%. This is due to increased fragmentation and reduced writeback performance.
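
As a quick check of how much real headroom the filesystem has (ext4 normally reserves around 5% of blocks for root, which tightens things further as the disk fills), something like this against the same device should do:

tune2fs -l /dev/vda3 | grep -Ei 'block count|free blocks'
# Compare "Block count", "Reserved block count" and "Free blocks" to see how close
# the allocator really is to running out of space it can use.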

Can you provide two iostat -k -x 60 3 outputs, one when under 80% capacity and one when over 80%?

EDIT: from your e2freefrag it seems /dev/vda3 has plenty of free space. Can you add the output of df and df -i?

Anyway, your iostat results, combined with your graphs (especially "Disk IOPS"), are quite interesting. It seems your workload is very write-centric; when >95% of total issued IOPS are writes, you have no problem. However, when your performance degrades, your disks begin serving a consistent share of read IOPS. This intermixing of reads and writes disrupts the disks' ability to combine multiple smaller writes into bigger ones (reads are typically blocking operations), leading to much slower performance.

For example, let's look at the first result shown by iostat: when total disk IOPS are dominated by writes (as in this case), your avgqu-sz and await are both very low.

But in the second and third iostat samples we see many more reads which, being blocking/stalling operations (see the rrqm/s column: it shows 0, so no reads can be merged in your case), disrupt both latency (await) and throughput (KB/s).
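
To put numbers on it from the output above: in the first sample reads are 177.48 / (177.48 + 1808.94) ≈ 9% of IOPS, while in the second and third they jump to 672.40 / (672.40 + 656.47) ≈ 51% and 677.67 / (677.67 + 614.88) ≈ 52%, with avgqu-sz going from 0.26 to ~7 and %util from 46% to ~96%.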

I've seen similar behavior when the host runs out of inode cache, maybe due to the sheer number of small files stored. To tune your system to prefer inode/dentry cache at the expense of data cache, try issuing echo 10 > /proc/sys/vm/vfs_cache_pressure and waiting a few minutes: does it change anything?
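
If it does help, you can apply the setting via sysctl and keep an eye on the dentry/inode slabs while testing; a sketch (10 is a starting value to experiment with, and the file name below is arbitrary):

sysctl -w vm.vfs_cache_pressure=10          # apply at runtime
slabtop -o | grep -E 'dentry|inode'         # snapshot of dentry/inode slab usage
echo 'vm.vfs_cache_pressure = 10' > /etc/sysctl.d/99-vfs-cache.conf   # persist across reboots
sysctl -p /etc/sysctl.d/99-vfs-cache.conf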