zfs send -i / receive stalling
On a Solaris 11.1 install, I see stalls when receiving a zfs incremental stream. The streams originate from a Solaris 11.0 install, are created using zfs send -i, and are piped through mbuffer.
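For reference, the pipeline looks roughly like this (filesystem, snapshot and host names are placeholders, and the mbuffer options shown are only examples, not my exact settings):
# receiving side (placeholder names, example buffer sizes)
root@receiver:~# mbuffer -s 128k -m 1G -I 9090 | zfs receive -F tank/some/filesystem
# sending side
root@sender:~# zfs send -i tank/some/filesystem@snap-old tank/some/filesystem@snap-new | mbuffer -s 128k -m 1G -O receiver:9090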
At some point, I can see write performance stall occasionally (virtually no read or write activity on the destination disks, mpstat reporting utilization on the same or different cores summing up to slightly above 100% for "system", with no other load):
root@receiver:~# mpstat 5
[...]
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 363 103 213 0 9 4 0 14 0 6 0 94
1 0 0 0 113 3 149 0 11 6 0 16 0 12 0 88
2 1 0 2 40 4 36 0 9 5 0 3 0 1 0 99
3 0 0 0 82 59 18 0 2 3 0 3 0 93 0 7
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 3 0 0 362 104 207 0 9 6 0 12 0 14 0 86
1 0 0 0 90 4 109 0 10 7 0 3 0 17 0 83
2 0 0 0 48 2 40 0 9 3 0 5 0 10 0 90
3 0 0 0 100 54 44 0 7 3 0 4 0 69 0 31
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 332 103 35 0 6 3 0 0 0 45 0 55
1 0 0 0 27 3 22 0 3 1 0 0 0 45 0 55
2 0 0 0 142 3 172 0 9 6 0 4 0 18 0 82
3 0 0 0 181 56 178 0 10 6 0 8 0 1 0 99
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 327 103 54 0 5 2 0 2 0 49 0 51
1 0 0 0 46 3 52 0 9 3 0 0 0 28 0 72
2 0 0 0 156 2 175 0 9 5 0 4 0 25 0 75
3 0 0 0 153 62 132 1 8 6 0 5 0 3 0 97
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 1 380 103 164 0 11 9 0 7 0 5 0 95
1 0 0 0 144 3 165 0 11 9 0 6 0 25 0 75
2 0 0 0 39 1 36 0 6 4 0 2 0 66 0 34
3 0 0 0 125 77 55 0 10 6 0 1 0 14 0 86
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 372 107 178 0 9 9 0 19 0 3 0 97
1 0 0 0 145 3 193 0 9 10 0 8 0 8 0 92
2 0 0 0 24 2 26 0 9 3 0 0 0 2 0 98
3 0 0 0 94 86 3 0 0 5 0 0 0 100 0 0
After another 200-300 MB written in bursts, the transfer nearly comes to a halt:
root@receiver:~# zpool iostat tank 5
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 1.85T 2.01T 511 831 7.76M 4.60M
tank 1.85T 2.01T 0 26 0 65.4K
tank 1.85T 2.01T 3 25 13.2K 62.4K
tank 1.85T 2.01T 4 34 5.00K 76.2K
tank 1.85T 2.01T 3 32 3.10K 87.0K
tank 1.85T 2.01T 0 25 1.20K 59.0K
tank 1.85T 2.01T 6 58 4.30K 339K
tank 1.85T 2.01T 1 28 3.70K 63.1K
tank 1.85T 2.01T 19 30 36.9K 83.2K
And the mbuffer instances at both the receiving and the sending end are at 100% utilization.
This only seems to happen with larger (> 5 GB) snapshots; I see the stalls towards the end of the stream, after the first few GB have been transferred in a reasonable time. The stream reception is still working, just very slowly - the data rate is below 100 KB/s (from the receiving-side mbuffer's memory to the idle disks).
I have also tried taking mbuffer out of the equation and piping the zfs send stream through SSH directly into zfs receive, but it makes no significant difference. The snapshot eventually gets transferred, but the last 1-2 GB take hours.
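For completeness, the direct variant looked roughly like this (again with placeholder names):
# placeholder filesystem/snapshot names, same transfer as above
root@sender:~# zfs send -i tank/some/filesystem@snap-old tank/some/filesystem@snap-new | ssh receiver zfs receive -F tank/some/filesystem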
The receiving host is otherwise completely idle; no messages are printed to the console, the kernel message log or /var/log/syslog at that time. The destination zpool remains usable - I can still access other filesystems located in the same zpool at all times. Also, receiving complete, non-incremental / non-recursive zfs streams of hundreds of GB in size works at wire speed without any perceptible issues.
Is this a known problem with 11.1? How can I further diagnose what the receiver is waiting for when it should be writing the stream?
Solution 1:
Both the zfs send and the zfs receive are tied to each other when you pipe them together like this. The source system has to crawl the zfs metadata looking for blocks that were written within the incremental interval you are sending. You're then piping that to mbuffer, so that the stream across the ssh session can be somewhat optimized by presenting a bucket on each end that can mitigate stalling and overruns.

Then the pipe from mbuffer feeds data into a zfs receive, which has to process the incoming data exactly as if it were being written locally: X transactions per transaction group, flush to disk, calculate the metadata, and write it all out, all the way up to the uberblock. This can look very much like stalls in the stream, and these generally last from 5 to 30 seconds. If the drops in throughput last longer than 30 seconds, you're probably resource bound somewhere.
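If you want to see whether the pauses actually line up with transaction group commits, a rough DTrace sketch along these lines (assuming the fbt provider exposes spa_sync on your kernel build) will show how long each sync takes:
# assumption: fbt::spa_sync is traceable on this kernel; adjust if not
root@receiver:~# dtrace -n 'fbt::spa_sync:entry { self->t = timestamp; }
  fbt::spa_sync:return /self->t/ { @["spa_sync time (ms)"] = quantize((timestamp - self->t) / 1000000); self->t = 0; }'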
How is your system tuned? For example, do you have a fast ZIL SLOG? Or do you have a very large number of spindles behind your zpool to make logbias=throughput worthwhile? Depending on the answers to questions like that, you'll be able to determine whether you are resource bound somewhere or not.
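Both are quick to check on the receiving pool (tank, going by your iostat output):
root@receiver:~# zpool status tank     # a "logs" section lists any dedicated SLOG devices
root@receiver:~# zfs get logbias tank  # latency (the default) vs. throughput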
Your CPUs don't look too badly beat up. I see servers on a daily basis with zpools sized at 250+ TB whose mpstat intr column counts run above 20,000. More CPUs will always improve the mpstat numbers.
I would look at some of the DTrace scripts like zilstat, arcstat, and iopattern, among others (check the DTraceToolkit), to see what the system is doing during your pauses.
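For example, running them with a 5-second interval while a stall is in progress should make it obvious whether the ZIL, the ARC or the I/O pattern is the outlier (paths and exact script names depend on where you unpacked the toolkit and which versions you have):
root@receiver:~# ./zilstat 5
root@receiver:~# ./arcstat.pl 5    # or arcstat, depending on the version
root@receiver:~# ./iopattern 5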