zfs send -i / receive stalling

On a Solaris 11.1 install, I see stalls when receiving an incremental ZFS stream. The streams originate from a Solaris 11.0 install, are created using zfs send -i, and are piped through mbuffer. At some point during the transfer, write performance stalls intermittently (virtually no read or write activity on the destination disks, mpstat reporting utilization on the same or different cores summing up to slightly above 100% in "sys", with no other load):

root@receiver:~# mpstat 5
[...]
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0    0   363  103  213    0    9    4    0    14    0   6   0  94
   1    0   0    0   113    3  149    0   11    6    0    16    0  12   0  88
   2    1   0    2    40    4   36    0    9    5    0     3    0   1   0  99
   3    0   0    0    82   59   18    0    2    3    0     3    0  93   0   7
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    3   0    0   362  104  207    0    9    6    0    12    0  14   0  86
   1    0   0    0    90    4  109    0   10    7    0     3    0  17   0  83
   2    0   0    0    48    2   40    0    9    3    0     5    0  10   0  90
   3    0   0    0   100   54   44    0    7    3    0     4    0  69   0  31
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0    0   332  103   35    0    6    3    0     0    0  45   0  55
   1    0   0    0    27    3   22    0    3    1    0     0    0  45   0  55
   2    0   0    0   142    3  172    0    9    6    0     4    0  18   0  82
   3    0   0    0   181   56  178    0   10    6    0     8    0   1   0  99
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0    0   327  103   54    0    5    2    0     2    0  49   0  51
   1    0   0    0    46    3   52    0    9    3    0     0    0  28   0  72
   2    0   0    0   156    2  175    0    9    5    0     4    0  25   0  75
   3    0   0    0   153   62  132    1    8    6    0     5    0   3   0  97
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0    1   380  103  164    0   11    9    0     7    0   5   0  95
   1    0   0    0   144    3  165    0   11    9    0     6    0  25   0  75
   2    0   0    0    39    1   36    0    6    4    0     2    0  66   0  34
   3    0   0    0   125   77   55    0   10    6    0     1    0  14   0  86
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0    0   372  107  178    0    9    9    0    19    0   3   0  97
   1    0   0    0   145    3  193    0    9   10    0     8    0   8   0  92
   2    0   0    0    24    2   26    0    9    3    0     0    0   2   0  98
   3    0   0    0    94   86    3    0    0    5    0     0    0 100   0   0

After another 200-300 MB have been written in bursts, the transfer nearly comes to a halt:

root@receiver:~# zpool iostat tank 5
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        1.85T  2.01T    511    831  7.76M  4.60M
tank        1.85T  2.01T      0     26      0  65.4K
tank        1.85T  2.01T      3     25  13.2K  62.4K
tank        1.85T  2.01T      4     34  5.00K  76.2K
tank        1.85T  2.01T      3     32  3.10K  87.0K
tank        1.85T  2.01T      0     25  1.20K  59.0K
tank        1.85T  2.01T      6     58  4.30K   339K
tank        1.85T  2.01T      1     28  3.70K  63.1K
tank        1.85T  2.01T     19     30  36.9K  83.2K

The mbuffer instances at both the sending and the receiving end are at 100% buffer utilization.
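For reference, the pipeline in use looks roughly like this (hostnames, ports, buffer sizes and snapshot names below are placeholders, not the exact values in use):

# sending side: create the incremental stream and buffer it before it hits the network
zfs send -i tank/fs@snap1 tank/fs@snap2 | mbuffer -s 128k -m 1G -O receiver:9090

# receiving side: buffer the incoming stream and feed it to zfs receive
mbuffer -s 128k -m 1G -I 9090 | zfs receive -F tank/fs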

This only seems to happen with larger (> 5 GB) snapshots; I see the stalls towards the end of the stream, after the first gigabytes have been transferred successfully in a reasonable time. Reception of the stream does continue, just very slowly - the data rate drops below 100 KB/s (from the receiving-side mbuffer's memory to the idle disks).

I have also tried taking mbuffer out of the equation and piping the zfs send stream through SSH directly into zfs receive, but it makes no significant difference. The snapshot eventually gets transferred, but the last 1-2 GB take hours.
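That variant was essentially just the following (again with placeholder names):

zfs send -i tank/fs@snap1 tank/fs@snap2 | ssh receiver zfs receive -F tank/fs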

The receiving host is otherwise completely idle, and no messages are printed to the console, the kernel message log or /var/log/syslog during that time. The destination zpool remains usable - I can still access other filesystems located in the same zpool at all times. Also, receiving complete, non-incremental / non-recursive ZFS streams hundreds of GB in size works at wire speed without any perceptible issues.

Is there a known problem with 11.1 in this regard? How could I further diagnose what the receiver is waiting for when it should be writing the stream?


Solution 1:

The zfs send and the zfs receive are tied to each other when you pipe them together like this. The source system has to crawl the ZFS metadata looking for blocks that were written within the incremental interval you are sending. You're then piping that through mbuffer, so that the stream across the SSH session can be smoothed out somewhat by presenting a bucket on each end that mitigates stalls and overruns. The pipe out of mbuffer then feeds a zfs receive, which has to process the incoming data exactly as if it were being written locally: X number of transactions per transaction group, flush to disk, calculate the metadata, and write it all out, all the way up to the uberblock. This can look very much like stalls in the stream, and those pauses generally last from 5 to 30 seconds. If the drops in throughput last longer than 30 seconds, you're probably resource-bound somewhere.
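One quick way to check whether the drops line up with transaction group syncs is to time spa_sync() with DTrace. This is only a sketch: it relies on the fbt provider and on the spa_sync function name as found in the common Solaris/illumos ZFS code, so it may need adjusting for your exact 11.1 kernel build:

# histogram of how long each transaction group sync takes, in milliseconds
dtrace -n '
  fbt::spa_sync:entry  { self->ts = timestamp; }
  fbt::spa_sync:return /self->ts/ {
    @["spa_sync duration (ms)"] = quantize((timestamp - self->ts) / 1000000);
    self->ts = 0;
  }'

If individual syncs take tens of seconds during the stalls, the bottleneck is on the write path rather than in the stream itself.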

Whether that is the case depends on how your system is tuned. For example, do you have a fast ZIL SLOG? Or do you have a very large number of spindles behind your zpool, letting you optimize with logbias=throughput? Depending on the answers to questions like these, you'll be able to determine whether or not you are resource-bound somewhere.
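To answer those questions on the receiving side, something along these lines will show whether a separate log device exists and how the target dataset is configured (pool and dataset names are placeholders):

# does the pool have a dedicated log (SLOG) device?
zpool status tank

# how is the receiving dataset set up for synchronous writes and record sizes?
zfs get logbias,sync,recordsize,compression tank/fs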

Your CPUs don't look too badly beaten up. I see servers on a daily basis with zpools sized at 250+ TB whose mpstat intr column counts run above 20,000. Adding more CPUs always improves the mpstat numbers.

I would look at some of the DTrace scripts such as zilstat, arcstat, and iopattern, among others (check the DTraceToolkit), to see what the system is doing during your pauses.
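As a starting point, assuming the scripts live somewhere like /opt/DTT (the path is an assumption, and zilstat/arcstat are distributed separately from the toolkit, so fetch them if they aren't already installed), run them at short intervals while the receive is stalled:

# ZIL commit activity - is the receive generating synchronous writes?
/opt/DTT/zilstat 5

# ARC hit/miss behaviour during the stall
/opt/DTT/arcstat.pl 5

# random vs. sequential I/O pattern on the pool disks
/opt/DTT/iopattern 5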