Inode tables shrinking sharply over time causing rsync/inode problems
We set up a Linux system on Amazon AWS (it appears to be CentOS-like, though we are not sure exactly what customizations were applied to it) with 4TB of storage as an XFS volume over LVM (ultimately to be served over NFS4, but it is not in use yet), and we are in the process of using rsync to sync files from one of our production NFS servers to this XFS volume (i.e. we rsync from a source mounted over NFS to a locally mounted XFS-on-LVM volume). However, at some point mid-transfer we noticed rsync becoming increasingly sluggish (throughput dropped sharply), and both load average and memory consumption rose considerably (with a very large proportion of CPU time spent in iowait). Eventually I rebooted the XFS system and it apparently returned to normal, with more normal rsync performance since, at least for the past 24 hours.
We checked the munin monitoring graphs and didn't notice anything obvious, but we did find that the "Inode table size" and "open inode" metrics (we checked the munin plugin implementation, which reads these values from /proc/sys/fs/inode-nr) kept decreasing over time. Shortly before we observed rsync getting stuck, both metrics had dropped from several hundred thousand to just a few thousand (our non-XFS servers stay at roughly 500k most of the time and do not show any monotonic decreasing trend over extended periods), and we observed kernel logs like these:
ip-XX-XXX-XXX-XXX login: [395850.680006] hrtimer: interrupt took 20000573 ns
Sep 18 17:19:58 ip-XX-XXX-XXX-XXX kernel: [395850.680006] hrtimer: interrupt took 20000573 ns
[400921.660046] INFO: task rsync:7919 blocked for more than 120 seconds.
[400921.660066] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[400921.660077] rsync         D ffff880002fe4240     0  7919   7918 0x00000000
[400921.660093] ffff8800683e5638 0000000000000282 ffff880000000000 0000000000014240
[400921.660131] ffff8800683e5fd8 0000000000014240 ffff8800683e5fd8 ffff88000726da40
[400921.660153] 0000000000014240 0000000000014240 ffff8800683e5fd8 0000000000014240
[400921.660176] Call Trace:
[400921.660202] [] schedule_timeout+0x1fd/0x270
[400921.660220] [] ? pvclock_clocksource_read+0x58/0xd0
[400921.660234] [] ? __raw_callee_save_xen_irq_enable+0x11/0x26
[400921.660247] [] __down+0x76/0xc0
[400921.660262] [] down+0x3b/0x50
[400921.660274] [] ? _raw_spin_unlock_irqrestore+0x19/0x20
[400921.660314] [] xfs_buf_lock+0x2b/0x80 [xfs]
[400921.660338] [] _xfs_buf_find+0x139/0x230 [xfs]
[400921.660360] [] xfs_buf_get+0x5b/0x160 [xfs]
[400921.660378] [] xfs_buf_read+0x13/0xa0 [xfs]
[400921.660401] [] xfs_trans_read_buf+0x197/0x2c0 [xfs]
[400921.660422] [] xfs_read_agi+0x6f/0x100 [xfs]
[400921.660443] [] xfs_ialloc_read_agi+0x29/0x90 [xfs]
[400921.660467] [] xfs_ialloc_ag_select+0x12b/0x280 [xfs]
[400921.660485] [] xfs_dialloc+0x3c7/0x870 [xfs]
[400921.660500] [] ? pvclock_clocksource_read+0x58/0xd0
[400921.660509] [] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
[400921.660531] [] xfs_ialloc+0x60/0x6a0 [xfs]
[400921.660550] [] ? xlog_grant_log_space+0x39c/0x3f0 [xfs]
[400921.660566] [] ? xen_spin_lock+0xa5/0x110
[400921.660583] [] xfs_dir_ialloc+0x7d/0x2d0 [xfs]
[400921.660606] [] ? xfs_log_reserve+0xe2/0xf0 [xfs]
[400921.660623] [] xfs_create+0x3f7/0x600 [xfs]
[400921.660638] [] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
[400921.660655] [] xfs_vn_mknod+0xa2/0x1b0 [xfs]
[400921.660678] [] xfs_vn_create+0xb/0x10 [xfs]
[400921.660689] [] vfs_create+0xa7/0xd0
[400921.660701] [] do_last+0x529/0x650
[400921.660714] [] ? get_empty_filp+0x75/0x170
[400921.660728] [] do_filp_open+0x213/0x670
[400921.660744] [] ? xen_spin_lock+0xa5/0x110
[400921.660753] [] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
[400921.660769] [] ? alloc_fd+0x102/0x150
[400921.660780] [] do_sys_open+0x64/0x130
[400921.660792] [] ? __raw_callee_save_xen_irq_disable+0x11/0x1e
[400921.660804] [] sys_open+0x1b/0x20
[400921.660815] [] system_call_fastpath+0x16/0x1b
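(As an aside, the raw values behind those munin graphs can be read directly on the box; this is roughly what we have been checking by hand, with the slabinfo line only as an extra sanity check on cached XFS inode objects:)

    cat /proc/sys/fs/inode-nr        # nr_inodes and nr_free_inodes, the values the munin plugin reads
    grep xfs_inode /proc/slabinfo    # number of cached XFS inode objects (needs root)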
We also observed a drastic increase in "lookup" operations as seen on the source NFS server when this happened; these had previously remained stable before we started to experience the rsync issue described above.
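(For reference, the per-operation NFS counters we are referring to can be read with nfsstat on the source server; the exact output layout depends on the nfsstat version and NFS protocol version in use.)

    nfsstat -s                  # server-side per-operation counters, including "lookup"
    watch -d -n 60 nfsstat -s   # highlight counters that change between samples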
We have not observed similar behaviour on our production volumes, which are ext3-based and in fact even larger. Other than the filesystem difference, the file servers are of a similar machine class and are set up in a similar manner. Since the inode table metrics on the XFS server are right now still on a decreasing trend similar to our earlier observation, even though we rebooted it only yesterday, I am concerned the same issue will haunt us again soon, and it likely reflects some issue with our setup, kernel, or whatever.
We were on inode64-mounted XFS volumes on x86_64-architecture machines when we experienced this. Right now we have copied about 1.3TB of data to the XFS volume, whose capacity is approximately 4TB, and we should have around 3TB of data in that volume once fully copied. The volume was created anew, so it has been inode64-mounted from the very beginning, when there was no data inside; the filesystem should therefore be clean, with inodes evenly distributed.
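(To rule out a mount-option mix-up we compare what the kernel reports against /etc/fstab; the mount point below is only a placeholder, and depending on the kernel version inode64 may or may not be echoed back explicitly among the active options.)

    grep xfs /proc/mounts    # mount options the kernel is actually using
    xfs_info /mnt/xfsvol     # allocation group geometry of the XFS volume (placeholder path)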
Any insights as to what might be the cause?
(p.s. in fact we started seeing this again since a few hours ago!)
Solution 1:
Enabling XFS delayed logging (the delaylog mount option) might help (see http://en.wikipedia.org/wiki/XFS#Disadvantages). CentOS is (in)famous for using a patched kernel, so a kernel upgrade may be needed.
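A minimal sketch of how that could look in /etc/fstab, assuming your running kernel is new enough to accept the option (device path and mount point are placeholders for your actual LVM volume):

    # placeholder device and mount point
    /dev/vg_data/lv_xfs  /mnt/xfsvol  xfs  inode64,delaylog  0 0

After remounting, the active options reported in /proc/mounts should include delaylog if the kernel honours it.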
Solution 2:
polynomial and AndreasM said what naturally comes to mind: it looks like a thrashing situation; you did not have enough memory.
Rsync collects the list of files to transfer in a first pass, and 1) traversing a large hierarchy over NFS is slow (local lstat() calls translated into remote NFS getattr requests are slow and non-cacheable, since you're only traversing once), and 2) this problem depends on the number of inodes (use df -i), not on the total amount of data to transfer.
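For example, comparing inode counts rather than byte counts on both ends gives a better idea of how heavy the rsync scan will be (paths are placeholders):

    df -i /mnt/nfs_source    # IUsed on the NFS-mounted source (placeholder path)
    df -i /mnt/xfsvol        # IUsed on the destination XFS volume (placeholder path)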
Note that using rsync -H|--hard-links is even more expensive: rsync must build a full hash table of all inodes to find duplicates.
Try running rsync directly against the file system on the NFS server itself, bypassing NFS altogether. Depending on your NFS latency, this might be a nice boost.
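A sketch of what that could look like, pulling over ssh so that the directory traversal happens locally on the file server instead of over NFS (hostname and paths are placeholders):

    rsync -a user@fileserver:/export/data/ /mnt/xfsvol/data/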
In some edge cases where traversing a large collection of inodes was actually more expensive than simply copying the data, I used something like ssh source 'tar -cC /path/to/src .' | tar -xC /path/to/dest, which is a simple streaming copy with constant memory usage. Depending on your CPU+network setup, adding compression might speed up the whole operation, or not (add -z on both tar invocations).
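With compression enabled on both ends, the same pipeline becomes:

    ssh source 'tar -czC /path/to/src .' | tar -xzC /path/to/dest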