Cause of page fragmentation on "large" server with xfs, 20 disks and Ceph
I thought I'd put an answer with my observations because there are a lot of comments.
Based on your output at https://gist.github.com/christian-marie/7bc845d2da7847534104, we can determine the following:
- The GFP_MASK for the attempted memory allocation allows it to do the following (there is a small decoding sketch after this list).
- Can access emergency pools (I think this means access memory below the high watermark for a zone)
- Don't use emergency reserves (I think this means don't allow access to memory below the min watermark)
- Allocate from one of the normal zones.
- Can swap in order to make room.
- Can drop caches in order to make room.
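For anyone who wants to double-check that reading against the raw mode value in the gist, here is a minimal decoding sketch. The flag bits are taken from my reading of a 3.x-era include/linux/gfp.h and they do move between kernel versions, so verify them against your own tree; the example mask at the end is made up for illustration and is not the value from the gist.

```
# GFP flag bits as I read them in a 3.x-era include/linux/gfp.h.
# These values are NOT stable across kernel versions -- check your own gfp.h.
GFP_FLAGS = {
    0x10:    "__GFP_WAIT",        # may sleep / enter reclaim
    0x20:    "__GFP_HIGH",        # high priority, may access emergency pools
    0x40:    "__GFP_IO",          # may start I/O (swap to make room)
    0x80:    "__GFP_FS",          # may call into the FS (drop caches)
    0x200:   "__GFP_NOWARN",      # suppress the allocation failure warning
    0x4000:  "__GFP_COMP",        # compound page, typical of slab high-order
    0x10000: "__GFP_NOMEMALLOC",  # never dip into the emergency reserves
}

def decode(mode):
    """Return the flag names set in a page allocation failure 'mode:' value."""
    return [name for bit, name in sorted(GFP_FLAGS.items()) if mode & bit]

print(decode(0x102d0))  # hypothetical example mask, not the one from the gist
```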
The zone fragmentation at the time is shown here:
[3443189.780792] Node 0 Normal: 3300*4kB (UEM) 8396*8kB (UEM) 4218*16kB (UEM) 76*32kB (UEM) 12*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 151056kB
[3443189.780801] Node 1 Normal: 26667*4kB (UEM) 6084*8kB (UEM) 2040*16kB (UEM) 96*32kB (UEM) 22*64kB (UEM) 4*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 192972kB
And memory utilization at the time is here:
[3443189.780759] Node 0 Normal free:149520kB min:40952kB low:51188kB high:61428kB active_anon:9694208kB inactive_anon:1054236kB active_file:7065912kB inactive_file:7172412kB unevictable:0kB isolated(anon):5452kB isolated(file):3616kB present:30408704kB managed:29881160kB mlocked:0kB dirty:0kB writeback:0kB mapped:25440kB shmem:743788kB slab_reclaimable:1362240kB slab_unreclaimable:783096kB kernel_stack:29488kB pagetables:43748kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[3443189.780766] Node 1 Normal free:191444kB min:45264kB low:56580kB high:67896kB active_anon:11371988kB inactive_anon:1172444kB active_file:8084140kB inactive_file:8556980kB unevictable:0kB isolated(anon):4388kB isolated(file):4676kB present:33554432kB managed:33026648kB mlocked:0kB dirty:0kB writeback:0kB mapped:45400kB shmem:2263296kB slab_reclaimable:1606604kB slab_unreclaimable:438220kB kernel_stack:55936kB pagetables:44944kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
The fragmentation in each zone is bad in the page allocation failure output. There are a lot of free order-0 pages and far fewer to no higher-order pages. A 'good' result would be plentiful free pages at every order, with the counts gradually shrinking as the order gets higher. Having zero pages at order 5 and above indicates fragmentation and starvation for high-order allocations.
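As an aside, you do not have to wait for the next allocation failure to see this distribution: /proc/buddyinfo exposes the same per-order free counts continuously. A rough sketch, assuming the standard buddyinfo line layout and the 4 KiB base pages this machine uses:

```
# Print free memory per order for each zone, from /proc/buddyinfo.
# Typical line: "Node 0, zone   Normal   3300  8396  4218  76  12  0 ..."
PAGE_KB = 4  # 4 KiB base pages

with open("/proc/buddyinfo") as f:
    for line in f:
        fields = line.split()
        node, zone, counts = fields[1].rstrip(","), fields[3], fields[4:]
        kb = [int(c) * PAGE_KB * (1 << o) for o, c in enumerate(counts)]
        print(f"Node {node} {zone}: " +
              " ".join(f"o{o}:{k}kB" for o, k in enumerate(kb)) +
              f" = {sum(kb)}kB")
```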
I am currently not seeing a convincing degree of evidence to suggest that the fragmentation during this period has anything to do with slab caches. In the resulting memory stats, we can see the following:
Node 0 = active_anon:9694208kB inactive_anon:1054236kB
Node 1 = active_anon:11371988kB inactive_anon:1172444kB
There are no huge pages assigned from userspace, so userspace will only ever claim order-0 memory. Between the two zones there is therefore over 22 GiB of defragmentable memory.
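For completeness, the 22 GiB figure is just the sum of the anonymous memory counters quoted above:

```
# Anonymous memory from the zone stats above; all of it is movable, order-0 (kB).
node0 = 9694208 + 1054236    # Node 0 Normal: active_anon + inactive_anon
node1 = 11371988 + 1172444   # Node 1 Normal: active_anon + inactive_anon
total_kb = node0 + node1
print(total_kb, "kB =", round(total_kb / 2**20, 1), "GiB")  # ~22.2 GiB
```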
Behaviours I cannot explain
When high-order allocations fail, it is my understanding that memory compaction is always attempted, in order to free up large contiguous regions so that the high-order allocation can succeed. Why does this not happen? If it does happen, why can it not find any memory to defragment when there is 22 GiB of it ripe for reordering?
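One experiment that might shed some light on this (a sketch, assuming CONFIG_COMPACTION is enabled and that you are happy to poke the box as root) is to force a manual compaction pass and then see whether the compact_* counters in /proc/vmstat move and whether the higher orders in the buddy lists refill:

```
# Force a compaction pass over all zones and show how the compaction
# counters moved.  Counter names vary a little between kernel versions,
# so just grab everything that starts with "compact_".
def compact_counters():
    with open("/proc/vmstat") as f:
        return {k: int(v) for k, v in (line.split() for line in f)
                if k.startswith("compact_")}

before = compact_counters()
with open("/proc/sys/vm/compact_memory", "w") as f:
    f.write("1")  # writing any value compacts all zones
after = compact_counters()

for key in sorted(after):
    print(f"{key}: {before.get(key, 0)} -> {after[key]}")
```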
Behaviours I think I can explain
This needs more research to understand properly, but I believe the allocation's ability to automatically swap or drop some page cache in order to succeed probably does not apply here, because there is still plenty of free memory available, so no reclaim occurs. There just isn't enough of it in the higher orders.
Whilst there is lots of free memory, and even a few free order-4 blocks left in each zone, the "total up the free memory at each order and deduct it from the real free memory" issue results in a 'free memory' figure below the 'min' watermark, which is what leads to the actual allocation failure.
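To make that concrete, here is a simplified re-implementation of what I believe the watermark check in __zone_watermark_ok() (mm/page_alloc.c) does, fed with the Node 0 Normal numbers from above. The real code also adjusts 'min' for the allocation flags and accounts for lowmem reserves, which I have left out:

```
# Free block counts per order for Node 0 Normal, from the report above
# (3300*4kB, 8396*8kB, 4218*16kB, 76*32kB, 12*64kB, then zeros).
free_area = [3300, 8396, 4218, 76, 12, 0, 0, 0, 0, 0, 0]

def watermark_ok(free_pages, min_pages, order, free_area):
    """Simplified version of the kernel's high-order watermark check."""
    if free_pages <= min_pages:          # order-0 check first
        return False
    for o in range(order):
        # Free blocks smaller than the requested order cannot satisfy the
        # allocation, so discount them from the free total.
        free_pages -= free_area[o] << o
        min_pages >>= 1                  # the kernel relaxes 'min' per order
        if free_pages <= min_pages:
            return False
    return True

free_kb, min_kb = 149520, 40952          # Node 0 Normal free / min, in kB
print(watermark_ok(free_kb // 4, min_kb // 4, 4, free_area))  # -> False
```

With these numbers the loop fails at order 2: after discounting the smaller blocks, the check is left with only about 1.6 MB of 'free' memory, well below the scaled watermark, so the order-4 allocation is refused even though ~150 MB is nominally free.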