how to find out what is causing huge dentry_cache usage?

Solution 1:

Late, but maybe useful for others who come upon this.

If you are using the AWS SDK on that EC2 instance, it is highly likely that curl (via its NSS backend) is causing the dentry bloat. While I haven't seen this trigger the OOM killer, it is known to hurt server performance because of the extra work the OS has to do to reclaim slab memory.
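
To confirm it really is the dentry cache eating the memory, the slab caches are worth a quick look. A minimal sketch (run as root; the exact columns in /proc/slabinfo vary a bit by kernel version):

# dentry slab: active objects, object size, pages in use
grep '^dentry' /proc/slabinfo
# one-shot view of the largest slab caches, sorted by cache size
slabtop -o -s c | head -n 15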

If you can confirm that curl is being used to hit HTTPS endpoints (many of the AWS SDKs do this), then the solution is to upgrade the nss-softokn library to at least v3.16.0 and set the NSS_SDB_USE_CACHE environment variable for the process that uses libcurl (YES and NO are both valid values; you may have to benchmark to see which performs curl requests more efficiently).
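
A rough sketch of what that looks like on an RPM-based distro; the application path is a placeholder for whatever process in your environment is using libcurl:

rpm -q nss-softokn                  # check the installed version
yum update nss-softokn              # upgrade if it is older than 3.16.0
export NSS_SDB_USE_CACHE=YES        # or NO; benchmark both for your workload
/path/to/your-curl-using-app        # restart the libcurl-using process with the variable set

For a daemon you would set the variable in its init script or systemd unit environment rather than an interactive shell.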

I recently ran into this myself and wrote a blog entry (old blog entry link and upstream bug report) with some diagnostics & more detailed information, in case that helps.

Solution 2:

You have a few options. If I were in this situation, I would start tracking the stats in:

# cat /proc/sys/fs/dentry-state 
87338   82056   45      0       0       0

Watch those numbers over time to see how fast they are growing. If the rate is fairly regular, I think you could identify possible culprits in two ways. First, the output of lsof might show that some process is leaving deleted file handles lying around. Second, you could strace the applications using the most resources and look for an excessive number of filesystem-related calls (open(), stat(), etc.).
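
Roughly, something like the following, run as three separate snippets (the <pid> is a placeholder for whatever process you suspect, and the exact syscalls worth tracing depend on the application and libc version):

# log dentry-state once a minute in the background; the first two fields are total and unused dentries
while true; do
    echo "$(date '+%F %T') $(cat /proc/sys/fs/dentry-state)"
    sleep 60
done >> /var/tmp/dentry-state.log &

# open-but-deleted files (link count < 1) that a process is still holding
lsof +L1

# per-syscall counts for a suspect process; look for huge open()/stat() totals
strace -f -c -e trace=open,openat,stat,lstat,access -p <pid>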

I am also curious about @David Schwartz's comment. I haven't seen the dentry cache cause the OOM killer to kill things, but maybe that happens if the dentries are all still referenced and active? If that is the case, I'm pretty confident lsof would expose the issue.