GCE Stackdriver Logging Agent (fluentd) memory leak with COS

I have a VM on GCE where I run a custom Docker image. It is installed on COS (cos-stable-74-11895-125-0) on a g1-small (1 vCPU, 1.7 GB memory) instance.

It is an Elasticsearch server with locked memory settings. It consumes exactly 1 GB of RAM.

The setup worked perfectly for nearly a year, but it suddenly stopped working because the VM runs out of memory.

On the serial console, I can see that the oom-killer was invoked and selected the java process to be killed. After a quick restart it works perfectly again, but after about a day or two it fails with the same out-of-memory error.

It turned out that GCE installs and runs a Stackdriver Logging Agent alongside my container. According to the oom-killer log, a related ruby process consumes a significant amount of RAM. After a reboot it uses about 50 MB of RAM, and within a few hours it climbs to 300 MB. From my point of view, this looks like a memory leak.
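For completeness, this is roughly how I watch the agent container's memory use on the VM. A minimal sketch, assuming the agent shows up under the container name stackdriver-logging-agent (verify with docker ps first):

    # List running containers to locate the logging agent
    docker ps --format '{{.Names}}\t{{.Image}}'

    # One-off snapshot of the agent container's memory usage;
    # drop --no-stream to watch it grow over time
    docker stats --no-stream stackdriver-logging-agent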

To be clear: the ES server receives no load; only a periodic uptime check makes a request about every 5 minutes. As a result, there is no large volume of logs, only a few lines per hour. A whole year of logs would not take up 300 MB of storage or memory.

I tried to examine the memory footprint using the ps and top commands, but the results were totally implausible. According to the system, the ruby task consumes 1-10 GB of physical memory (RSS) and up to 70 GB of virtual memory (VSZ). (Swap is disabled according to the free command.) These figures cannot be accurate, as the VM does not have that much RAM.
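In case it helps anyone reproduce the measurement, this is roughly how I sampled the numbers. It assumes a procps-style ps; the minimal ps shipped with COS may not support --sort, in which case the same command can be run inside the agent container:

    # RSS/VSZ (in KiB) of the fluentd ruby workers, largest RSS first
    ps -eo pid,rss,vsz,etime,args --sort=-rss | grep '[r]uby'

    # Overall memory picture of the VM for comparison
    free -m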

I updated the OS to a more recent version (cos-81-12871-119-0). It definitely helped: ES has now been running for more than 5 days without a problem. But the ruby process's memory consumption is still worrying. According to the ps and top commands, it uses 70-300 MB of physical memory (RSS) and up to 7 GB of virtual memory (VSZ).

I found that the ruby process belongs to fluentd, which has had a similar memory problem reported on its GitHub. As the installation and configuration are done automatically by GCE, I have not found a way to change its settings.
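The effective fluentd configuration can at least be inspected, even though I did not find a supported way to override it. The container name and paths below are assumptions based on the usual google-fluentd layout, so they may differ:

    # Main fluentd config inside the agent container (assumed path)
    docker exec stackdriver-logging-agent \
      cat /etc/google-fluentd/google-fluentd.conf

    # Additional config snippets, if any (also an assumed path)
    docker exec stackdriver-logging-agent \
      ls /etc/google-fluentd/config.d/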

The easiest workaround would be to disable the installation of the logging agent by changing the VM metadata (google-logging-enabled=false), but I would like to find out why this happens and how it can be solved.
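For reference, this is how that metadata change would be applied; the instance name is a placeholder, and I assume the VM needs a reboot for the change to take effect:

    # Disable the built-in Stackdriver/Cloud Logging agent on this instance
    gcloud compute instances add-metadata my-es-instance \
      --zone=europe-west3-c \
      --metadata=google-logging-enabled=false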

I'm really curious whether anybody has had a similar problem, and what the solution was.

Stackdriver Logging Agent Docker image: gcr.io/stackdriver-agents/stackdriver-logging-agent:0.2-1.5.33-1-1

The error can be reproduced. I have a development and a production environment in separate GCP projects, and the error appeared in both at roughly the same time (around May 28, 2020). Based on this, I think it was caused by a change on the Stackdriver side, but I have not found any evidence in the logs. (Both VM instances are in the europe-west3 region, zone c.)

I would appreciate any help or advice. Thanks!


Solution 1:

For anyone experiencing similar problems:

cos-77-12371-1109-0 running on g1-small uses memory as follows:

MiB Mem :   1692.4 total,   1409.3 free,    176.0 used,    107.1 buff/cache

Cloud Monitoring agent has its own memory requirements:

A minimum of 250 MiB of resident (RSS) memory is recommended to run the Monitoring agent.

On top of that, there's Docker:

It is an Elasticsearch server with locked memory settings. It consumes exactly 1 GB of RAM.

So 1409 - 1000 - 250 = 159 MB is left, which is less than a 10% margin.

To avoid out-of-memory issues, consider disabling fluentd or creating a VM with more RAM (a sketch of the resize commands follows below).
If you think you have found a bug in GCP, you can file an issue on the Public Issue Tracker.
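If resizing is the route you take, it can be done in place. The instance name, zone, and target machine type below are only examples; the instance has to be stopped before its machine type can be changed:

    # Move the VM to a machine type with more RAM (example: e2-medium, 4 GB)
    gcloud compute instances stop my-es-instance --zone=europe-west3-c
    gcloud compute instances set-machine-type my-es-instance \
      --zone=europe-west3-c --machine-type=e2-medium
    gcloud compute instances start my-es-instance --zone=europe-west3-c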