EC2: Relatively small network out spikes cause 100% CPU usage

Background

Hello,

I'm running a server on a free-tier EC2 instance, with nginx as the web server and Passenger/Rails as the application server. The server receives little real traffic (it's still in development), but a fair amount of traffic from random bots. It also serves images from S3. The front end is served statically at mywebsite.com, defined in one server block, and the back end is served through Passenger at api.mywebsite.com.

The Problem

Seemingly at random, CPU usage on the server goes to 100%. The CPU spikes are correlated with network-out spikes, although the network-out spikes themselves are relatively small. When this happens, the front end can no longer be served, and I can't even SSH into the server to check which processes are running.

What I've tried

Blocking malicious bots using a bad-bot-blocker list for nginx.
Correlating network spikes with requests in the nginx access log at /var/log/nginx/access.log. Usually the correlation is pretty unclear.
Looking at /var/log/nginx/error.log for anything relevant.

When this happens, I usually end up rebooting the server from the EC2 console, which works, but obviously isn't sustainable.

I'm new to deployment stuff/DevOps, so I was wondering if there's anything obvious I might be missing based on this information. I'm not even sure which layer is causing the problem (AWS, nginx, my Rails back end, or the vanilla HTML/JS front end). If there's any other information I can provide, please let me know.

Thanks,

Jacob

You need to work out which requests are causing the CPU spikes. Start with your access logs. It may be a small set of IPs, but more likely it's random bots attacking your server. You have to cope with that; it's normal on the internet.
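As a rough starting point for that log analysis, here is a minimal Python sketch. It assumes the access log uses nginx's default "combined" format; the function names (`summarize`, `busiest_minutes`, `report_file`) are made up for illustration. It totals the bytes sent per minute, which should line up roughly with the network-out graph, and lists the busiest client IPs in each of the heaviest minutes.

```python
import re
from collections import Counter, defaultdict

# Matches the start of a line in nginx's default "combined" log format:
# $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent ...
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def summarize(lines):
    """Return (bytes sent per minute, request count per IP per minute)."""
    bytes_per_minute = Counter()
    ips_per_minute = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        # "10/Oct/2023:13:55:36 +0000" -> "10/Oct/2023:13:55"
        minute = match.group("time")[:17]
        sent = match.group("bytes")
        bytes_per_minute[minute] += int(sent) if sent != "-" else 0
        ips_per_minute[minute][match.group("ip")] += 1
    return bytes_per_minute, ips_per_minute


def busiest_minutes(lines, top=5):
    """List the minutes with the most bytes sent, with their top 3 client IPs."""
    bytes_per_minute, ips_per_minute = summarize(lines)
    return [
        (minute, total, ips_per_minute[minute].most_common(3))
        for minute, total in bytes_per_minute.most_common(top)
    ]


def report_file(path, top=5):
    """Print the busiest minutes for the log file at `path`."""
    with open(path, errors="replace") as handle:
        for minute, total, ips in busiest_minutes(handle, top):
            print(f"{minute}  {total} bytes  top IPs: {ips}")
```

Calling `report_file("/var/log/nginx/access.log")` (adjust the path to wherever your logs live) prints the heaviest minutes. If they match the timestamps of the CPU spikes, the IPs listed next to them are the first ones to look at.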

I would start by putting the server behind Cloudflare. Make sure you change your security group so that only Cloudflare's IP ranges and your own IP address can reach the server directly. This may block some of the bad actors.
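To make the security group change less tedious, something like the following Python sketch could generate the inbound rules. It is only an outline under a few assumptions: `build_ingress_rules` and `apply_rules` are hypothetical helper names, the CIDR list has to be copied from Cloudflare's published IP ranges, and `apply_rules` relies on the third-party boto3 library with AWS credentials already configured.

```python
import ipaddress


def build_ingress_rules(cidrs, ports=(80, 443), admin_cidr=None):
    """Build EC2 IpPermissions entries allowing `cidrs` on the given TCP ports.

    `admin_cidr` (for example "203.0.113.7/32") optionally adds an SSH rule
    for your own address.
    """
    v4, v6 = [], []
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr.strip())  # raises ValueError if invalid
        (v4 if network.version == 4 else v6).append(str(network))

    rules = []
    for port in ports:
        rule = {"IpProtocol": "tcp", "FromPort": port, "ToPort": port}
        if v4:
            rule["IpRanges"] = [
                {"CidrIp": c, "Description": "Cloudflare"} for c in v4
            ]
        if v6:
            rule["Ipv6Ranges"] = [
                {"CidrIpv6": c, "Description": "Cloudflare"} for c in v6
            ]
        rules.append(rule)

    if admin_cidr:
        rules.append({
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            "IpRanges": [{"CidrIp": admin_cidr, "Description": "admin SSH"}],
        })
    return rules


def apply_rules(group_id, rules):
    """Add the rules to a security group (assumes boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so the builder works without it

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(GroupId=group_id, IpPermissions=rules)
```

This only adds rules. Any existing rule that opens ports 80, 443, or 22 to 0.0.0.0/0 still has to be removed from the security group, otherwise bots can keep bypassing Cloudflare by hitting the instance's IP directly.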

Next, you can set up Fail2Ban, and optionally configure it to block bad actors using the Cloudflare firewall (article link).
