Site Goes Offline Every Day At Midnight - No One Knows Why

Solution 1:

You need to look at more logs. Check /var/log/messages around midnight (and the rotated copies, such as /var/log/messages.0 or /var/log/messages.1, for previous nights). Look at your httpd.conf to find where your Apache logs are stored (that file is usually in /etc/httpd/conf). The ErrorLog directive in that file tells you where Apache's error logging goes (typically a file named error_log). See what that file reports around midnight, and check the other files in /var/log for unusual activity you can correlate with the outage. The logs should tell you why the web server is failing at midnight.
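To make the "what happened around midnight" check less tedious, here is a minimal Python sketch that prints only the log lines whose timestamp falls within a few minutes of midnight. It assumes each line carries an HH:MM:SS time, as syslog and the Apache error log normally do. The function names and the 15-minute window are my own choices, and the first time-like string on a line is taken as the timestamp, so treat it as a rough filter.

```python
import re
import sys

# First HH:MM:SS found on the line is treated as the timestamp (a heuristic).
TIME_RE = re.compile(r"\b([01]\d|2[0-3]):([0-5]\d):([0-5]\d)\b")


def near_midnight(line, window_minutes=15):
    """Return True if the line's time is within window_minutes of 00:00."""
    match = TIME_RE.search(line)
    if not match:
        return False
    minutes = int(match.group(1)) * 60 + int(match.group(2))
    # Distance to midnight, wrapping around the day boundary.
    distance = min(minutes, 24 * 60 - minutes)
    return distance <= window_minutes


def filter_log(lines, window_minutes=15):
    """Keep only the lines logged close to midnight."""
    return [ln.rstrip("\n") for ln in lines if near_midnight(ln, window_minutes)]


if __name__ == "__main__":
    # Pass the log files to scan as arguments.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for line in filter_log(fh):
                print(f"{path}: {line}")
```

Run it against /var/log/messages and whatever file your ErrorLog directive points to, then compare the output across several nights.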

Solution 2:

According to the 'hits per hour' graph that you posted, you get 13,000+ requests in the midnight hour. This is your highest hour by far. When you do a 'service httpd restart' you see a warning that your MaxClients exceeds your ServerLimit, so Apache is lowering MaxClients to 200. This means you're allowing up to 200 httpd child processes. Your httpd processes are consuming about 40 MB each. 200 * 40 MB = 8,000 MB, or roughly 8 GB. MySQL is also taking up 300 MB, and the OS needs some too. You have no swap configured, so there is nothing to fall back on if that total exceeds physical RAM. Your disk cache is at zero according to the 'top' output that you posted, yet there is a lot of memory free. That's kind of weird and it's throwing me for a loop.
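The arithmetic above can be turned around to estimate a MaxClients value that fits in memory. This is only a sketch: the 200 clients, 40 MB per process and 300 MB for MySQL come from the numbers posted, while the 4096 MB of total RAM and the 512 MB kept back for the OS are placeholder assumptions you should replace with your own figures.

```python
def apache_memory_mb(max_clients, mb_per_child):
    """Worst-case memory if every allowed httpd child is running."""
    return max_clients * mb_per_child


def safe_max_clients(total_ram_mb, reserved_mb, mb_per_child):
    """Largest MaxClients that fits in RAM after reserving memory for others."""
    return max(0, (total_ram_mb - reserved_mb) // mb_per_child)


if __name__ == "__main__":
    # Figures from the thread: 200 clients at about 40 MB each.
    print("worst case MB:", apache_memory_mb(200, 40))

    # Hypothetical box: 4096 MB RAM (assumed), 300 MB MySQL (from the thread),
    # 512 MB for the OS (assumed).
    print("suggested MaxClients:", safe_max_clients(4096, 300 + 512, 40))
```

Measure the real per-process size under load before trusting the result, since 40 MB is only what 'top' showed at one moment.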

Linux might be invoking the OOM killer. Check the dmesg output for signs of that. I'd suggest lowering your MaxClients and/or increasing the amount of RAM (or possibly adding CPU power). You can also look in your Apache logs to find out what is hitting your site at this hour. If it is legitimate traffic, then increasing the RAM/CPU is the way to go. If it isn't, then mitigation is the path to take.
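If you'd rather not read the whole dmesg output by eye, a small Python sketch like this can pull out the lines that usually accompany an OOM kill. The patterns are the phrases the kernel typically logs ("invoked oom-killer", "Out of memory", "Killed process"), but exact wording varies between kernel versions, so an empty result is not proof that nothing was killed. The function name is my own.

```python
import re
import subprocess

# Phrases the kernel commonly logs around an OOM kill (wording varies by version).
OOM_RE = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)


def find_oom_lines(text):
    """Return the lines of kernel log text that look OOM-related."""
    return [line for line in text.splitlines() if OOM_RE.search(line)]


if __name__ == "__main__":
    try:
        # dmesg may need root on some systems; failures just yield no output.
        result = subprocess.run(["dmesg"], capture_output=True, text=True)
        kernel_log = result.stdout
    except OSError:
        kernel_log = ""
    for line in find_oom_lines(kernel_log):
        print(line)
```

The same function works on the contents of /var/log/messages if the kernel ring buffer has already rolled over.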

Solution 3:

Are you being spidered too aggressively?

Check your Apache logs and try making some adjustments to your robots.txt:

User-agent: BadBot
Disallow: /
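To find out which bot name to put in that rule, you can count user agents in the access log for the midnight hour. The Python sketch below assumes the log uses Apache's "combined" format, where the user agent is the last quoted field on each line. If your LogFormat differs, the regular expression needs adjusting. The function name is hypothetical.

```python
import re
import sys
from collections import Counter

# Captures the hour from "[10/Oct/2000:00:05:36 -0700]" and the last quoted
# field on the line, which is the user agent in the combined log format.
LINE_RE = re.compile(
    r'\[\d{2}/\w{3}/\d{4}:(\d{2}):\d{2}:\d{2} [^\]]*\].*"([^"]*)"\s*$'
)


def agents_in_hour(lines, hour=0):
    """Count requests per user agent for the given hour of day (0 = midnight)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and int(match.group(1)) == hour:
            counts[match.group(2)] += 1
    return counts


if __name__ == "__main__":
    # Pass the access log path(s) as arguments; prints the ten busiest agents.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for agent, hits in agents_in_hour(fh).most_common(10):
                print(f"{hits:8d}  {agent}")
```

Keep in mind that robots.txt only helps with crawlers that choose to honor it.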

Cheers

Solution 4:

May I suggest that you set up cron jobs that perform periodic monitoring during that time? Set up a script that records the CPU usage, memory usage, and so on of your services around midnight. You might also want to add a ping to that script so that you can confirm your server has a working network connection during the outage. The last thing I'd add to the diagnostic script is a wget request to your site, across the localhost interface, during the downtime.
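Here is a rough Python sketch of such a diagnostic script. It logs one line per run with the 1-minute load average (as a stand-in for CPU usage), free and cached memory from /proc/meminfo, a ping result, and the HTTP status of a request to the site. It uses urllib rather than wget for the HTTP check, and the ping flags assume the Linux iputils version of ping. The script name, function names, and the choice of ping target are all assumptions for illustration.

```python
import os
import subprocess
import sys
import time
import urllib.error
import urllib.request


def parse_meminfo(text):
    """Return {field: value} from /proc/meminfo-style text (values mostly in kB)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def ping_ok(host, timeout_s=2):
    """Send one ICMP echo; True if ping exits successfully."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
            timeout=timeout_s + 3,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def http_status(url, timeout_s=5):
    """Return the HTTP status code as a string, or a short error description."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return str(resp.status)
    except urllib.error.HTTPError as exc:
        return str(exc.code)
    except (urllib.error.URLError, OSError) as exc:
        return f"error: {exc}"


def snapshot(ping_host, url):
    """Build one log line describing the machine's state right now."""
    try:
        load1 = os.getloadavg()[0]
    except (OSError, AttributeError):
        load1 = float("nan")
    try:
        with open("/proc/meminfo") as fh:
            mem = parse_meminfo(fh.read())
    except OSError:
        mem = {}
    return (
        f"{time.strftime('%Y-%m-%d %H:%M:%S')} load1={load1:.2f} "
        f"MemFree={mem.get('MemFree', 'n/a')}kB Cached={mem.get('Cached', 'n/a')}kB "
        f"ping={'ok' if ping_ok(ping_host) else 'FAIL'} http={http_status(url)}"
    )


if __name__ == "__main__":
    if len(sys.argv) == 3:
        print(snapshot(sys.argv[1], sys.argv[2]))
    else:
        print("usage: midnight_probe.py PING_HOST URL", file=sys.stderr)
```

Run it from cron every minute from shortly before midnight until shortly after, appending the output to a file, with your gateway (or another host outside the box) as PING_HOST and http://localhost/ as URL. If the HTTP check fails while ping and memory look normal, that points at Apache itself and not at the network.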

It's possible that other systems at your hosting provider may be causing these problems - it may not be your server at all. Hopefully building a script to run server-side can give you additional diagnostic information, and help you to find the cause of the problem.

Is your server virtual? It's possible that your provider takes snapshots (from Dom0) at that time, which may freeze the guest domains (DomU), yours included.