(How) can I use syslog to diagnose mysterious crashes?

I run a LAMP stack on Ubuntu server. Twice today it has ground to a halt and required a hard reboot. Prior to this it had been stable for many months. It has been about a fortnight since the last change to the configuration and even this was minor.

Rummaging around, I found this in syslog a few minutes before a failed cron email alerted me to the outage:

Sep 22 18:14:33 rfb CRON[4912]: (tom) CMD (/usr/share/rfb-scripts/rrdtool-updater.sh)
Sep 22 18:15:40 rfb CRON[4923]: (tom) CMD (/usr/share/rfb-scripts/rrdtool-updater.sh)
Sep 22 18:16:36 rfb CRON[4952]: (tom) CMD (/usr/share/rfb-scripts/rrdtool-updater.sh)
Sep 22 18:16:48 rfb kernel: [16220.076166] apache2 invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Sep 22 18:16:48 rfb kernel: [16220.076173] Pid: 4908, comm: apache2 Not tainted 2.6.32.9-rscloud #6
Sep 22 18:16:48 rfb kernel: [16220.076175] Call Trace:
Sep 22 18:16:48 rfb kernel: [16220.076186]  [<ffffffff8105ca35>] ? T.383+0x78/0x239
Sep 22 18:16:48 rfb kernel: [16220.076191]  [<ffffffff8103dea6>] ? timekeeping_get_ns+0xe/0x2e
Sep 22 18:16:48 rfb kernel: [16220.076195]  [<ffffffff8105cd2a>] ? __out_of_memory+0x134/0x14b
Sep 22 18:16:48 rfb kernel: [16220.076198]  [<ffffffff8105cdab>] ? out_of_memory+0x6a/0x94
Sep 22 18:16:48 rfb kernel: [16220.076202]  [<ffffffff8105f96f>] ? __alloc_pages_nodemask+0x47f/0x56b
Sep 22 18:16:48 rfb kernel: [16220.076206]  [<ffffffff81061112>] ? __do_page_cache_readahead+0x9e/0x1a1
Sep 22 18:16:48 rfb kernel: [16220.076212]  [<ffffffff810384bd>] ? wake_bit_function+0x0/0x2e
Sep 22 18:16:48 rfb kernel: [16220.076214]  [<ffffffff81061231>] ? ra_submit+0x1c/0x20
Sep 22 18:16:48 rfb kernel: [16220.076217]  [<ffffffff8105af33>] ? filemap_fault+0x17e/0x2ff
Sep 22 18:16:48 rfb kernel: [16220.076221]  [<ffffffff8106ea79>] ? __do_fault+0x54/0x335
Sep 22 18:16:48 rfb kernel: [16220.076224]  [<ffffffff8106f008>] ? handle_mm_fault+0x2ae/0x636
Sep 22 18:16:48 rfb kernel: [16220.076228]  [<ffffffff81014211>] ? do_page_fault+0x299/0x2ae
Sep 22 18:16:48 rfb kernel: [16220.076232]  [<ffffffff81226e18>] ? page_fault+0x28/0x30
Sep 22 18:16:48 rfb kernel: [16220.076235] Mem-Info:
Sep 22 18:16:48 rfb kernel: [16220.076237] DMA per-cpu:
Sep 22 18:16:48 rfb kernel: [16220.076238] CPU    0: hi:    0, btch:   1 usd:   0
Sep 22 18:16:48 rfb kernel: [16220.076240] CPU    1: hi:    0, btch:   1 usd:   0
Sep 22 18:16:48 rfb kernel: [16220.076242] CPU    2: hi:    0, btch:   1 usd:   0
Sep 22 18:16:48 rfb kernel: [16220.076244] CPU    3: hi:    0, btch:   1 usd:   0
Sep 22 18:16:48 rfb kernel: [16220.076245] DMA32 per-cpu:
Sep 22 18:16:48 rfb kernel: [16220.076247] CPU    0: hi:  155, btch:  38 usd:  49
Sep 22 18:16:48 rfb kernel: [16220.076249] CPU    1: hi:  155, btch:  38 usd:  37
Sep 22 18:16:48 rfb kernel: [16220.076251] CPU    2: hi:  155, btch:  38 usd:  22
Sep 22 18:16:48 rfb kernel: [16220.076252] CPU    3: hi:  155, btch:  38 usd:  45

The first 3 lines are included to demonstrate normal operation. The log goes on, but I'll spare you that excitement. (If you really want to see it, try this pastebin.)

My question, then, is: can I use this log to diagnose my problem? If so, how? And a bonus point for anyone who can tell me what I need to do to prevent it from happening again.


Solution 1:

Agreed, apache2 was just the unfortunate process that got killed here, but Apache has been the actual culprit for me before. Apache, especially with Perl, PHP, or mod_python, will often keep allocating RAM for some web application, and as different clients hit different Apache processes, each process keeps growing in memory utilization.

If your traffic is just high enough to keep Apache processes alive, you may end up with 256 Apache processes running. But it doesn't have to be anywhere near that limit; I have had the oom-killer give me a bad day with 30 Apache processes using 250-300MB of RAM each.
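A quick way to check whether your Apache children are the ones ballooning (a rough sketch; swap in httpd if that is what your processes are called):

# Apache children sorted by resident memory, biggest first
ps -C apache2 -o pid,rss,vsz,cmd --sort=-rss | head -n 15

# Rough total RSS across all Apache children, in MB
ps -C apache2 -o rss= | awk '{sum += $1} END {print sum/1024 " MB"}'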

Increasing swap will buy you a little time to get on the box and see what is going on, but you also need to be warned early enough that you can see which process is eating up RAM and verify whether it is indeed Apache.
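If you do want that extra breathing room, a swap file is quick to add (a sketch; the 1GB size and /swapfile path are arbitrary examples, and you would add an /etc/fstab entry to make it survive a reboot):

# Create and enable a 1GB swap file
dd if=/dev/zero of=/swapfile bs=1M count=1024
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile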

Use free in a cron job, or Cacti and SNMP with thresholds. With free, you will need to watch free RAM and buffers/cache, total them, and alert when that total drops below some bottom threshold.
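As a minimal sketch of the free-in-a-cron-job idea (the 100MB threshold, the recipient address, and the presence of a mail command are all assumptions you would adjust):

#!/bin/sh
# Alert when free RAM plus buffers/cache drops below a threshold (in MB).
# Column positions assume the older "free" layout (free/buffers/cached);
# newer procps versions expose an "available" column you could use instead.
THRESHOLD=100
AVAIL=$(free -m | awk '/^Mem:/ {print $4 + $6 + $7}')
if [ "$AVAIL" -lt "$THRESHOLD" ]; then
    echo "Only ${AVAIL}MB of memory available on $(hostname)" \
        | mail -s "Low memory warning" you@example.com
fi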

Another thing, if it is Apache, is to bring your MaxClients down; the default can be quite high. Or bring MaxRequestsPerChild down to a number low enough to recycle processes every so often. This is only a band-aid, but it could let you limp along long enough to figure out the underlying problem.
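For reference, those directives live in the prefork MPM section of apache2.conf; a sketch, with purely illustrative numbers rather than recommendations:

<IfModule mpm_prefork_module>
    # Cap the number of simultaneous children so they cannot exhaust RAM
    MaxClients           30
    # Recycle each child after N requests to contain slow memory leaks
    MaxRequestsPerChild  500
</IfModule>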

Just my stab in the air, Scott M.

Solution 2:

Something tries to allocate more memory than you have. You could obviously add more swap, but that will kill the performance of your box. Maybe add a small script that runs periodically [even more often than once per minute], for example:

date >> /some/file      # timestamp this snapshot
ps faux >> /some/file   # full process tree with %MEM/VSZ/RSS columns

and review the output afterwards; you might spot the process allocating more and more memory.
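To run it more often than cron's one-minute floor, you could wrap it in a simple loop (a sketch; the 15-second interval and the /var/log/ps-snapshots path are arbitrary choices):

#!/bin/sh
# Snapshot the process list every 15 seconds until stopped
while true; do
    date >> /var/log/ps-snapshots
    ps faux >> /var/log/ps-snapshots
    sleep 15
done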