What can cause ALL services on a server to go down while it still responds to ping, and how do I figure out the cause?
It has already happened to me twice within a few days that my server goes down completely: http, ssh, ftp, dns, smtp, basically ALL services stop responding, as if the server had been turned off, except that it still responds to ping, which is what baffles me the most.
I do have some PHP scripts that put a huge load (CPU and memory) on the server in short bursts, used by a small group of users, but the server usually survives these bursts perfectly well, and when it goes down it never coincides with such usage peaks (I'm not saying it can't be related, but it doesn't happen right after them).
I'm not asking you to magically tell me the ultimate cause of these crashes; my question is: is there a single process whose death could cause all of these services to go down simultaneously? The funny thing is that every network service goes down except ping. If some process were eating 100% of the CPU, the server wouldn't respond to ping either. If Apache crashed because of (for example) a broken PHP script, that would affect http only, not ssh and dns, etc.
My OS is CentOS 5.6.
Most importantly, after hard-rebooting the server, what system logs should I look at? /var/log/messages doesn't reveal anything suspicious.
(tl;dr: still responding to ping is expected behaviour; check your memory usage)
ICMP echo requests (i.e. ping) are handled by the in-kernel networking stack, with no dependency on any userspace process.
The kernel is "memory resident", meaning it is always kept in RAM and cannot be swapped out to disk the way a regular application can.
This means that in situations where you run out of physical memory, applications are swapped to disk, but the kernel stays where it is. When both physical memory and swap are full (and the system can no longer manage your programs), the machine falls over. However, because a) the kernel is still in memory and b) it can respond to ping requests without help from anything else, the system will keep answering ping despite everything else being dead.
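For a quick live check of how close the box is to that point, the stock tools are enough; a minimal sketch:

    # Show physical and swap usage in megabytes; swap "free" near zero means the
    # machine is approaching the state described above
    free -m
    swapon -s    # per-device swap usage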
Regarding your problem, I'd strongly suspect memory issues. Install "sysstat" and use the "sar" command to see a log of memory/CPU/load/IO usage over time. I would expect that at the time of a crash you'd see both physical memory and swap at 100%.
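For example, a minimal set of commands on CentOS 5 might look like this (the day number in the sa file name is just an example; sysstat keeps one file per day under /var/log/sa):

    # Install sysstat; the package adds a cron job that samples system
    # activity every 10 minutes into /var/log/sa/
    yum install sysstat

    # After the next hang, review memory/swap and load around the crash time
    sar -r                        # memory and swap utilisation for today
    sar -r -f /var/log/sa/sa14    # same, for the 14th of the month (example day)
    sar -q                        # run queue length and load averages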
I would also look at dmesg or /var/log/messages for any sign of the OOM killer (out-of-memory killer) being invoked. This is the kernel's emergency mechanism, which starts killing processes when memory is exhausted. Its effectiveness depends largely on which processes get killed: a single process eating up all the memory will be killed and its memory freed, whereas an Apache-based website will simply spawn replacement processes as soon as a child process is killed.
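To check for that after a reboot, something along these lines should work (the exact wording of the kernel message varies between kernel versions, hence the broad patterns):

    # Look for OOM-killer activity in the kernel ring buffer and in syslog,
    # including rotated copies of /var/log/messages
    dmesg | grep -i "out of memory"
    grep -iE "out of memory|oom-killer" /var/log/messages*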
Usually, it's an I/O or disk subsystem issue. Often this is coupled with an extremely high system load average. For example, the system detailed in the graph below became unresponsive (yet was still pingable) when a script ran awry, locked a bunch of files, and the load rose to 36... on a 4-CPU system.
The services that run from RAM and do not require disk access continue to run... Thus the network stack (ping) is up, but the other services stall as soon as disk access is required: SSH when a key is referenced or a password lookup is needed, for example. SMTP tends to shut down when the load average hits 30 or so...
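If you can still get anything to run locally while it's in this state, vmstat is a quick way to tell an I/O stall from plain CPU or memory exhaustion; a sketch:

    # Print statistics every 5 seconds; watch the 'b' column (processes blocked,
    # usually on I/O) and 'wa' (CPU time spent waiting for I/O)
    vmstat 5
    # A consistently high 'b' count with 'wa' close to 100 points at the disk
    # subsystem rather than CPU or memory.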
When the system is in this state, try a remote nmap against the server's IP to see what's up.
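For example, from another machine (the hostname below is just a placeholder):

    # Probe a few of the affected services; -Pn skips host discovery, since we
    # already know the box answers ICMP (older nmap versions spell it -PN/-P0)
    nmap -Pn -p 22,25,53,80 server.example.com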
Your logging probably doesn't work if this is a disk or storage issue...
Can you describe the hardware setup? Is this a virtual machine? What is the storage layout?
More than logging, you want to see if you can graph the system's performance over time and understand when this is happening. See if it correlates with a specific activity.
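If sysstat is already collecting data, sadf (which ships with it) can dump the history in a delimiter-separated form that is easy to graph; a sketch, assuming the default /var/log/sa location (the day number is just an example):

    # Export load-average history for the 14th as semicolon-separated values,
    # ready for a spreadsheet or gnuplot
    sadf -d /var/log/sa/sa14 -- -q > load_history.csv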