What's the first thing you check when an untouched unix server starts going berserk?
Solution 1:
First Order: Is it responsive?
If you can't log in, there are bigger problems afoot. This generally comes in two flavors: hardware failure and software failure. Both are potentially catastrophic. To prevent DFA errors, check the general hardware health first - a simple glance-over will usually suffice.
Second Order: Are the system's underlying structures in good health and order?
Check the "Golden Triad" of systems:
- Enough CPU time is free for processing
- Enough disk space is free for storage
- Enough memory is free for workloads
In the last few decades, the triad has expanded into a "quad" which includes communications (networking):
- Connectivity is functional, responsive, and has capacity
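A one-minute sweep of the quad might look like this - a minimal sketch assuming a Linux box; flags and package names vary by platform, and the gateway address is a placeholder:
uptime               # load averages - a quick read on CPU pressure
df -h                # free disk space per filesystem
free -m              # memory and swap usage in MB
ping -c 3 192.0.2.1  # basic connectivity (placeholder address; use your gateway or a known-good host)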
Third Order: What is the severity of the issue?
What programs or services are affected? In decreasing order of severity, is it systemic (system-wide), clustered (a group of programs), or isolated (a specific program)? Clusters of programs typically trip up because a specific underlying service has failed or gone unresponsive. Systemic issues are sometimes related to this as well (think DNS outages or IP conflicts), but knowing where to look is usually the key.
Fourth Order: Are diagnostic tools providing useful data relevant to the issue? Now that you have information about the health of the system (second order) and which parts of it are experiencing issues (third order), it should be much easier to narrow down where the problem is.
Error messages or log files should be a common waypoint on this journey.
CPU issues:
- uptime (or cat /proc/loadavg) for load averages
- top
- strace
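For example (Linux/procps flags; the PID is a placeholder):
uptime                   # 1/5/15-minute load averages
top -b -n 1 | head -20   # one-shot snapshot of the busiest processes
strace -c -p 1234        # syscall summary for a suspect process (placeholder PID; needs privileges; Ctrl-C prints the summary)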
Disk space / I/O issues:
- df
- du
- lsof
- iostat
- vmstat
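Some invocations I'd reach for first (GNU/Linux flags; iostat comes from the sysstat package):
df -h                      # free space per filesystem
df -i                      # inode usage - a full inode table looks like a full disk
du -xsh /var/* | sort -h   # which directories are biggest (sort -h is GNU coreutils)
lsof +L1                   # deleted-but-still-open files quietly holding space
iostat -x 2                # extended device stats every 2 seconds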
Memory issues:
- free
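For example (free's output layout differs between older and newer procps versions):
free -m     # totals in MB; on modern Linux, watch the "available" column
vmstat 2    # the si/so columns show swap-in/swap-out pressure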
Connectivity issues:
- ping
- route (and arp and rarp and friends)
- iptables or ipchains on Linux, or ipfw (for those BSD folks out there)
- traceroute or mtr
- host, nslookup, or dig
- netstat
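Sample invocations (hostnames and addresses are placeholders; ss has largely replaced netstat on newer Linux):
ping -c 3 192.0.2.1      # raw reachability by IP, bypassing DNS (placeholder address)
dig example.com +short   # is name resolution actually working?
mtr -rw example.com      # one-shot per-hop loss/latency report
netstat -tulpn           # listening sockets and owning processes (root needed to see all PIDs)
ip route                 # current routing table (iproute2)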
Most common complaint (that I hear):
Email is not delivering fast enough (more than a minute from send to receipt by the recipient), or the mail server is rejecting my attempt to send. This usually comes down to the rate limiter in Postfix kicking in during a spam storm, which impacts the ability to accept internal delivery.
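For reference, this is the sort of knob involved - a sketch using Postfix's anvil-based client limits; the parameter choice and values here are illustrative, not the exact settings from any particular setup:
postconf -e 'smtpd_client_connection_rate_limit = 10'   # max connections per client per time unit
postconf -e 'smtpd_client_message_rate_limit = 30'      # max messages per client per time unit
postfix reload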
A real-life example:
However, this is not always the case. One time the issue persisted regardless of service restarts, so after 3 minutes it was time to start looking around. CPU was busy but under 100%, yet the load had soared to 15 on a box with just 2 cores and was threatening to go higher. The top command revealed that the mail system was in overdrive, along with the mail scanner, but there were no amavis child processes to be seen. That was the clue: the mail queue command (mailq) showed some 150+ undelivered messages from the last 20 minutes, over 80% of which were spam. A quick adjustment to lower the rate limiter (reducing the intake rate of the spam storm) while increasing the number of child email scanner processes (to help chew through the backlog), followed by a service restart, resolved the issue, and the system completed its deliveries in short order.
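A rough reconstruction of that diagnosis (standard Postfix/amavis tooling; the config path and max_servers value are illustrative):
uptime              # load of 15 on 2 cores - work is queueing up somewhere
mailq | tail -1     # queue summary line, e.g. "-- ... Kbytes in 150 Requests."
pgrep -fl amavisd   # no output = the scanner parent and children are gone
# in amavisd.conf (path varies by distro), raise the scanner child count:
#   $max_servers = 6;
# then lower the Postfix intake rate and restart the services involved.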
The cause of the problem was that the amavis parent process had keeled over dead, and the child processes had eventually all run their course (they self-terminate after a set number of scans to prevent memory leaks). So there were SMTP processes in Postfix attempting to contact...thin air...to do the spam/virus scanning that was needed. The distro I was using had out-of-date packages that would never be updated; since the installation was due to be replaced in a year or so, I manually "overrode" the install with the latest version, which included several bug fixes. I haven't had the same problem since.
Solution 2:
usually "who" followed by "last"
A heap of issues on machines I've managed over time have been because of a very loose definition of "untouched" - often someone has done something :)
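For example:
who            # who is logged in right now
last | head    # recent logins and reboots - was the box really "untouched"?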
Solution 3:
Well, I'll start.
This one bit me once: I spent hours trying thousands of different things, disabling services here and there, rebooting, etc. What was the problem? Totally out of disk space.
So, here's the first thing I type when debugging a suddenly troubled server:
df -h
I never forget that now. It just saved me lots of wasted effort. Thought I'd share.
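A related gotcha worth checking in the same breath: a full inode table also produces "No space left on device" errors, even when df -h shows free blocks:
df -i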
Solution 4:
top (or htop)
Solution 5:
If you can, I would always try shutting down all NICs bar the management one.
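On Linux with iproute2 that might look like the following (the interface name is a placeholder - triple-check you are not connected through eth1 yourself first):
ip link set eth1 down   # needs root; bring it back with "ip link set eth1 up"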