Give me my KILL-power when something goes wrong

This has happened to us a few times now: suddenly our production server won't respond because a process is stuck in an infinite loop, or the MySQL server stops serving new requests because one query is blocking everything.

We SSH to the server and use ps aux or top to find the culprit, or mytop or SHOW FULL PROCESSLIST in MySQL to find the offending process ID, and kill it. Then of course we try to recreate the situation on the test server and fix the bug.
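For reference, the kill step we run by hand looks roughly like the sketch below. The function name stop_pid and the 5-second default grace period are made up for illustration; the MySQL equivalents are shown only as comments.

```shell
# Ask a process to exit with SIGTERM, then fall back to SIGKILL
# if it is still around after a grace period (in seconds).
stop_pid() {
    pid="$1"
    grace="${2:-5}"
    # If the signal cannot be delivered (process already gone, or not ours), stop here.
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$grace" -gt 0 ]; do
        # kill -0 only tests whether the PID still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        grace=$((grace - 1))
    done
    kill -KILL "$pid" 2>/dev/null
}

# MySQL side, run from a client session with sufficient privileges:
#   SHOW FULL PROCESSLIST;
#   KILL <id>;        -- <id> taken from the Id column of the process list
```

Usage would be something like stop_pid 12345 10, where 12345 is the PID found with ps aux or top.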

But sometimes the server is hung so badly that ps aux / top / mytop / SHOW FULL PROCESSLIST won't go through; even the admins are locked out.

What is the best way to ensure an admin can always access the server and kill offending processes or queries (both on Linux and MySQL)?

  • Can we allocate priorities to different users?
  • Reserve a part of the resources for root?

I've checked nice(1), but constantly having an open connection with nice -20 seems a bit excessive and difficult to work with (let alone dangerous as root).


Solution 1:

The pam_limits.so module is a nifty tool for limiting memory, open files, and so on, and for setting nice priorities for users and groups.

rpm -ql pam | grep limits
man limits.conf
less /etc/security/limits.conf
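
As a rough sketch of what such limits could look like, the snippet below writes an example file to the current directory for review. The group name appusers, the file name and all the values are assumptions for illustration, not recommendations. Note that pam_limits only applies to sessions opened through PAM (logins, su and the like), so a daemon started directly by the init system is not covered by it.

```shell
# Write an example limits file to a scratch location (hypothetical name).
out="./90-appusers.conf"
cat > "$out" <<'EOF'
# <domain>   <type>  <item>     <value>
# Cap the number of processes per user in the group.
@appusers    hard    nproc      200
# Cap the address space per process, in KB (2 GB here).
@appusers    hard    as         2097152
# Cap CPU time per process, in minutes.
@appusers    hard    cpu        10
# Start the group's processes at a lower priority (positive = lower).
@appusers    -       priority   5
EOF
cat "$out"
```

After review, a file like this would typically go under /etc/security/limits.d/ or be merged into /etc/security/limits.conf. Group and wildcard entries are not applied to root, which is what leaves the admin some headroom.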

Solution 2:

http://en.wikipedia.org/wiki/Magic_SysRq_key
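
A minimal sketch of checking whether SysRq is usable on a box. The helper name sysrq_status is made up; the /proc paths are the standard kernel interfaces. The commands that actually change or trigger anything are left as comments on purpose.

```shell
# Describe a kernel.sysrq setting:
# 0 = disabled, 1 = all functions enabled, anything else = bitmask of allowed functions.
sysrq_status() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "all functions enabled" ;;
        *) echo "bitmask $1 (only some functions enabled)" ;;
    esac
}

# Report the current setting if the kernel exposes it.
if [ -r /proc/sys/kernel/sysrq ]; then
    sysrq_status "$(cat /proc/sys/kernel/sysrq)"
fi

# As root, enable all SysRq functions until the next reboot:
#   echo 1 > /proc/sys/kernel/sysrq
# On the console keyboard, Alt+SysRq+f asks the kernel to run the OOM killer.
# The same request from a root shell, if you still have one:
#   echo f > /proc/sysrq-trigger
```

The keyboard combination is the part that matters when nobody can log in, since it is handled by the kernel itself rather than by a user process.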

Solution 3:

We use Dell servers that have a remote access network card (DRAC) installed that allows us to access the server out of band via ssh or a web browser. We can get to a console screen, or power cycle the server. Most major server vendors support some similar device.

This doesn't help if you want to log into a server that has no resources left to allow a login. Short of reserving resources for a login, though, it is the next best thing to physical access to the server.

It sounds like you have issues surrounding problem applications. Why do you have apps that are going into infinite loops and MySQL queries that are exhausting your server resources?