How do you administer 20 or more Linux Servers daily?
I am researching the most efficient way to administer 20 Linux servers and 100 Linux workstations centrally.
Is there an administration and monitoring suite that would let me handle daily administration and troubleshooting from a single station?
I have one site with forty workstations and about fifteen compute nodes.
I manage the workstations by:
- forcing engineers to store all data on the NFS network, not locally
- not letting any engineer have root on any workstation for any reason
- having all systems syslog to a central syslog-ng host, with log parsing happening at regular intervals (usually daily, but sometimes as frequently as hourly)
- monitoring up/down status with Nagios
- having a repeatable kickstart environment -- rule of thumb is if a problem can't be fixed in thirty minutes, the machine gets re-kickstarted (in practice, we actually kickstart much faster than that because in this setup there's rarely thirty minutes of troubleshooting we can do), and if the kickstart fails we start swapping hardware
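The periodic log parsing mentioned above could look something like the following. This is only a minimal sketch, not the script used at this site: it assumes the central host writes lines in the traditional syslog layout ("Mon DD HH:MM:SS host program: message"), and the keyword list and the function name `count_alerts_by_host` are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout: "Jan  5 03:12:44 ws01 kernel: some message".
# Adjust the pattern if your syslog-ng template differs.
SYSLOG_RE = re.compile(
    r"^\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s(?P<host>\S+)\s(?P<msg>.*)$"
)

# Hypothetical keywords that mark a line as worth a look.
ALERT_WORDS = ("error", "fail", "panic")


def count_alerts_by_host(lines, alert_words=ALERT_WORDS):
    """Return a Counter mapping host name -> number of suspicious lines."""
    counts = Counter()
    for line in lines:
        match = SYSLOG_RE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        message = match.group("msg").lower()
        if any(word in message for word in alert_words):
            counts[match.group("host")] += 1
    return counts
```

Run from cron against the day's (or hour's) log file, `count_alerts_by_host(open(path))` gives a per-host tally, and `.most_common()` on the result lists the noisiest machines first.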
I manage the compute farms pretty much the same way, except:
- there is a local /scratch directory where anyone can write anything -- however the contents of that directory are not guaranteed
- performance/usage counters are collected with Munin from a central host
- network activity is tracked by using Cacti to monitor the switch ports the farm nodes are connected to
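For anyone who wants to see what an up/down check boils down to before setting up Nagios, here is a rough sketch. It is not how Nagios works internally; it only tests whether a TCP port accepts connections. The function names and the choice of port 22 (SSH) are assumptions for illustration.

```python
import socket


def is_up(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def down_hosts(hosts, port=22, timeout=3.0):
    """Return the subset of hosts that did not accept a connection."""
    return [host for host in hosts if not is_up(host, port, timeout)]

# Example with hypothetical host names:
#   down_hosts(["ws01.example.com", "node01.example.com"])
```

A real monitoring system adds scheduling, retries, and notifications on top of this, which is the main reason to use one rather than a script.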
It isn't perfect, but it has kept this one site going.
(Oh, I should mention that this site doesn't have any on-site full-time IT people; IT support is part-time and on-demand. The monitoring systems above can usually let you know when there's a computer in distress.)
On the automation front, you have several options, including:
- cfengine
- puppet (currently what I use)
- chef
On the monitoring side I would suggest either Icinga or Nagios, which are very similar (Icinga began as a fork of Nagios).
Hope this helps. The real thing to do is plan out just what you want to use the automation and monitoring for, and then choose the best solution based on your requirements. Every tool has its own advantages and disadvantages, so plan carefully before you select one.
I get the most value out of using chef to manage my servers' configuration. Monit, SEC, Collectd, and Icinga help me monitor them.