Better way to do [linux] system & environment validation?

Is there a better way to do environment validation? The use case is a virtualized environment of nearly 300 servers built by someone else, which I need to validate before accepting them (i.e. before I install custom software and discover issues after the fact).

These are all currently done manually with a paper checklist

  • ssh to a linux server [ this is so the following tests are run from the box ]
  • for each server it communicates with:
    • ping -c 20 each target server that the Linux server is expected to communicate with, and review packet loss and RTT avg/max/deviation
    • telnet to the target servers to make sure the appropriate ports are open and accessible for the services they offer (e.g. 1433 SQL Server, 3306 MySQL, 80 web service, 25 SMTP)
  • nslookup to make sure the server is set up in DNS

Is there a better way to do system validation?

These are all currently done manually with a paper checklist

  • ssh to linux server
  • cat /proc/cpuinfo to check that the CPU core count and clock speed match what was requested
  • df to check the disk space allocated
  • free -m to check memory amount

Are there examples of a better approach, such as setting expected values or ranges in the checks and then simply running 'all' tests for pass/fail checking?


Solution 1:

Short answer: script it.

Longer answer: All of the tests you mention above can be automated through simple OS tools. As an example, the ping command can be run, then the return code checked and used to determine pass/fail status. It will take a bit more time to create the script, but it will save significant time running through all 300 of your machines to be tested.
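As a rough illustration of that idea, here is a minimal Python sketch that wraps the ping, telnet and nslookup steps from the question as pass/fail functions. The function names and default values are made up for this example. Note that on Linux, ping normally exits with status 0 as soon as at least one reply arrives, so the return code alone says "reachable", not "no packet loss".

```python
import socket
import subprocess


def ping_ok(host, count=20, runner=subprocess.run):
    """Return True if `ping -c <count> <host>` exits with status 0.

    Exit status 0 usually means at least one reply was received;
    it does not guarantee zero packet loss.
    """
    result = runner(
        ["ping", "-c", str(count), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection succeeds (stands in for the telnet test)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def resolves(hostname):
    """Return True if the name resolves (stands in for the nslookup test)."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def run_checks(targets):
    """Run all checks; `targets` maps a host name to a list of TCP ports."""
    results = []
    for host, ports in targets.items():
        results.append((host, "dns", resolves(host)))
        results.append((host, "ping", ping_ok(host)))
        for port in ports:
            results.append((host, "port %d" % port, port_open(host, port)))
    return results
```

Calling something like run_checks({"db01.example.com": [1433]}) (a placeholder host name) on each server under test would return a list of (host, check, passed) tuples that can be printed or collected centrally.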

Solution 2:

In addition to what you listed, I also recommend including the following checks as a bare minimum:

  • List of open ports: make sure that only the necessary ports are open and nothing else
  • List of installed packages should match your predefined list, no extras
  • List of user accounts should match your predefined list, no extras
  • List of groups and their members should match your predefined list, no extras
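All four checks above come down to comparing an actual list against a predefined one and flagging anything extra or missing. A minimal Python sketch of that comparison follows; the function names are invented for this example, and the user and group helpers only see what the local account database exposes through the standard pwd and grp modules.

```python
import grp
import pwd


def compare_to_baseline(name, actual, expected):
    """Compare two collections and report items that are extra or missing."""
    actual, expected = set(actual), set(expected)
    extra = sorted(actual - expected)
    missing = sorted(expected - actual)
    return {
        "check": name,
        "passed": not extra and not missing,
        "extra": extra,
        "missing": missing,
    }


def local_users():
    """Return the sorted account names known to this machine."""
    return sorted(entry.pw_name for entry in pwd.getpwall())


def local_group_members():
    """Return 'group:member' strings for every explicit group membership.

    Users who belong to a group only as their primary group are not
    listed in gr_mem, so they do not appear here.
    """
    return sorted(
        "%s:%s" % (entry.gr_name, member)
        for entry in grp.getgrall()
        for member in entry.gr_mem
    )
```

For example, compare_to_baseline("users", local_users(), expected_users) fails as soon as an unexpected account exists, which matches the "no extras" requirement; the same function works for a list of open ports or installed packages gathered by other means.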

You asked: "Are there examples of a better approach, such as setting expected values or ranges in the checks then simply run 'all' tests for pass/fail checking?"

Some of the checks need a defined tolerance. For example, when checking the available disk space, the values most probably won't be exactly the same on all servers, so your check will need a threshold for what is acceptable. Similarly, a few missed pings may be acceptable, so instead of requiring 100% returned pings, validating for > 95% might be more practical. On the other hand, for some things you should have zero tolerance, such as the list of open ports.
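One way to express such thresholds is a table of limits that every measurement is checked against. The sketch below assumes the summary format printed by the common Linux ping ("N% packet loss" and "rtt min/avg/max/mdev = ..."); other ping implementations print different text. The function names and the limit values are illustrative only.

```python
import re


def parse_ping_summary(output):
    """Extract packet loss and RTT figures from ping's summary lines."""
    stats = {}
    loss = re.search(r"([\d.]+)% packet loss", output)
    if loss:
        stats["loss_pct"] = float(loss.group(1))
    rtt = re.search(r"=\s*([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)\s*ms", output)
    if rtt:
        # Groups are min/avg/max/mdev, in that order.
        stats["rtt_avg_ms"] = float(rtt.group(2))
        stats["rtt_max_ms"] = float(rtt.group(3))
        stats["rtt_mdev_ms"] = float(rtt.group(4))
    return stats


def within(value, minimum=None, maximum=None):
    """Return True if value lies inside the inclusive [minimum, maximum] range."""
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    return True


def run_all(measured, limits):
    """Check every limit; a measurement that is absent counts as a failure."""
    results = {}
    for name, (minimum, maximum) in limits.items():
        value = measured.get(name)
        results[name] = value is not None and within(value, minimum, maximum)
    return results


# Example limits (assumed values): at most 5% loss, average RTT at most 10 ms.
LIMITS = {"loss_pct": (None, 5.0), "rtt_avg_ms": (None, 10.0)}
```

Zero-tolerance checks fit the same table by setting minimum and maximum to the same value, while disk space or memory would typically get only a minimum.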

With 300 servers to check, paper-based methods will not work. By the time you finish checking all the machines, some might have already failed quietly. So yes, you have to script it. It shouldn't be too hard to piece it together. Create something that somewhat works and if you get stuck ask on UNIX SE or Stack Overflow for help. Once you have something fully working, you can ask on Code Review for further optimization and cleaning.

It's definitely worth investing in scripting this, so that you can easily rerun the tests to check the health of your server farm.

Solution 3:

Several years later, but the answer I was looking for was found in:

http://www.ansible.com

gather_facts: true

Since Ansible is agentless and works over ssh, it already covers the ssh access need.

The gather_facts feature already collects much of the needed data about the target system; it is just a matter of evaluating it (example for disk space: https://stackoverflow.com/questions/26981907/using-ansible-to-manage-disk-space ).

I've not evaluated Ansible for the networking/firewall checks from the target system, but it looks very doable!
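To give an idea of what "evaluating it" can look like, here is a small Python sketch that checks a dictionary of gathered facts against minimum values. It assumes the facts for a host have already been saved as JSON by some means, and that they contain the usual ansible_processor_vcpus, ansible_memtotal_mb and ansible_mounts entries; the function names and thresholds are invented for this example.

```python
import json


def load_facts(path):
    """Load saved facts; accepts a file with or without an 'ansible_facts' wrapper."""
    with open(path) as handle:
        data = json.load(handle)
    return data.get("ansible_facts", data)


def check_facts(facts, min_vcpus, min_mem_mb, min_root_bytes):
    """Return a pass/fail result per resource for one host."""
    root = next(
        (m for m in facts.get("ansible_mounts", []) if m.get("mount") == "/"),
        None,
    )
    return {
        "cpu": facts.get("ansible_processor_vcpus", 0) >= min_vcpus,
        "memory": facts.get("ansible_memtotal_mb", 0) >= min_mem_mb,
        "root_disk": root is not None
        and root.get("size_total", 0) >= min_root_bytes,
    }
```

The reported memory is usually a little below the nominal size of the VM, so the memory minimum is better set slightly under the requested amount than exactly at it.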

Solution 4:

If you have access to a Linux server, I would try using nmap to scan the network; it can then report back which servers are responding and what services are running on them. Keep in mind that this scan could cause issues (depending on the services running on the servers you are scanning), and you should get consent from the servers' owners before running it.