How can I manage hundreds of IPMI BMCs?

Solution 1:

I'd probably use Ansible. It's a configuration management / orchestration engine that's far simpler to get started with than Puppet (Puppet used to be my go-to choice for this, but less so since I discovered Ansible).

The benefit of Ansible here is that it communicates directly over SSH, so you'd be able to get started using just your existing SSH credentials and workflow.

If you're currently configuring your BMCs with ipmitool, you'd be able to do something like:

Define a hosts file. This tells Ansible which hosts are in the bmc group (in this case), and therefore which hosts to run things on.

[bmc]
192.168.1.100
192.168.1.101
192.168.1.102

And so on... You can also use hostnames in that file, as long as they're resolvable.
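With hundreds of hosts you probably won't want to type that file by hand. Here is a minimal sketch of generating it, assuming the addresses are consecutive; the function name build_inventory and the starting address are only illustrative, not part of Ansible:

```python
import ipaddress


def build_inventory(group, first_ip, count):
    """Return INI-style inventory text for `count` consecutive addresses."""
    start = ipaddress.ip_address(first_ip)
    lines = ["[{}]".format(group)]
    # Adding an int to an ip_address object yields the next address.
    lines.extend(str(start + offset) for offset in range(count))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the output to the "hosts" file used by ansible-playbook -i hosts.
    print(build_inventory("bmc", "192.168.1.100", 3), end="")
```

If your addresses are not consecutive, the same idea works with a list read from whatever inventory source you already have.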

Then create a "playbook", which is the set of commands to run on each host in a host-group. You want to have this kind of top-down directory layout:

ansible/
   playbooks/
      bmc.yml
      roles/
        bmcconfig/
           files/
           handlers/
             main.yml
           tasks/
             main.yml
           templates/
   group_vars/
      all

A playbook has Roles, which are little sections of configuration that you can break down and reuse.

So I'd create a file called bmc.yml (all Ansible configuration is in YAML files):

---
- name: Configure BMC on the hosts
  hosts: bmc
  remote_user: root
  roles: 
    - bmcconfig

Then inside roles/bmcconfig/tasks/main.yml you can start listing the commands that are to be run on each host, to communicate with ipmi.

---
  - name: Install ipmitool
    apt: pkg=ipmitool state=present
  - name: Run ipmitool config
    shell: ipmitool -your -options -go -here

When you run the playbook with ansible-playbook -i hosts bmc.yml, the commands listed in tasks/main.yml for each role are executed in top-down order on each host in the bmc host group defined in hosts.

group_vars/all is an interesting file: it lets you define key-value pairs that can be used as variables in your playbooks.

So you could define something like

ipmitool_password: $512315Adb

in your group_vars/all and as a result, you'd be able to have something like:

shell: ipmitool -your -options -go -here -P '{{ ipmitool_password }}'

in the playbook.

You can find out a lot more about how to use the "modules" (the components of Ansible that allow you to do stuff), how to write your own :D, and so on at the Ansible Documentation Pages.

Solution 2:

I have written a small Python tool to run commands on our 1000 machines (and their BMCs, DRACs, iLOs and IMMs).

What I did was write a Python framework called vsc-manage, where I can run commands that are sent either to the server or to the BMC, and then configured which type of machine needs which command.

I have several classes that combine a mix of these commands.

So for machines with an IMM it will ssh to the IMM and run power off (in an expect-script kind of way).

For our IBM blade chassis it will run this on the chassis:

power -%(command)s -T system:blade[%(blade)s]

For some Dell DRACs it will run this on the OS (of a master node):

idracadm -r %(hostname)s -u root -p '%(password)s' serveraction %(command)s

For our newer HP systems that do IPMI (and I see more and more of these) it will run this on the master:

ipmitool -I lanplus -H %(hostname)s -U %(user)s -P '%(password)s' chassis power %(command)s

Newer Dell systems need ipmitool -I open instead; you might need to play with the protocol a bit.
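To give an idea of how templates like these can be tied to a class of hardware, here is a minimal sketch. It is not the actual vsc-manage code: the class names in the dictionary and the render_command helper are made up for illustration, and only the template strings come from the commands quoted above.

```python
# Hypothetical hardware class names mapped to the command templates above.
COMMAND_TEMPLATES = {
    "ibm_blade": "power -%(command)s -T system:blade[%(blade)s]",
    "dell_drac": (
        "idracadm -r %(hostname)s -u root -p '%(password)s' "
        "serveraction %(command)s"
    ),
    "ipmi_lanplus": (
        "ipmitool -I lanplus -H %(hostname)s -U %(user)s "
        "-P '%(password)s' chassis power %(command)s"
    ),
}


def render_command(hardware_class, **params):
    """Fill in the template for one hardware class and return the command."""
    try:
        template = COMMAND_TEMPLATES[hardware_class]
    except KeyError:
        raise ValueError("unsupported hardware class: %s" % hardware_class)
    # Old-style formatting with a dict fills every %(name)s placeholder;
    # a missing parameter raises KeyError, extra parameters are ignored.
    return template % params
```

Supporting a new kind of hardware then mostly means adding one more template. Note that a password containing a single quote would break the quoting in these strings, so real code would need to escape it.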

For settings not included in the IPMI standard I have implemented some things from the DMTF SMASH CLP, e.g. turning the locator LED on:

start /system1/led1

All of this is wrapped in a command-line tool that can be run from our laptops. It connects to the right master node, runs the right command for the right node, and returns the output, with an additional list of errors if any (based on output on stderr and/or the exit code).
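The error collection can be sketched roughly like this. It is an illustration of the idea, not the vsc-manage implementation: run_and_check is a hypothetical helper, and it runs the command locally instead of going through a master node.

```python
import subprocess
import sys


def run_and_check(argv, timeout=60):
    """Run argv and return (stdout, errors).

    errors is a list of strings: one entry for a non-zero exit code and
    one for any output on stderr. An empty list means no problem was seen.
    """
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    errors = []
    if proc.returncode != 0:
        errors.append("exit code %d" % proc.returncode)
    if proc.stderr.strip():
        errors.append("stderr: %s" % proc.stderr.strip())
    return proc.stdout, errors


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with the
    # real ipmitool (or ssh) argument list for your environment.
    output, errors = run_and_check([sys.executable, "-c", "print('ok')"])
    print(output, errors)
```

Passing the command as a list rather than a single shell string avoids quoting problems with passwords and hostnames.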

This has proven to be very handy, and adding support for a new class of hardware is relatively easy now (thanks to the fact that most vendors fully support IPMI and DMTF SMASH CLP these days).

This is not suited to initial configuration: it needs the BMC to have a unique IP and a correct gateway, which is what our vendors have to supply on delivery. It can do almost anything else, though, including running arbitrary commands on the host operating system, automatically scheduling downtime in Icinga/Nagios when you reboot a node, and acknowledging 1000 hosts and services in Icinga/Nagios at once.

Updating the BMC firmware and adding support for our switches are outstanding issues that are planned.

UPDATE

Since at least some people seemed interested I have given it a last polish today, and open sourced this at https://github.com/hpcugent/vsc-manage

Whilst this is very much targeted towards our own workflow (Quattor and/or PBS), I hope it can at least be interesting.