How can I shut down (power off) cluster nodes during low load?

I'm developing software for the energy consulting business. In monitoring energy use in datacenters, I've noticed that the typical electric load "pattern" of a datacenter is just a flat line, because all the gear runs 24/7. If you compare this to the actual usage pattern (network load, CPU usage, etc.), which we did, you regularly see long stretches with little usage while the full capacity remains available.

These patterns are very predictable in many cases, and to save energy it would be great to turn off part of the equipment (servers, switches, storage) on a schedule or in low-load conditions. However, I can think of several aspects that would have to be looked at, including:

  • handling peak loads or sudden spikes
  • data consistency among nodes
  • long startup (and, possibly, synchronization) times compared to average uptime of a node

There's probably more. Is there software that handles such a scenario and what else should be looked out for? Is this a viable suggestion to make?

For my purposes, a cluster doesn't necessarily mean machines clustered at the OS level; identical hosts that receive requests via a load balancer (i.e. application-level clustering) would also count. I'm not sure how MySQL Cluster or similar products work, but I'd probably count those as well.

I'm looking for advice for any operating system.

See also my post on energy efficiency over at Stack Overflow that brought up this question.


Solution 1:

Power

Use switched PDUs so that you can turn servers and switches on and off out-of-band. This is OS- and device-independent, which greatly simplifies the configuration and logic that powers things on and off. If your servers all have network-enabled IPMI interfaces, you can use those instead. I would recommend against trying to turn things on and off using higher-level mechanisms such as wake-on-LAN.
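For the IPMI route, a minimal sketch of how a script might drive power actions through the ipmitool CLI is shown here. The host address, user name, and password file path in the usage example are placeholders, the helper names are made up for illustration, and the sketch assumes the BMCs speak IPMI v2.0 (the lanplus interface).

```python
import subprocess

# Power actions this sketch allows; "soft" requests an ACPI shutdown
# from the OS, "off" cuts power immediately.
ALLOWED_ACTIONS = {"on", "off", "soft", "status"}


def ipmi_power_command(host, user, password_file, action):
    """Build the ipmitool argument list for a chassis power action.

    host, user and password_file are site-specific (assumed values);
    the password is read from a file so it never shows up in `ps`.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported power action: {action!r}")
    return [
        "ipmitool", "-I", "lanplus",
        "-H", host,
        "-U", user,
        "-f", password_file,
        "chassis", "power", action,
    ]


def run_power_action(host, user, password_file, action, timeout=30):
    """Run the command and return ipmitool's output as text.

    Raises subprocess.CalledProcessError if ipmitool exits non-zero
    and subprocess.TimeoutExpired if the BMC does not answer in time.
    """
    cmd = ipmi_power_command(host, user, password_file, action)
    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout, check=True
    )
    return result.stdout.strip()
```

Something like `run_power_action("10.0.0.5", "admin", "/etc/ipmi.pass", "soft")` would then ask that node for a clean shutdown. Preferring "soft" over "off" gives the OS a chance to flush data before the power goes away.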

Power up/down Logic

This could take many forms. Some clustering software (such as Moab) has a solution for this built in. Otherwise, you can write a script along the lines of the following pseudocode:

  1. Check overall cluster load
  2. If cluster load > threshold1, turn on some nodes
  3. If cluster load < threshold2, turn off some nodes

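Put that in cron and have it run every half hour. Keep threshold1 comfortably above threshold2 so that nodes are not switched on and off on alternate runs.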
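The decision step of that pseudocode could be sketched as follows. This is only an illustration: the function name, the default thresholds, and the idea of measuring load as average utilisation between 0.0 and 1.0 are all assumptions, and the extra check on projected load goes slightly beyond the three steps above.

```python
def plan_power_change(load, nodes_on, nodes_total,
                      high=0.75, low=0.30, step=1, min_on=1):
    """Return how many nodes to power on (positive) or off (negative).

    load        -- average utilisation of the running nodes, 0.0 to 1.0
    nodes_on    -- nodes currently powered on
    nodes_total -- nodes available in the cluster
    high / low  -- threshold1 / threshold2 from the pseudocode (assumed values)
    step        -- maximum number of nodes to change per run
    min_on      -- never go below this many running nodes
    """
    if not 0 <= low < high <= 1:
        raise ValueError("need 0 <= low < high <= 1")
    if min_on < 1:
        raise ValueError("min_on must be at least 1")

    if load > high:
        # Add capacity, but not more nodes than actually exist.
        return min(step, nodes_total - nodes_on)

    if load < low:
        remove = min(step, nodes_on - min_on)
        if remove > 0:
            # Same work spread over fewer nodes: only shrink if the
            # projected load stays below the power-on threshold.
            projected = load * nodes_on / (nodes_on - remove)
            if projected < high:
                return -remove

    return 0
```

A cron-driven script would call this once per run, for example `plan_power_change(0.9, 4, 10)`, and then power the chosen nodes on or off. The gap between `low` and `high` is what keeps the cluster from oscillating.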

Clustering Software Stack

Obviously, you'll need to make sure your clustering software stack can deal with nodes going up and down all the time. Do a lot of testing here, and consider obscure timing issues (booting takes time) and any race conditions that crop up in the power up/down logic you use.
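One timing issue worth illustrating: a node that was powered on a few minutes ago is not yet carrying load, so a naive script may see the load still high and power on yet another node. A rough sketch of tracking such in-flight boots follows. The class name and the 10-minute timeout are assumptions, and because cron starts a fresh process each run, a real script would have to persist this state (in a file, for instance) rather than keep it in memory.

```python
import time


class PendingBoots:
    """Remember which nodes were asked to power on and are not ready yet."""

    def __init__(self, boot_timeout=600.0, clock=time.monotonic):
        self.boot_timeout = boot_timeout  # seconds a boot may take (assumed)
        self._clock = clock               # injectable for testing
        self._started = {}                # node name -> time of power-on request

    def mark_powering_on(self, node):
        """Record that a power-on request was just sent to this node."""
        self._started[node] = self._clock()

    def mark_ready(self, node):
        """Record that the node has joined the cluster and takes load."""
        self._started.pop(node, None)

    def pending(self):
        """Nodes still within their boot window; count these as capacity."""
        now = self._clock()
        return sorted(n for n, t in self._started.items()
                      if now - t < self.boot_timeout)

    def timed_out(self):
        """Nodes that never reported ready; these need investigation."""
        now = self._clock()
        return sorted(n for n, t in self._started.items()
                      if now - t >= self.boot_timeout)
```

The power-up logic can then skip a run when `pending()` is non-empty, or add those nodes to the running count before comparing load against the thresholds.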

Solution 2:

VMware

The latest version of their enterprise product, vSphere 4, can power down hosts that are not needed to meet capacity and wake them up when needed, redistributing the virtual machines in real time. Combine this with the energy savings that you get from consolidating your hardware on a virtualized platform and you can achieve significant power savings.

Solution 3:

This was mentioned over on Planet Ubuntu just today. The post can be found here. It talks about the development of a practical solution to power up/down machines on demand in a cloud using PowerNap.