Best practice for automated Linux updates

We are working on a way to perform automatic updates for our RHEL/RHEL-based servers.

Initial idea: Using Puppet, we disable the default repositories and point to our own. Then, we use ensure => latest for the packages we want to automatically update.
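A minimal sketch of that setup follows. It assumes CentOS-style repo IDs (base, updates, extras) and a hypothetical internal mirror at repo.example.internal. The repo IDs you need to disable depend on your distribution and subscription setup, and the package list is only illustrative.

```puppet
# Disable the default repos (IDs are assumptions; adjust to what your hosts use).
yumrepo { ['base', 'updates', 'extras']:
  enabled => '0',
}

# Point hosts at the internal, tested mirror (hypothetical URL).
# Single quotes keep $releasever/$basearch literal so yum, not Puppet, expands them.
yumrepo { 'internal-os':
  descr    => 'Internal mirror of tested OS packages',
  baseurl  => 'http://repo.example.internal/rhel/$releasever/os/$basearch/',
  enabled  => '1',
  gpgcheck => '1',
  gpgkey   => 'file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release',
}

# Packages to keep at the newest version available in the internal repo.
package { ['openssh-server', 'openssl']:
  ensure  => latest,
  require => Yumrepo['internal-os'],
}
```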

Problem: We are seeing that some services restart after an update (duh).

Question: Does anyone have advice on how to better automate Linux updates, and on strategies for mitigating the automatic restart of services? We'd prefer a solution that includes Puppet, but if we need to use another service, that is not a deal-breaker.

Edit

Possible solution: I submitted a solution that implements much of what @voretaq7 and @ewwhite suggested. This seems to be the route I am going with for the time being. If you have other suggestions, please comment or submit an answer.


Solution 1:

Your general update strategy is sound: you have a local repo (which I assume you test in a dev environment), and you update everything based off that (I assume known-good) repo.

The service restart thing is inevitable: if the underlying code has changed, you need to restart the service for that change to take effect. Failing to do so can lead to worse consequences (for example, a running process that is out of sync with an updated shared library, causing the application to crash).
In my environment I consider the quarterly patch windows to be quarterly "REBOOT ALL THE THINGS!" windows too. The advantage of such a policy is that you know your servers will come back up after a restart, and you know they'll work properly (because you test them regularly).
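Some RHEL packages already restart their own service from RPM scriptlets during an update. If you also want the restart to be explicit in Puppet and tied to the package change, one sketch (ntp/ntpd is a stand-in for any package/service pair) is to subscribe the service to its package:

```puppet
package { 'ntp':
  ensure => latest,
}

# subscribe orders the package first and sends a refresh event, so Puppet
# restarts ntpd only in runs where the package resource actually changed.
service { 'ntpd':
  ensure    => running,
  enable    => true,
  subscribe => Package['ntp'],
}
```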


My best advice to you is to schedule the software releases (maybe this means you'll have to trigger them "manually" with Puppet), and advise your users of the planned maintenance/downtime.
Alternatively (or as part of this), you can configure redundancy in your environment so that a few machines or services can be restarting while you still provide service to the end users. This may not eliminate disruptions entirely, but it can help minimize them.
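If you want Puppet itself to enforce the release schedule, its schedule resource type can restrict when a resource is applied. A sketch with a hypothetical Sunday 02:00-04:00 window (evaluated on the agent's local clock):

```puppet
# Hypothetical maintenance window; adjust range/weekday to your policy.
schedule { 'patch_window':
  range   => '02:00 - 04:00',
  weekday => 'Sunday',
  period  => daily,
  repeat  => 1,
}

# The upgrade is only applied by agent runs that fall inside the window.
package { 'httpd':
  ensure   => latest,
  schedule => 'patch_window',
}
```

A schedule only limits when the resource may be applied; the agent still has to run during the window, otherwise the upgrade never happens.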

The added redundancy also protects you in the event of hardware failures, which are inevitable on a long-enough time scale.
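With redundancy in place, you can also keep members of a pool from patching (and restarting) at the same time. A sketch using Puppet's fqdn_rand function to give each node a stable, per-host weekday within a hypothetical 02:00-04:00 window:

```puppet
# fqdn_rand(7) returns a value from 0 to 6 that stays the same for a given
# node on every run, but varies between nodes (0 = Sunday for weekday).
schedule { 'staggered_patch_window':
  range   => '02:00 - 04:00',
  weekday => fqdn_rand(7),
}

package { 'httpd':
  ensure   => latest,
  schedule => 'staggered_patch_window',
}
```

fqdn_rand only spreads nodes pseudo-randomly, so two members of a redundant pair can still land on the same day. If that matters, assign the weekday explicitly per node instead.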

Solution 2:

Is there necessarily a problem with restarting a service after a package update? Test on a small scale before you deploy to see if there are any issues. I recently had an ugly issue with the rpmforge package of DenyHosts: it actually changed the location of its configuration and work directories between package revisions during a yum update. That's totally undesired behavior. Typically, within the same release of RHEL, there aren't too many issues, but you can never be sure without testing and watching the effects closely.

Another option is to selectively update services. Do you always need the latest packages, for example? This goes back to understanding your reasons for running updates. What is the real goal?
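Selective updating maps directly onto the package ensure values. A sketch with hypothetical package choices: latest only for the packages you've decided to track, and installed for packages that should be present but never upgraded automatically. A specific version string is a third option when you need to pin.

```puppet
# Packages you have chosen to keep current from your own repo.
package { ['openssh-server', 'openssl']:
  ensure => latest,
}

# Installed if missing, but never upgraded by Puppet.
package { ['httpd', 'mysql-server']:
  ensure => installed,
}
```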

The advantage of running your own repo is that you can stage releases or rollouts and manage the schedule. What if you have a hardware peripheral or software vendor that requires RHEL 5.6 and would break under 5.7? That's one of the benefits to managing your own packages.
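One way to stage that, assuming your internal mirror keeps a frozen snapshot per minor release under a hypothetical URL layout, is to point each node at the snapshot it has been approved for:

```puppet
# Hypothetical: this node is approved for the 5.6 snapshot only.
$approved_release = '5.6'

# \$basearch is escaped so Puppet leaves it for yum to expand.
yumrepo { 'internal-os':
  descr    => "Internal RHEL ${approved_release} snapshot",
  baseurl  => "http://repo.example.internal/rhel/${approved_release}/os/\$basearch/",
  enabled  => '1',
  gpgcheck => '1',
}
```

Moving a node to 5.7 then becomes a deliberate change to one value rather than a side effect of the next update.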

Solution 3:

@Beaming Mel-Bin

This simplification eliminates the need to use SSH for-loop tools to start and stop Puppet.

First, change your manifests to set the noop metaparameter from a variable, noop_status, whose value is sourced from the ENC.

So you would have something like this in a class:

noop => $noop_status

Here, noop_status is a parameter set in your ENC. When you set noop_status to true, the manifest runs in noop mode only: Puppet reports what it would change without actually changing anything.
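Put in context, that might look like the sketch below. The class name apache is hypothetical, and noop_status is assumed to arrive as an ENC parameter, which Puppet exposes as a top-scope variable.

```puppet
# Assumed ENC output for a test host (YAML):
#   parameters:
#     noop_status: false
class apache {
  package { 'httpd':
    ensure => latest,
    noop   => $::noop_status,
  }

  service { 'httpd':
    ensure    => running,
    subscribe => Package['httpd'],
    noop      => $::noop_status,
  }
}
```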

If you have hundreds or thousands of hosts, you can use an ENC like Dashboard or Foreman that lets you mass-change parameters for many hosts by inheriting them at the "Hostgroup" or "Domain" level. You can then set the value to "false" for a small number of test hosts, overriding the Hostgroup value.

With this, any changes get applied to selected hosts only.

Changing one parameter in a central location can affect any number of hosts, without having to turn Puppet on and off with SSH for-loop tools. You can divide your hosts into multiple groups for safety and easier management.

Also note that instead of hard-coding package version numbers in manifests, you can put them in the ENC. Just as above, you can then selectively apply changes and manage rollouts.
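A sketch of reading the version from the ENC, with a fallback for hosts where it isn't set. The parameter name httpd_version and the version string in the comment are hypothetical.

```puppet
# Hypothetical ENC parameter, e.g. httpd_version: '2.2.15-29.el6'.
# Fall back to plain 'installed' when the ENC does not set a version.
if $::httpd_version {
  $httpd_ensure = $::httpd_version
} else {
  $httpd_ensure = 'installed'
}

package { 'httpd':
  ensure => $httpd_ensure,
}
```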

If you want more granularity (and complexity), you can even have per-class parameters, like noop_status_apacheClass and so on.

This may be harder to manage if you include classes in other classes.
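A per-class override might look like the sketch below, with a hypothetical ENC parameter noop_status_apache. A metaparameter set on a class declaration applies to the resources declared in that class, but a resource-like class declaration can only appear once per node, which is part of why nested includes make this harder to manage.

```puppet
# Only the apache class's resources follow this ENC-controlled noop flag.
class { 'apache':
  noop => $::noop_status_apache,
}
```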