Good practice for managing package updates for lots of CentOS servers

In most of my environments, I use a kickstart and post-install script to get the main system up and current with updates as of that moment. I'll usually have a local repo that syncs with a CentOS mirror daily or weekly. I tend to freeze the kernel package at whatever's current at installation time and update other packages individually or as necessary. Often my servers have peripherals whose drivers are closely tied to kernel versions, so that's a consideration.
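
As a rough sketch of both pieces, assuming CentOS 5/6-era yum tooling (the repo IDs and paths below are just examples):

    # On the mirror host, e.g. from /etc/cron.daily/: pull the upstream
    # "updates" repository into a local directory served over HTTP.
    reposync --repoid=updates --download_path=/var/www/html/centos

    # On each client: freeze the kernel at whatever was installed by
    # excluding it from every yum transaction (append to /etc/yum.conf).
    echo "exclude=kernel*" >> /etc/yum.conf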

CentOS 5 has matured to the point where constant updates aren't necessary. But also keep in mind that CentOS 5 is winding down: the rate of updates has slowed somewhat, and they're now more in line with bug fixes than major functionality changes.

So in this specific case, the first thing you could do is build a local mirror/repo. Use your existing configuration management to control access to third-party repos. Maybe schedule a policy to yum update critical or public-facing services (ssh, http, ftp, dovecot, etc.). Everything else will require testing, but I get the feeling that most environments don't run with fully updated/patched systems.
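
For example (the mirror URL, package list, and schedule below are placeholders, not a prescription for your hosts):

    # Publish the synced packages as a yum repo (re-run after each reposync):
    createrepo --update /var/www/html/centos/updates

    # Point clients at the local mirror with a .repo file,
    # e.g. /etc/yum.repos.d/local-updates.repo:
    #   [local-updates]
    #   name=Local CentOS updates
    #   baseurl=http://repo.example.com/centos/updates
    #   gpgcheck=1
    #   enabled=1

    # Scheduled yum update of only the critical / public-facing services,
    # e.g. from /etc/cron.weekly/:
    yum -y update openssh-server httpd vsftpd dovecot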


There are many tools that can help with this! In general, the package system and which packages go where are handled by configuration management. These tools usually cover more than just yum and RPMs, though, and will save you time and prevent many headaches!

The tool I'm most familiar with is Puppet, which I use to manage virtually every config in my environment. Here are some Puppet examples for managing yum specifically:

http://people.redhat.com/dlutter/puppet-app.html
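
Something along these lines is a minimal sketch using Puppet's built-in yumrepo and package resource types (the mirror URL and package name are invented for illustration):

    # Point every host at the internal mirror (repo.example.com is a placeholder)
    # and keep the kernel frozen at its installed version.
    yumrepo { 'local-updates':
      descr    => 'Local CentOS updates mirror',
      baseurl  => 'http://repo.example.com/centos/updates',
      enabled  => 1,
      gpgcheck => 1,
      exclude  => 'kernel*',
    }

    # Keep a security-sensitive, public-facing service current from that repo.
    package { 'openssh-server':
      ensure  => latest,
      require => Yumrepo['local-updates'],
    }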

There are a number of configuration management tools currently available; these have pretty big user communities:

  • Cfengine http://cfengine.com/cfengine3
  • Puppet http://puppetlabs.com/puppet/puppet-difference/
  • Chef http://wiki.opscode.com/display/chef/Home (A few of the people I know have recently implemented this and love it)

Implementing one of these in your environment will add years to your life. It reduces the headaches caused by poorly configured systems and makes upgrading/updating easy. Most of these tools also provide some audit-level functionality, which can greatly reduce the time to repair configuration mistakes.

Regarding your question about testing: I've been using a staging environment that we direct some customer load to (usually beta customers or a small subset of production traffic). We usually let this cluster run new code for at least a couple of days, up to a week (depending on the gravity of the change), before we deploy it to production. I've found this setup works best if you try to figure out how long most errors take to discover. In heavily used systems this can be a matter of hours; in most environments I've seen, a week is long enough to discover even uncommon bugs in staging/QA.

One really important part of testing is replicating data/usage. You mentioned you have staging versions of most of your production hardware. Do they also have identical copies of the production data? Can you replay any of the production load against them? Can you even make them part of the production cluster using traffic mirroring? This usually comes down to a direct trade-off against the amount of resources the business is willing to spend on testing/QA. The more testing the better; try not to self-limit (within reason), and see what the business will support (then find a way to do 10% more).