Should redundant servers have exactly the same configuration, or slightly different?
If you provide a service on two servers to ensure high availability, is it better to configure them in exactly the same way, or should you instead introduce slight differences to prevent "freak configuration" errors?
We host a Django-based website on a stack of Linux (Ubuntu LTS), Nginx, Apache and Python WSGI, duplicated on three servers behind a load balancer. Currently they are hosted in the Amazon cloud, but we might move to our own datacenter in the future. We recently had an issue on all three servers that was only solved by upgrading the kernel, which makes us think it was an incompatibility between that specific kernel version and the physical hardware that Amazon might have started using at that point.
This made me think: would it be better to keep all machines on exactly the same configuration (easier management?), or should we instead keep things slightly different, so that an incompatibility between two components only manifests itself on one machine rather than all of them, keeping the website online?
Keep them the same. The chance that you will have some incompatibility that manifests itself only in a certain configuration is minimal, and if you differentiate them you will have to remember the differences for everything you do.
For simplicity they should all be the same config; however, there are occasions (mostly dictated by the software in use) where it's just not possible to load-balance and failover becomes the only option, and in such cases having slightly different configs may be required.
OTOH, for an internet-facing service, availability and security must be high on the list of priorities. Good security means applying patches regularly; good availability means that you cannot patch all the boxes at the same time. Indeed, the practice I adopted for a similar setup was to apply patches to one live machine as soon as they were available and had been applied and briefly evaluated on a test machine, but to delay the rollout to the other nodes for a couple of days until I knew the patches did not have any adverse effect.
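A minimal sketch of that staggered rollout in Python, not the exact process above: it assumes Ubuntu nodes reachable over SSH with passwordless sudo, and the hostnames and canary/rest split are placeholders.

    import subprocess
    import sys

    # Hypothetical hosts; the canary node is patched first, the rest a
    # couple of days later once the patches have shown no adverse effect.
    CANARY = "web1.example.com"
    REMAINING = ["web2.example.com", "web3.example.com"]

    def patch(host):
        # Standard Ubuntu package upgrade, run remotely over SSH.
        subprocess.run(
            ["ssh", host, "sudo apt-get update && sudo apt-get -y upgrade"],
            check=True,
        )

    if __name__ == "__main__":
        stage = sys.argv[1] if len(sys.argv) > 1 else "canary"
        if stage == "canary":
            patch(CANARY)            # day 0: one live node only
        else:
            for host in REMAINING:   # later: roll out to the remaining nodes
                patch(host)

Running it as "python patch.py canary" on day 0 and "python patch.py rest" a few days later keeps the delay between stages a human decision rather than something baked into the script.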
While Sirex is correct that, in a perfect world, you would implement patches on a pre-production cluster and test using traffic and data from the production system, in practice this is far from cost-effective on such a small scale.
Yes, definitely keep them the same. This will help with troubleshooting issues that arise.
Look at Puppet to manage your config file changes. We would store config files in svn and then push our changes: a centralized management server would check out our changes and Puppet would push them to the nodes. This gives you a history of changes, so when you make a mistake you can roll it back fairly seamlessly, and when you have multiple admins the config changes can be tracked.
Ref: http://puppetlabs.com/
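A rough sketch of the deployment step on such a management server, assuming (not taken from the answer above) that the Puppet manifests and config files live in an svn working copy and that the nodes run the Puppet agent; paths and hostnames are placeholders.

    import subprocess

    # Hypothetical svn working copy holding manifests/config files.
    MANIFEST_DIR = "/etc/puppet"
    NODES = ["web1.example.com", "web2.example.com", "web3.example.com"]

    def deploy():
        # Pull the latest committed config changes out of version control.
        subprocess.run(["svn", "update", MANIFEST_DIR], check=True)
        # Trigger an immediate one-off Puppet run on each node instead of
        # waiting for the agents' regular schedule.
        for node in NODES:
            result = subprocess.run(["ssh", node, "sudo puppet agent --test"])
            # --test uses detailed exit codes: 0 = no changes, 2 = changes
            # applied, 4/6 = failures, so only 4/6 count as errors here.
            if result.returncode not in (0, 2):
                print(f"Puppet run failed on {node} (exit {result.returncode})")

    if __name__ == "__main__":
        deploy()

The explicit trigger is optional: the agents can also just pick up the new manifests on their normal schedule, in which case the svn update alone is the deployment, and a bad change is reverted by rolling the working copy back to the previous revision.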