Upgrading Ubuntu remotely: How to minimize the risk of losing the server?

Solution 1:

If hardware does not break, there isn't anything you can't do with a serial console, so that's the way to go:

  • get remote access to a serial console (IPMI serial-over-LAN if the system supports IPMI 2.0 or later, or a null-modem serial cable connected to another system where you'll run minicom)
  • configure GRUB and the Linux kernel to use the serial console
  • redirect the system BIOS interface to the serial port if possible (many server systems can do that)
  • reboot the system and check that you can reach the BIOS (if redirected) and GRUB, see the kernel messages and the init scripts, and log in, all over the serial console
  • run the upgrade
  • cross your fingers
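
For the GRUB and kernel step, here is a minimal sketch of the relevant part of /etc/default/grub on a GRUB 2 system. The port (ttyS0, GRUB unit 0) and the speed (115200) are assumptions, not something taken from the question; IPMI serial-over-LAN is often wired to the second port (ttyS1, unit 1), so check your hardware. Older Ubuntu releases that still use GRUB legacy keep these settings in menu.lst instead.

```shell
# /etc/default/grub (fragment): assumed port and speed, adjust to your hardware

# Send kernel messages to both the local screen and the serial port.
# The last console= entry becomes /dev/console.
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"

# Show the GRUB menu on the local console and on the serial line.
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200 --word=8 --parity=no --stop=1"
```

Run update-grub after editing the file, then reboot. A login prompt on the serial line additionally needs a getty on that port; how that is set up depends on the release's init system.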

Also, install the new system on another disk or partition if at all possible, so you can test it before erasing the old one. I usually do that on two-disk systems: I take one disk out of the mirror, create a new (degraded) mirror on the freed disk, and install there. If everything is OK, I destroy the old mirror, hot-add the 'old' disk to the new mirror, and let it rebuild.
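
The mirror shuffle could look roughly like this, assuming Linux software RAID managed with mdadm. The array names (/dev/md0 for the old mirror, /dev/md1 for the new one), the partitions (/dev/sda1, /dev/sdb1) and the run helper are all hypothetical; substitute your own layout. Because these commands destroy data when pointed at the wrong disk, the sketch only prints them unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: device names are hypothetical. Nothing is executed by default;
# set DRY_RUN=0 to really run the commands.
DRY_RUN="${DRY_RUN:-1}"

# run: print the command in dry-run mode, execute it otherwise
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: pull one disk out of the old mirror and build a new,
# degraded mirror on it ("missing" stands in for the absent second disk).
split_mirror() {
    run mdadm /dev/md0 --fail /dev/sdb1
    run mdadm /dev/md0 --remove /dev/sdb1
    run mdadm --zero-superblock /dev/sdb1
    run mdadm --create /dev/md1 --level=1 --raid-devices=2 missing /dev/sdb1
}

# Step 2: only after the new system has been tested, and while booted from it,
# retire the old mirror and hot-add its disk to the new one.
merge_mirror() {
    run mdadm --stop /dev/md0
    run mdadm --zero-superblock /dev/sda1
    run mdadm /dev/md1 --add /dev/sda1
}

case "${1:-}" in
    split) split_mirror ;;
    merge) merge_mirror ;;
    *)     echo "usage: $0 split|merge" ;;
esac
```

After the merge step, cat /proc/mdstat shows the rebuild progress. Until the rebuild finishes, the new mirror has no redundancy.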

EDIT: I read it's a Dell R710; AFAIK that should have IPMI 2.0. Configure it by running ipmitool locally on the system, then test the serial-over-LAN feature by running ipmitool sol activate (over the lanplus interface) from another system. Bang! You have your serial console. Dells can also redirect the BIOS to the serial console (which IPMI will in turn redirect over serial-over-LAN). You should have done that anyway, to keep access to the system if anything goes really bad. I manage a couple of old Dell PE1425s using null-modem cables with BIOS, GRUB, and system serial consoles, and a couple of Dell R300s the same way but using IPMI serial-over-LAN in place of the actual serial cable.
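
A rough sketch of the ipmitool side, under several assumptions: the BMC uses LAN channel 1, the address 192.0.2.10 and the user name admin are placeholders, and a BMC user with a password and sufficient privileges already exists. The run helper is hypothetical and only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: channel, address and user name are assumptions.
# Nothing is executed by default; set DRY_RUN=0 to really run the commands.
DRY_RUN="${DRY_RUN:-1}"
BMC_CHANNEL=1          # assumed LAN channel
BMC_ADDR=192.0.2.10    # placeholder BMC address (documentation range)
BMC_USER=admin         # placeholder BMC user

# run: print the command in dry-run mode, execute it otherwise
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# On the server itself (needs root and the kernel IPMI driver loaded):
# give the BMC a static address and enable serial-over-LAN.
configure_bmc() {
    run ipmitool lan set "$BMC_CHANNEL" ipsrc static
    run ipmitool lan set "$BMC_CHANNEL" ipaddr "$BMC_ADDR"
    run ipmitool lan set "$BMC_CHANNEL" netmask 255.255.255.0
    run ipmitool sol set enabled true "$BMC_CHANNEL"
    run ipmitool sol info "$BMC_CHANNEL"
}

# From another machine: attach to the serial console over the network.
# ipmitool prompts for the password; type ~. to leave the session.
open_console() {
    run ipmitool -I lanplus -H "$BMC_ADDR" -U "$BMC_USER" sol activate
}

case "${1:-}" in
    configure) configure_bmc ;;
    console)   open_console ;;
    *)         echo "usage: $0 configure|console" ;;
esac
```

The baud rate configured for serial-over-LAN has to match the one used by the BIOS redirection, GRUB and the kernel console, otherwise the session shows garbage.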

Solution 2:

Personally, depending on how important this server is to you (your business, etc.), I'd get my hands on a similar system, reproduce the environment on it, and then upgrade it via SSH while it sits in the same room (or is otherwise physically accessible to you), so you can test your procedure. If you can upgrade that without losing your configuration or your connection, you stand a pretty good chance of being able to upgrade the remote server.

This won't be 100% exact, but it should at least eliminate errors caused by software upgrades, software configuration, alterations and the like, as long as you configure the test system to match your remote server as closely as possible.

EDIT: Another solution is to set up a second server as a failover first. That way, if the server dies, you still have a backup for customers/users until the primary server comes back up. This should alleviate some of the butterflies you're experiencing from having one server so far away. Again, this may be overkill in many circumstances; how much you're willing to spend on keeping the service available after a total failure depends on how important this server is to your company and on the impact downtime would have.

Solution 3:

I think that out-of-band management (I'm most familiar with HP's iLO), or even an IP KVM, would be your best bet.

As Bart mentioned, testing is invaluable if you have the resources (read: a spare similar box or a fellow cluster member).

Finally, (or first, actually) Backups. Tested Backups. Backups you can be proud of...
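
One way to make "tested" concrete is sketched below: archive a directory, restore the archive into a scratch directory, and compare the result with the original. The function name backup_and_verify and the paths in the usage note are hypothetical. The comparison covers file contents only (not ownership or permissions), and the archive should live outside the directory being archived.

```shell
#!/bin/sh
# backup_and_verify SRC_DIR ARCHIVE
# Archives SRC_DIR, restores the archive into a scratch directory and
# compares the restored tree with the original.
# Returns non-zero if any step fails or the trees differ.
backup_and_verify() {
    src="$1"
    archive="$2"
    scratch="$(mktemp -d)" || return 1

    # -C keeps the paths inside the archive relative to SRC_DIR
    tar -czf "$archive" -C "$src" . || { rm -rf "$scratch"; return 1; }

    # Restore into the scratch directory, never over the original
    tar -xzf "$archive" -C "$scratch" || { rm -rf "$scratch"; return 1; }

    # Compare file contents recursively; any difference gives a non-zero status
    diff -r "$src" "$scratch"
    status=$?

    rm -rf "$scratch"
    return "$status"
}
```

For example, backup_and_verify /etc /mnt/backup/etc-before-upgrade.tar.gz && echo "backup verified" before starting the upgrade. A backup that stays on the machine being upgraded does not help if that machine is lost, so copy the archive elsewhere as well.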