Tips to gracefully take over a (UNIX) production server

After months of neglect, e-mail flames and management battles our current sysadmin was fired and handed over "the server credentials" to me. Such credentials consist of a root password and nothing else: no procedures, no documentation, no tips, nothing.

My question is: assuming he left booby traps behind, how do I gracefully take over the servers with as little downtime as possible?

Here are the details:

  • one production server located in a server farm in the basement; probably Ubuntu Server 9.x with grsec patches (rumours I heard the last time I asked the admin)
  • one internal server that contains all internal documentation, file repository, wikis, etc. Again, Ubuntu Server, a few years old.

Assume both servers are patched and up-to-date, so I'd rather not try to hack my way in unless there's a good reason (i.e. that can be explained to upper management).

The production server hosts a few websites (standard Apache-PHP-MySQL), an LDAP server, a Zimbra e-mail suite/server, and, as far as I can tell, a few VMware virtual machines running. No idea what's happening in there. Probably one is the LDAP master, but that's a wild guess.

The internal server has an internal wiki/CMS, an LDAP slave that replicates the credentials from the production server, a few more VMware virtual machines, and backups running.

I could just go to the server farm's admin, point at the server, tell them 'sudo shut down that server please', log in in single-user mode and have my way with it. Same for the internal server. Still, that would mean downtime, upper management getting upset, the old sysadmin firing back at me with 'see? you can't do my job' and other nuisances, and, most importantly, I'd potentially lose a few weeks of unpaid time.

On the other end of the spectrum, I could just log in as root and inch through the server to try to get an understanding of what's happening, with all the risks of triggering surprises left behind.

I am looking for a solution in the middle: try to keep everything running as it is, while understanding what's happening and how, and most importantly avoiding triggering any booby traps left behind.

What are your suggestions?

So far I thought about 'practicing' with the internal server: disconnecting the network, rebooting with a live CD, dumping the root file system onto a USB drive, and loading it into a disconnected, isolated virtual machine to understand the former sysadmin's way of thinking (à la 'know your enemy'). I could pull the same feat with the production server, but a full dump would make somebody notice. Perhaps I could just log in as root, check the crontabs, check .profile for any commands that are launched, dump the lastlog, and whatever else comes to mind.
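
For reference, this is roughly the read-only first pass I have in mind for the 'log in as root and look around' option. It's only a sketch, assuming a standard Ubuntu layout; nothing here should write to the disk:

    # Read-only look around as root; none of this should modify the system.
    # Assumes a standard Ubuntu layout -- adjust paths as needed.

    # Who logged in, and when
    lastlog
    last -a | head -n 50

    # Scheduled jobs: per-user crontabs plus the system-wide locations
    for u in $(cut -d: -f1 /etc/passwd); do
        echo "== $u"; crontab -l -u "$u" 2>/dev/null
    done
    cat /etc/crontab
    ls -la /etc/cron.d /etc/cron.daily /etc/cron.weekly /etc/cron.monthly
    atq                      # pending at(1) jobs are easy to forget

    # Login scripts for root
    cat /root/.profile /root/.bashrc 2>/dev/null

    # What is actually running and listening right now
    ps auxf
    netstat -tulpn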

And that's why I'm here. Any hint, no matter how small, would be greatly appreciated.

Time is also an issue: there could be triggers happening in a few hours, or a few weeks. Feels like one of those bad Hollywood movies, doesn't it?


Solution 1:

As others have said, this looks like a lose-lose situation.

(Starting at the end)

  • Completely new deployment

Of course you can't just take the servers down and let the installer do its magic.

General Process

  • Get budget for a backup server (backup as in storage for the data)
  • create snapshots of the data and place them there before doing anything
  • Get that signed off by management!
  • Gather a list of requirements (is the wiki needed, who is using the VMware instances, ...)
    • From Management and
    • From Users
  • Get that signed off by management!
  • Shut down unlisted services for a week, one service at a time (iptables may be your friend if you want to shut down just the externally reachable services but suspect that an application on the same host might still be using them; see the sketch below this list)
    • No reaction? -> final backup, remove from server
    • Reaction? -> Talk to the users of the service
    • Gather new requirements and get that signed off by management!
  • All unlisted services down for a month and no reaction? -> rm -rf $service (sounds harsh, but what I mean is: decommission the service)
  • get budget for a spare server
  • migrate one service at a time to the spare
  • get that signed off by management!
  • shut down the migrated server (power off)
  • find out whether more people come screaming at you -> yay, you just found the leftovers
  • gather new requirements
  • start up again and migrate services
  • repeat the last 4 steps until nobody has come after you for a month
  • redeploy the server (and get that signed off by management!)
  • rinse and repeat the whole process.
    • the redeployed server is your new spare
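
For the 'shut it down at the firewall' step above, something along these lines works. It's only a sketch: eth0 and port 8080 are placeholders for the real external interface and service port.

    # Block external access to a suspected-unused service without stopping it,
    # and log the hits so you can see who still relies on it.
    # Traffic from applications on the same host (loopback) is left alone.
    iptables -I INPUT 1 -i eth0 -p tcp --dport 8080 -j LOG --log-prefix "unlisted-svc: "
    iptables -I INPUT 2 -i eth0 -p tcp --dport 8080 -j REJECT

    # A week later: is anybody still knocking?
    grep "unlisted-svc:" /var/log/syslog

    # Roll back if a legitimate user turns up
    iptables -D INPUT -i eth0 -p tcp --dport 8080 -j REJECT
    iptables -D INPUT -i eth0 -p tcp --dport 8080 -j LOG --log-prefix "unlisted-svc: "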

What did you gain?

  • Inventory of all services (for you and management)
  • Documentation (after all, you need to write something down for management; why not do it properly and produce something useful for both you and them)

Been there, done that; it's no fun at all :(

Why do you need to get it signed off by management?

  • Make the problems visible
  • Be sure you won't get fired
  • Opportunity to explain risks
    • It's fine if they don't want you to do it, but after all it's their decision to make, once they have enough input to judge whether the investment is worth it.

Oh, and present the overall plan to them before you start, with some estimates about what will happen in the worst and best case.

It will cost a lot of time, with or without redeployment, if you don't have documentation. There's no need to even think about backdoors: IMHO, if you don't have documentation, a rolling migration is the only way to reach a sane state that will deliver value for the company.

Solution 2:

Do you have reason to believe that the previous admin left something bad behind, or do you just watch a lot of movies?

I'm not asking to be facetious; I'm trying to get an idea of what sort of threat you think is there and how probable it is. If you think the chances really are high that some sort of seriously disruptive problem exists, then I'd suggest treating it as if it were a successful network intrusion.

In any case, your bosses don't want the disruption of downtime while you deal with this. Ask what their attitude is to planned downtime to tidy the systems up versus unplanned downtime if there is a fault in the system (whether a genuine fault or a rogue admin), and whether that attitude is realistic given your own assessment of the probability that you really do have a problem here.

Whatever else you do, consider the following:

Take an image of the systems right now, before you do anything else. In fact, take two: put one aside and don't touch it again until you know what, if anything, is happening with your systems. That copy is your record of how the systems were when you took them over.
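
One way to take those images, assuming you can get a short maintenance window to boot the box from a live CD (the device, hostname and paths below are placeholders):

    # From a live CD on the server, with the system disk left unmounted.
    # /dev/sda and backup.example.com are placeholders.

    # Record a checksum first, so you can later prove the image is untouched
    sha1sum /dev/sda > sda.sha1

    # Stream a raw image to a backup host; add conv=noerror,sync to dd if the
    # disk has bad sectors (the checksum will then no longer match exactly)
    dd if=/dev/sda bs=4M | gzip -c | \
        ssh backup.example.com 'cat > /srv/images/prod-sda.img.gz'

    # Verify the copy against the recorded checksum
    ssh backup.example.com 'gunzip -c /srv/images/prod-sda.img.gz | sha1sum'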

Restore the "2nd" set of images to some virtual machines and use these to probe what is going on. If you're worried about things being triggered after a certain date then set the date forward a year or so in the virtual machine.

Solution 3:

First of all, if you're going to invest extra time in this, I'd advise you to actually get paid for it. It seems you've accepted unpaid overtime as a fact, judging from your words; it shouldn't be that way, in my opinion, and especially not when you're in such a pinch because of someone else's fault (be it management, the old sysadmin or, probably, a combination of both).

Take the servers down and boot into single-user mode (init=/bin/sh, or appending 1 to the kernel line at the GRUB prompt) to check for commands that run on root's login. Downtime is necessary here; make it clear to management that there's no choice but to accept some downtime if they want to be sure they get to keep their data.
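
These are the sorts of things to read from that single-user shell, before ever logging in normally (a sketch; the root filesystem is typically still mounted read-only at this point, which is what you want):

    # From the init=/bin/sh shell. Keep the root filesystem read-only while
    # you look around (it usually already is at this stage).
    mount -o remount,ro /

    # Everything executed on root's interactive login
    cat /root/.profile /root/.bash_profile /root/.bashrc /root/.bash_login 2>/dev/null
    cat /etc/profile /etc/bash.bashrc
    ls -la /etc/profile.d/

    # PAM can also run arbitrary commands at login (pam_exec and friends)
    grep -r pam_exec /etc/pam.d/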

Afterwards, look over all the cron jobs, even if they look legit. Also perform full backups as soon as possible, even if this means downtime. You can turn your full backups into running VMs if you want.

Then, if you can get your hands on new servers or capable VMs, I would actually migrate the services to new, clean environments one by one. You can do this in several stages so as to minimize perceived downtime. You'll gain much-needed in-depth knowledge of the services while restoring your confidence in the base systems.

In the meantime you can check for rootkits using tools such as chkrootkit. Run Nessus against the servers to look for security holes that the old admin might use.
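
A minimal chkrootkit run could be as simple as the following; ideally do it from a known-clean environment, since the binaries on the host itself may have been tampered with:

    # Install and run chkrootkit; treat any hit as a lead to investigate,
    # not as proof -- it is known to produce false positives.
    apt-get install chkrootkit
    chkrootkit | grep -v 'not infected'
    # (some versions support 'chkrootkit -r /mnt' to scan a filesystem
    #  mounted from a live CD instead of the running system)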

Edit: I guess I didn't address the "gracefully" part of your question as well as I could. The first step (going into single-user mode to check for login traps) can probably be skipped: the old sysadmin giving you the root password while setting up the login to do rm -rf / would be pretty much the same as deleting all the files himself, so there's probably no point in it. As for the backup part: try an rsync-based solution so you can do most of the initial backup online and minimize downtime.
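
An rsync-based approach could look like this: one long first pass with everything running, then a short second pass in a maintenance window that only copies the deltas. The backup host and paths are placeholders.

    # First pass online: copies the bulk of the data while services keep running.
    # -A/-X need a reasonably recent rsync on both ends; drop them if unsupported.
    rsync -aHAX --numeric-ids --delete \
        --exclude=/proc --exclude=/sys --exclude=/dev --exclude=/tmp \
        / backup.example.com:/srv/backup/prod-root/

    # Second pass during a short maintenance window (services stopped):
    # only the changes since the first pass, so it finishes quickly.
    rsync -aHAX --numeric-ids --delete \
        --exclude=/proc --exclude=/sys --exclude=/dev --exclude=/tmp \
        / backup.example.com:/srv/backup/prod-root/

Note that the data files of a running MySQL copied this way may be inconsistent; dump the databases separately (mysqldump) or make sure MySQL is stopped for the final pass.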