What sysadmin things should every programmer know?

I'd start with:

  1. Always have a backup system of some kind. Even better if it has a history.
  2. Consider single points of failure and how to deal with them should they fail.
  3. Depending on the number of computers involved, look into a way to create and deploy a standard image across machines. It will make everyone's life easier - no "it works on mine" because someone has such and such a program not normally installed.
  4. Document everything, if only because you will forget how you set something up.
  5. Keep abreast of security updates.

<insert big post disclaimer here>

Some of these have been said before, but they're worth repeating.

Documentation:

  • Document everything. If you don't have anywhere to put it, install an under-the-radar wiki, but make sure you back it up. Start off by collecting facts, and one day, a big picture will form.

  • Create diagrams for each logical chunk and keep them updated. I couldn't count the number of times an accurate network map or cluster diagram has saved me.

  • Keep build logs for each system, even if it's just the copy-and-pasted commands used to build it.

  • When building your system, install and configure your apps, test that it works and perform your benchmarking. Now, wipe the disks. Seriously. 'dd' the first megabyte off the front of the disks or otherwise render the box unbootable (a sketch of this step follows this list). The clock is ticking: prove your documentation can rebuild it from scratch (or, even better, prove your colleague can with nothing more than your documentation). This will form half of your Disaster Recovery plan.

  • Now that you have the first half of your Disaster Recovery plan, document the rest: how to get your application's state back (restore files from tape, reload databases from dumps), vendor/support details, network requirements, how and where to get replacement hardware -- anything you can think of that will help get your system back up.
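
As a hedged sketch of the wipe step above, assuming a Linux box whose boot disk is /dev/sda (the device name is an assumption -- verify it before running anything this destructive):

    # DESTRUCTIVE: renders the box unbootable on purpose.
    lsblk                                      # confirm which device really is the boot disk
    dd if=/dev/zero of=/dev/sda bs=1M count=1  # zero the first MB (partition table + boot loader)
    sync
    reboot                                     # the box should now fail to boot

    # The clock is ticking: rebuild from your documentation alone, then re-run
    # the same tests and benchmarks to prove the rebuild really is equivalent.

If wiping real hardware is a step too far, running the same exercise against a clone or a VM snapshot still proves most of your documentation.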

Automation:

  • Automate as much as you can. If you have to do something three times, make sure the second time is spent developing your automation so the third is fully automated. If you can't automate it, document it. There are automation suites out there - see if you can make them work for you.
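
As an illustration of the "third time is automated" rule, here's the sort of small shell script repeated manual work tends to turn into; the hostnames and the task itself (a disk-space check) are made up for the example:

    #!/bin/sh
    # check_disk.sh -- a repeated manual check turned into a script.
    # The host list is illustrative; substitute your own inventory.
    set -u

    for host in web01 web02 db01; do
        printf '== %s ==\n' "$host"
        ssh "$host" 'df -h /var' || printf 'WARNING: could not reach %s\n' "$host"
    done

Once something like this exists, wiring it into cron or one of those automation suites (cfengine, Puppet and friends) is a much smaller step than starting from a blank page.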

Monitoring:

  • Application instrumentation is pure gold. Being able to watch transactions passing through the system makes debugging and troubleshooting so much easier.

  • Create end-to-end tests that prove not only that the application is alive, but that it really does what it's supposed to. Points are yours if it can be jacked into the monitoring system for alerting purposes (a minimal sketch of such a check follows this list). This serves double duty; aside from proving the app works, it makes system upgrades significantly easier (monitoring system reports green, upgrade worked, time to go home).

  • Benchmark, monitor and collect metrics on everything where it's sane to do so. Benchmarks tell you when to expect something will let out the magic smoke. Monitoring tells you when it has. Metrics and statistics make it easier to get new kit (with fresh magic smoke) through management.

  • If you don't have a monitoring system, implement one. Bonus points if you actually do jack the above end-to-end tests into it.
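
A minimal sketch of such an end-to-end check, written so it can be jacked straight into a Nagios-style monitoring system (exit 0 = OK, exit 2 = critical); the URL and the expected marker string are placeholders for your own application:

    #!/bin/sh
    # check_app_e2e.sh -- exercise the application, not just the port.
    # URL and EXPECTED are placeholders for your own app's self-test transaction.
    URL='http://app.example.com/orders/selftest'
    EXPECTED='ORDER-TEST-OK'

    RESPONSE=$(curl -sS --max-time 10 "$URL") || { echo "CRITICAL: request failed"; exit 2; }

    if printf '%s' "$RESPONSE" | grep -q "$EXPECTED"; then
        echo "OK: end-to-end transaction completed"
        exit 0
    else
        echo "CRITICAL: application answered but the transaction did not complete"
        exit 2
    fi

The same script doubles as the "monitoring reports green, upgrade worked, time to go home" test.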

Security:

  • "chmod 777" (aka grant all access/privileges) is never the solution.

  • Subscribe to the 'least bit' principle; if it's not installed, copied or otherwise living on the disk, it can't get compromised. "Kitchen sink" OS and software installs may make life easier during the build phase, but you end up paying for it down the track.

  • Know what every open port on a server is for. Audit them frequently to make sure no new ones appear (one way to do this is sketched after this list).

  • Don't try to clean a compromised server; it needs to be rebuilt from scratch. Rebuild to a spare server with freshly downloaded media, restoring only the data from backups (as the binaries may be compromised), or clone the compromised host to somewhere isolated for analysis so you can rebuild on the same kit. There's a whole legal nightmare around this, so err on the side of preservation in case you need to pursue legal avenues. (Note: IANAL.)
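
One way to audit those open ports, sketched with standard Linux tools; the baseline file location is an assumption, older boxes would use netstat instead of ss, and an external nmap scan is a good complement since it sees what the network sees:

    #!/bin/sh
    # audit_ports.sh -- compare currently listening sockets against a known-good baseline.
    # The baseline path is illustrative.
    BASELINE=/etc/port-baseline.txt
    CURRENT=$(mktemp)

    # Listening TCP/UDP sockets: keep protocol and local address, sorted for a stable diff.
    ss -lntu | awk 'NR>1 {print $1, $5}' | sort > "$CURRENT"

    if [ ! -f "$BASELINE" ]; then
        echo "No baseline found; recording the current state as the baseline."
        cp "$CURRENT" "$BASELINE"
    elif ! diff -u "$BASELINE" "$CURRENT"; then
        echo "WARNING: listening ports have changed since the baseline was recorded."
    fi

    rm -f "$CURRENT"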

Hardware:

  • Never assume anything will do what it says on the box. Prove it does what you need, just in case it doesn't. You'll find yourself saying "it almost works" more frequently than you'd expect.

  • Do not skimp on remote hardware management. Serial consoles and lights-out management should be considered mandatory (an example of what they buy you follows the aside below). Bonus points for remotely-controlled power strips for those times when you're out of options.

(Aside: There are two ways to fix a problem at 3am, one involves being warm, working on a laptop over a VPN in your pyjamas, the other involves a thick jacket and a drive to the datacenter/office. I know which one I prefer.)
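
The warm, pyjamas-and-VPN option usually looks something like this hedged IPMI example; the BMC hostname and credentials are placeholders, and your kit may instead speak iLO, DRAC, ALOM or similar:

    # Check and cycle power on a wedged box from home, via its management controller.
    # Hostname and credentials below are placeholders.
    ipmitool -I lanplus -H bmc-web01.example.com -U admin -P 'secret' chassis power status
    ipmitool -I lanplus -H bmc-web01.example.com -U admin -P 'secret' chassis power cycle

    # Watch the console over serial-over-LAN while it comes back up.
    ipmitool -I lanplus -H bmc-web01.example.com -U admin -P 'secret' sol activate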

Project management:

  • Involve the people that will be maintaining the system from day one of the project lifecycle. The lead times on kit and brain time can and will surprise you, and there's no doubt they will (should?) have standards or requirements that will become project dependencies.

  • Documentation is part of the project. You'll never get time to write the whole thing up after the project has been closed and the system has moved to maintenance, so make sure it's included as effort on the schedule at the start.

  • Build planned obsolescence into the project from day one, and start the refresh cycle six months before the switch-off day you specified in the project documentation.

Servers have a defined lifetime during which they are suitable for use in production. The end of this lifetime is usually reached either when the vendor starts to charge more in annual maintenance than it would cost to refresh the kit, or at around three years, whichever comes first. After this time, they're great for development / test environments, but you should not rely on them to run the business. Revisiting the environment at 2 1/2 years gives you plenty of time to jump through the necessary management and finance hoops for new kit to be ordered and to implement a smooth migration before you send the old kit to the big vendor in the sky.

Development:

  • Ensure your development and staging systems resemble production. VMs or other virtualisation techniques (zones, LDOMs, vservers) make real-world-in-every-sense-but-performance production clones easy.
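
As a hedged example of how cheap a production-like clone can be, here's the libvirt/KVM flavour; the domain names are placeholders, and zones, LDOMs and the rest have their own one-or-two-command equivalents:

    # Clone a production-template VM into a fresh staging instance (names are placeholders).
    virt-clone --original prod-template --name staging-01 --auto-clone
    virsh start staging-01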

Backups:

  • Data you're not backing up is data you don't want. This is an immutable law. Make sure your reality matches this.

  • Backups are harder than they look; some files will be open or locked, whereas others need to be quiesced to have any hope of recovery, and all of these issues need to be addressed. Some backup packages have agents or other methods to deal with open/locked files, other packages don't. Dumping databases to disk and backing those up counts as one form of "quiescing", but it's not the only method.

  • Backups are worthless unless they're tested. Every few months, pull a random tape out of the archives, make sure it actually has data on it, and that the data is consistent.
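
A sketch of that test for the simple case of a tar archive with checksums recorded at backup time; the paths are placeholders, and tape users would front this with their own mt/tar-off-tape incantation:

    #!/bin/sh
    # verify_backup.sh -- prove a backup can actually be restored, not just that it exists.
    # Archive and checksum-list paths are placeholders; assumes the checksum list and
    # the archive use the same relative file paths.
    set -eu
    ARCHIVE=/backups/home.tar.gz
    CHECKSUMS=/backups/home.sha256    # sha256sum output taken at backup time
    WORKDIR=$(mktemp -d)
    trap 'rm -rf "$WORKDIR"' EXIT

    # 1. The archive reads cleanly end to end.
    tar -tzf "$ARCHIVE" > /dev/null

    # 2. A randomly chosen file restores and still matches its backup-time checksum.
    SAMPLE=$(shuf -n 1 "$CHECKSUMS" | awk '{print $2}')
    tar -xzf "$ARCHIVE" -C "$WORKDIR" "$SAMPLE"
    ( cd "$WORKDIR" && grep -F "  $SAMPLE" "$CHECKSUMS" | sha256sum -c - )

    echo "Backup verification passed for $SAMPLE"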

And most importantly...

Pick your failure modes, or Murphy will... and Murphy doesn't work on your schedule.

Design for failure, document each system's designed weak points, what triggers them and how to recover. It'll make all the difference when something does go wrong.