What *exactly* gets screwed when I kill -9 or pull the power?

Pulling the power causes everything to stop in flight, with no warning. kill -9 has the same effect on a single process, forcefully terminating it with a SIGKILL.

If a process is killed by the kernel or by a power outage, it doesn't get to do any clean-up. That means you could have half-written files, inconsistent state, or lost caches. You usually don't have to worry about any of this because of journaling, exit statuses, and battery backup.

Temporary files in /tmp will be gone automatically after a power loss if /tmp is on tmpfs (it lives in RAM), but you may still have application-specific lock files lying around to remove, like the lock and .parentlock for Firefox.
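
As a rough illustration of cleaning up a leftover lock, here is a minimal sketch. It assumes a hypothetical lock file that contains nothing but the owner's PID; that is a common convention, but it is not Firefox's format, and the function names are made up for this example.

```python
import os

def pid_is_alive(pid):
    """Best-effort check for a running process with this PID (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True

def remove_if_stale(lock_path):
    """Remove lock_path if the PID stored in it is not running.

    Returns True if the lock was removed, False otherwise.
    """
    try:
        with open(lock_path) as f:
            pid = int(f.read().strip())
    except FileNotFoundError:
        return False  # no lock to clean up
    except ValueError:
        return False  # not a PID-style lock; leave it alone
    if pid_is_alive(pid):
        return False  # the owner is still running
    os.remove(lock_path)
    return True
```

PIDs get reused, so a check like this can be fooled after a reboot; treat it as a convenience, not a guarantee.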

Most software is smart enough to retry a transaction if it doesn't record a successful exit status. A good example of this is a typical mail system. If a message is being delivered, but gets cut off in the middle, the sender will retry later until it gets a success.
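
The retry idea can be sketched roughly like this. The `send` callable and the back-off numbers are placeholders for illustration, not how any particular mail server behaves.

```python
import time

def deliver_with_retry(send, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it reports success or the attempts run out.

    send is a hypothetical callable that returns True on success.
    An interrupted attempt (False or OSError) is simply tried again later.
    """
    for attempt in range(attempts):
        try:
            if send():
                return True
        except OSError:
            pass  # treat a dropped connection like any other failure
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # wait longer after each failure
    return False
```

The important part is that the sender only stops once it has seen an explicit success, so a delivery that was cut off halfway counts as "not done yet".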

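Your filesystem is probably journaled. A journaling filesystem records each intended change in a log before applying it, so if a move or write dies mid-stream, the next mount can replay or discard the incomplete operation instead of leaving the filesystem structures half-updated. Many journals cover only metadata by default, so the contents of a file that was being written can still be incomplete. Copy-on-write filesystems go further and make changes non-destructively: they leave the old copy in place and only reference the new copy as a last step, before reclaiming the space the old copies occupied on disk.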

Now if you have a RAID array, it has all kinds of memory buffers to increase performance and provide reliability in a power failure. Most likely your filesystem will not know about the caches in the device and their state, so it thinks a change has been committed to disk, but it is still in the RAID cache somewhere. So what happens when the power dies? Hopefully you have a functional battery in your RAID enclosure and you monitor it. Otherwise you have a corrupt file system to fsck.

Yes, a few bits can become corrupted in a binary, but I would not worry about that much on modern hardware. If you are really paranoid, you can monitor the health of your disks and RAID with the appropriate tools, but you should be doing that anyway. Do regular backups and get an Uninterruptible Power Supply.


In an unexpected shutdown, the only files which should be corrupted are files which are open for writing. On most systems at any given instant in time, you're probably not writing to a file. Probably.

1 kill -9

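sends SIGKILL, which POSIX defines as a signal that cannot be caught, blocked, or ignored; how the system tears the process down is implementation dependent. The process that receives this signal is not given an opportunity to handle it, so none of its own clean-up code runs.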
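
To see the difference from a catchable signal, here is a small sketch for a POSIX system. The child script and the helper names are invented for the demonstration.

```python
import signal
import subprocess
import sys

def can_install_handler(signum):
    """Return True if this process is allowed to install a handler for signum."""
    try:
        previous = signal.signal(signum, lambda s, f: None)
    except (OSError, ValueError):
        return False  # the OS (or Python) refuses, as it does for SIGKILL
    if previous is not None:
        signal.signal(signum, previous)  # put the old handler back
    return True

# Hypothetical child: exits cleanly on SIGTERM, otherwise sleeps.
CHILD = r"""
import signal, sys, time
def on_term(signum, frame):
    sys.exit(0)  # clean-up code would run here
signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
time.sleep(60)
"""

def exit_code_after(signum):
    """Start the child, send it signum, and return its exit code."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
    proc.stdout.readline()  # wait until the SIGTERM handler is installed
    proc.send_signal(signum)
    code = proc.wait(timeout=10)
    proc.stdout.close()
    return code
```

With SIGTERM the child's handler runs and it exits with status 0. With SIGKILL the handler never runs, and `subprocess` reports the exit code as the negated signal number.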

1 Power off

depends on the hardware. The heads auto-park using the spindle's momentum, and everything in your write cache loses DRAM refresh and decays to irretrievable corruption within seconds. The same happens to your system memory, CPU cache, registers, etc.

From wdc.com (google: site:wdc.com Protective Head Parking )

Power is lost: Hard drive is reset. Head is parked in the landing zone using spindle energy. Spindle motor stopped.

2 - what can go wrong

Files left open for writing may be incompletely written out, so a file that was being written at that moment can end up with corrupted data. File writes on modern hardware are fast, and modern PCs are not normally stressed with I/O. It's like walking blindfolded across a quiet country road: most of the time, you'll be fine.

3 - countermeasures

See above for what disks do.

Look up journaled file systems, they're normal now: http://en.wikipedia.org/wiki/Journaling_file_system

Software like MS Word or vi will write to a temporary file rather than the original. The objective is to never leave the system in a state where there is no consistent copy on disk.
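
A minimal sketch of that write-to-a-temporary-file pattern follows. The function name is made up, and this is not a claim about what Word or vi do internally; it just shows the general technique on a POSIX-style filesystem.

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace the contents of path with data (bytes) without ever
    leaving a half-written file under the final name."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the data to the disk
        os.replace(tmp_path, path)  # swap the new copy in as the last step
    except BaseException:
        try:
            os.unlink(tmp_path)  # don't leave the temp file behind on failure
        except FileNotFoundError:
            pass
        raise
```

If the process dies before `os.replace`, the original file is untouched and only a stray temp file remains. For extra safety against power loss, some programs also fsync the containing directory afterwards; that step is left out here.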

Windows keeps copies of the registry (it's just too important). Wikipedia: "Windows 2000 keeps an alternate copy of the registry hives (.ALT) and attempts to switch to it when corruption is detected". (I haven't done heavy tech support since Win2k, so I'm not sure what MS's new mechanisms are.)

4 - what to do

In order of difficulty (easy-hard)

  • Keep backups
  • Check what you were last working on
  • Boot from a separate disk and look for last modified dates/times to figure out what the system might have been doing at the time of the crash
  • Boot from a separate disk and compare md5sums of all your files against an offline copy.
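
The checksum comparison in the last item could be sketched like this. It assumes you built a manifest of known-good MD5 sums earlier (for example from the offline copy); the function names are only for illustration.

```python
import hashlib
import os

def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = md5_of(full)
    return manifest

def changed_files(root, known_good):
    """List files from known_good that are missing or differ under root."""
    current = build_manifest(root)
    return sorted(p for p in known_good if current.get(p) != known_good[p])
```

Run `build_manifest` on the offline copy, then `changed_files` on the suspect disk while booted from separate media. MD5 is fine for spotting accidental corruption, though not for detecting deliberate tampering.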

Keeping backups is the most appropriate answer; good backups should let you go back to the previously modified version.

5

Redundant power? End-user education? Put tape and cardboard over the power button?

6

Short of hardware malfunctions, corrupted disk drivers, a broken OS kernel, an absence of checksums or crashes during upgrades, binaries and libraries are not opened read-write so they don't get corrupted. It happens, but it's rare.