Why have 5 different iMacs developed corrupt OS X partitions when the physical drives are fine?
I'm a tech for my local school district and we are having some problems with our iMac Multimedia Lab. Over the past nine months, 5 of the 22 iMacs have suffered from filesystem corruption. The only recourse for these machines has been to completely wipe out the OS X partition and start fresh (with appropriate restore from Time Machine.)
Here's the list for why I know it's file system corruption:
The iMac will not boot OS X. I have seen it stop at the "throbber", the progress bar, or just the Apple logo.
Mounting the iMac disk via Target Disk Mode (love that feature) succeeds, but only for the Bootcamp partition. The OS X partition fails to mount.
A verify of the disk reveals the OS X partition needs to be repaired (I've seen invalid sibling entries, orphaned children etc.). Attempting to repair the disk fails. This latest attempt (just yesterday) said that the catalog B trees could not be rebuilt. I should have made more complete notes on what was said each time, but each time until this last one I assumed it was an odd one-in-a-million kind of event. A fluke
Attempting to repair or rebuild the drives with Drive Genius 3 also fails So far 98% of the user's files have been recoverable via Data Rescue 3 The physical hard drive does not appear to be failing (retrieving files from the drive does not hang or "beachball", the drive does not appear and disappear in Disk Utility, Finder remains responsive, etc.)
Normally I'd chalk it up to a series of bad drives. Just happened to be the unlucky guy who purchased a bad run of iMacs, right? Here's where things start to get interesting. I submit to you, the list of oddities:
The drives verify as good via SMART
- The RAM checks out
- After deleting and re-creating the partition (and re-installing OS X) all problems disappear.
- The corruption has not happened to the same Mac twice
- Bootcamp is installed on the same drive and functions before, after, and during the corruption on the Mac side.
- The Bootcamp partition has not had this issue on ANY of the iMacs
Also, to rule out the obvious:
There have been no brownouts or surges
We seriously doubt a virus, as the malfunctions appear anywhere from simultaneously (two machines went down at the same time about a month ago) to months apart. Plus, the user's documents are restored after reformat, so one would surmise that if it were a malicious program the Mac would keep failing again and again.
The machines have been in a climate-controlled area
It has not been the same user affected
Sometimes the problem occurs after an unavoidable hard shutdown (which occurs only infrequently. These machines are not being excessively powered down improperly. Only what you would expect with a Mac Lab running multimedia five days a week), other times it is completely out-of-the-blue
Frequently used software includes:
- iPhoto
- iDVD
- iMovie
- Safari
The machines are also loaded with Parallels 5, which loads the Bootcamp partition into a VM. Parallels was setup via the standard wizard, no oddball configuration or hacks.
And last but not least, the specs:
- iMac 10,1 (21.5 inch)
- Stock drives
- OS X Snow Leopard (latest updates)
- Stock memory
- Joined to our Active Directory infrastructure
- HFS+ file system (not case sensitive, the default for OS X Snow Leopard)
- No out-of-the-ordinary drive maint. programs. Drive Genius was loaded yesterday afternoon (AFTER recovering from the latest failure) to run a verification on all iMacs, but was not installed prior. All Macs, both those that have failed in the past and those that have never failed, passed with flying colors.
TL;DR: The OS X partition has become corrupted on five different iMacs, but the physical drives are fine. WHY!?!?!
HFS Plus (HFS+) is a fragile and a little outdated filesystem. If you google it, you'll find many reports of filesystem corruption.
Rebooting without unmounting the filesystem is the best way to corrupt it. This happens when the mac freezes for some reason (in my case it's the nvidia video card) or power failures.
Here are some tips, that IMHO should lower the chance of filesystem corruption:
When system freezes, try rebooting from ssh. When the graphics subsystem of my mac freeses, it's still accessible via SSH - try opening ssh connection from your network and reboot it. You could use Apple Remote Desktop (€62) for this task. You should enable ssh access first.
Do
diskutil verifyVolume /
periodically. Yes, even if HFS+ is a journaled filesystem, corruption is possible. You could use Apple Remote Desktop to run this on all classroom computers at once.Use multiple volumes. Using multiple volumes should reduce chance for corruption. Splitting
/
from/Users/
should make restoration easier (either / or /Users will be corrupted). Note that this probably could complicate things with Bootcamp.Mount partition(s) with options, which reduce writing. Mounting partitions with
noatime
option should reduce writing on it. By default every time a file is accessed, it's access timestamp is "touched".Make sure there are no attempts to mount HFS+ partition from other os-es. Is it possible that someone is starting a linux distro from usb/dvd and mounting
/
in rw mode or playing with journal settings?
Hope my answer is helpful.
PS: corruption usually is gradual not sudden. There is a possibility that something specific is causing this, software or workflow. My mind is at Parallels 5, but it should corrupt the bootcamp volume, not the MacOS one. Searching their KB does not reveal anything useful.
PPS: it is fragile because it has no actual system to correct corruption within a file. A journal records transfers and attempts to recopy data in order to return the filesystem to a consistent state but if the file lost is vital (like actual filesystem structure data) then there is no recourse. In fact, because the Catalog File (which is lists all the logical data information) is stored as a file, if it corrupted in certain places your entire filesystem is rendered useless garbage data, or partially tended garbage in the event that it is corrupted and the a journal replay occurs which causes it to restructure the filesystem in a way that is not consistent with the data (e.g. file a and b are 1MB and 2MB respectively but the replay changes them to be 2MB and 1MB resulting in half of the contents of B being inside A).
Things that could do it off the top of my head...
you said you haven't had power surges or brownouts. How are you confirming it? We had a classroom where PC power supplies were blowing seemingly at random. We had to have maintenance staff connect a monitoring meter to the circuit and discovered that outlet is having huge voltage spikes.
Memory isn't seated properly and corrupting data.
Drive cables loose.
marginal hard disks that have a bad set of sectors but not bad enough that it's triggering alerts or scans for bad sectors.
Something in the Windows side via bootcamp is modifying the drive in a way the drive doesn't like. Copy protection? Drive utilities?
You said it's in a lab. What are the students running? Are you monitoring or locking down what can be executed that could be doing it?
You've said this seems to be random, no two machines having this happen in a row. This would lead me to suspect that either a student or group of students are causing it or there is a random power issue in the lab causing it. Is there a way of tracking who last used the machines to see if this issue seems to magically follow one of your users?
Have you considered a periodic check of the machines? You could easily schedule weekly fsck verification passes (until you figure out why the corruption is happening) and then monthly to keep a tab on things.
With a journaled file system, it takes some repeated bad treatment for macs to degrade to the point of not booting. Even bad software doesn't write to the system side of booting, so I would suspect something is clearly amiss. On macs that get shut down cleanly and get attention the whenever minor filesystem errors are repaired (any time a mac restarts and fsck isn't running in preen mode is a sign of trouble on the horizon).
With a deployment of 25 macs, you can easily spend some time being proactive about file system checks and seeing which are not powering down cleanly by setting up a syslog server or other centralized auditing system.