To fsck or not to fsck after 180 days
By default, after 180 days or a certain number of mounts, ext2/ext3 filesystems on most Linux distributions force a file system check (fsck) at boot. Of course, this can be turned off with, for example, tune2fs -c 0 -i 0 on ext2 or ext3.
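For reference, here is roughly how you can inspect and change those settings with tune2fs (the device name below is only an example; substitute your own):

    # Show the current mount-count and interval settings (ext2/ext3)
    tune2fs -l /dev/sda1 | grep -E 'Mount count|Maximum mount count|Check interval|Last checked'

    # Disable both the mount-count-based and the time-based checks
    tune2fs -c 0 -i 0 /dev/sda1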
On small filesystems this check is merely an inconvenience, but on larger filesystems it can take many hours to complete. When your users depend on that filesystem for their productivity, say it is serving their home directories via NFS, would you disable the scheduled file system check?
I ask this question because it is currently 2:15am and I'm awaiting a very long fsck to complete (ext3)!
Solution 1:
The 180-day default fsck time is a workaround for the design flaw that ext3 does not support an online consistency check. The real solution is to find a filesystem that supports this. I don't know if any mature filesystem does. It's a real tragedy. Perhaps btrfs will save us one day.
I've responded to the issue of the surprise multi-hour downtime from fsck by doing scheduled reboots with a full fsck as part of standard maintenance. This is better than running into minor corruption during production hours, and having it turn into a real outage.
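If you go this route, here is a rough sketch of forcing the check at a planned reboot on a sysvinit-era system (the device name is only an example, and which option works depends on your distribution):

    # Option 1: ask init scripts to fsck all filesystems on the next boot
    touch /forcefsck
    shutdown -r +5 "Scheduled maintenance reboot with fsck"

    # Option 2: bump the mount count above the maximum so e2fsck runs at boot
    # (only useful if the mount-count check has not been disabled with -c 0)
    tune2fs -C 999 /dev/sda1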
A big part of the problem is that ext3 has an unreasonably slow fsck. Although xfs has a much faster fsck, it uses too much memory for distributions to encourage xfs by default on large filesystems. Still, on most systems this is a non-issue. Switching to xfs would at least allow for a reasonably fast fsck. This may make running fsck as part of normal maintenance easier to schedule.
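For what it's worth, xfs does not really have a boot-time fsck in the ext3 sense (fsck.xfs is a no-op); the offline consistency check and repair is done with xfs_repair. A minimal sketch, with example device and mount point names:

    # The filesystem must be unmounted first
    umount /home

    # Read-only consistency check; no changes are made
    xfs_repair -n /dev/sda1

    # Actual repair, if the check reported problems
    xfs_repair /dev/sda1

    mount /home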
If you're running RedHat and considering xfs, be aware that they strongly discourage its use, and that there are probably few people using xfs on the kernel you're running.
My understanding is that the ext4 project has a goal of at least somewhat improving the fsck performance.
Solution 2:
I would say this is just one more reason why production servers should never run all alone: they should always have either a hot/cold standby or be part of a two-node cluster. In these days of virtualization, you can easily have a physical main server and a virtual one that is simply a copy of the physical machine, refreshed every X days and ready to take over.
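As a minimal sketch of the "copy every X days" idea, assuming the standby host is reachable over ssh (the hostname, path, and schedule here are made up for illustration):

    # Cron entry on the primary: refresh the standby copy every 7 days at 03:00
    # (add via crontab -e)
    0 3 */7 * * rsync -aH --delete /srv/www/ standby.example.com:/srv/www/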
Other than this not-so-helpful answer, I would say you should weigh the importance of your data... If this is just a cluster node, skip the check. If it's a client's web server with no backups, you may want to plan ahead next time :-)
Solution 3:
It depends. For instance, we had one server running a QMail stack go down for routine maintenance. QMail creates and deletes a lot of files as time goes on, and it was a very busy mail server. The fsck took some 36 hours. It's not like we saved a helluva lot of performance out of the deal, and ultimately I suppose you could argue the filesystem was healthier, but was it really worth the chaos that ensued? Not. At. All.