How do you maintain file server integrity without going offline with chkdsk?

Solution 1:

In my opinion, chkdsk is not a tool for preventive maintenance. If you're having to run chkdsk on a regular basis to correct problems, then you have an underlying problem that needs to be solved.

Solution 2:

I maintained file-servers with around 7TB of general user data. That 7TB was built up mostly of office-type files, so we're talking millions of them. I don't have an exact count because it takes so long to get, but it's somewhere between 7 and 12 million files across the various file-systems on our Server 2008 fail-over cluster.

We never run chkdsk except to fix problems, and we never defrag.

NTFS is now self-healing enough that we run into problems very, very rarely. When we do, it's generally due to a fault somewhere in the storage infrastructure: a spontaneous fibre-channel array-controller reboot, an FC switch panicking and rebooting, that kind of thing. Yanking the power out of the back of the server is eminently survivable.

In fact, we recently survived a catastrophic UPS failure. The entire room dropped hard, simultaneously. NTFS recovered with nary a peep, and no need to run chkdsk.

About defrag... our FC disk array has 48 drives in it, and because it is an HP EVA the stripes are randomly distributed across the spindles. This means that even largely sequential accesses are actually random as far as the drives are concerned, which in turn means that a largely sequential file-system performs only marginally better than a significantly fragmented one. Therefore, routine defrags buy very little performance for a lot of I/O overhead.

As for preventive maintenance, NTFS is now automated enough to do nearly all of that by itself. Once in a while I'll run chkdsk in read-only mode to see whether running it in full mode is worth it. So far on our cluster it has yet to be needed. Even on our 2TB, 4-million-file LUN, the read-only pass runs in less than a day.
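For reference, the read-only mode mentioned above is just chkdsk with no repair switches, so it can run against a mounted volume while the shares stay online. Roughly (drive letter D: is just an example):

```
REM Read-only scan: reports problems but changes nothing; volume stays online.
chkdsk D:

REM Full repair: needs exclusive access, so the volume must come offline.
chkdsk D: /f

REM On Windows Server 2012 and later, /scan performs the scan online and
REM /spotfix applies the queued fixes with only a brief dismount.
chkdsk D: /scan
chkdsk D: /spotfix
```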


That said, there are some architectural decisions you can make that can help reduce the eventual need for an offline chkdsk and make it go faster if you ever need to do one:

  • Set the cache policy on your RAID/SAN controllers to write-through (don't cache writes), so a sudden power loss can't drop in-flight writes. This is the top thing to do to prevent an offline chkdsk. It does cost performance, which is exactly why battery-backed write cache exists: with a healthy battery-backed cache you can leave write caching on without taking that risk.
  • Keep your LUNs smaller. File-count matters more than size. A 6TB LUN full of Ghost images will check a lot faster than a 512GB LUN full of 6KB files.
  • Maintain adequate free-space. A good rule of thumb based on entirely subjective criteria is no less than 15% free at any time.
  • If your data allows, use a cluster (allocation-unit) size larger than the NTFS default of 4KB. After running some statistics on my files, I found I can use 16KB clusters for most of my filesystems. Larger clusters mean fewer clusters to check, and also let the storage subsystem take better advantage of read-ahead. Yes, itty-bitty files consume more space, but on our volumes it only added about 4% to the total size.
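To gauge the space cost of that last point before reformatting, you can estimate the slack (each file rounded up to a whole cluster) from your file-size distribution. A rough sketch, using a made-up mix of 10,000 small 6KB office files and 100 large 50MB files; on a real volume you'd feed in actual sizes (e.g. from `find /data -type f -printf '%s\n'` or a similar inventory):

```shell
#!/bin/sh
# Estimate slack-space overhead at various cluster sizes.
# physical = each file rounded UP to a whole number of clusters;
# slack %  = (physical - logical) / logical.
for cluster in 4096 16384 65536; do
  awk -v c="$cluster" 'BEGIN {
    # Hypothetical file mix for illustration only:
    n_small = 10000; small = 6 * 1024          # 10,000 x 6KB files
    n_big   = 100;   big   = 50 * 1024 * 1024  # 100 x 50MB files
    logical  = n_small * small + n_big * big
    physical = n_small * int((small + c - 1) / c) * c \
             + n_big   * int((big   + c - 1) / c) * c
    printf "%dKB clusters: %.2f%% slack\n", c / 1024,
           100 * (physical - logical) / logical
  }'
done
```

With this particular mix, going from 4KB to 16KB clusters adds only about 2% slack, which is in the same ballpark as the ~4% I saw; a 64KB cluster size, by contrast, gets expensive once small files dominate.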