How long can fsck take on a 30 TB volume?

Solution 1:

fsck speed mainly depends on the number of files and how they are spread across the directory tree. That said, six months for an fsck is absolutely absurd: it should have completed in a few hours at most, especially on XFS, which has the speedy xfs_repair utility. Here you can find some fsck runs at scale - all completed in under one hour (3600 s). So it is simply not plausible that your fsck is still running.

Anyway, an unexpected power loss will not cause a full-blown fsck, only a very fast (a few seconds) journal replay. However, if some key files were damaged, the OS can be left unbootable.

But they probably just lied to you. You should stop paying immediately, ask for an explanation, and request a full refund.

Solution 2:

Conjecture: their system uses a BBU/FBWC-less RAID (or even software RAID) with every write cache (including those in the hard drives themselves) set to its most aggressive setting, in order to get maximum performance at minimal cost. A hard power outage on such a setup can leave a journaling filesystem in a state where the journal cannot be trusted and cannot be used for recovery. The problem is that such a system aggressively reorders and postpones writes, which means a journal entry can be persisted while the data write it describes is lost - or the journal entry can be lost for a data write that did reach the disk.
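To see why reordering breaks the journal's guarantee, here is a toy sketch (not any real drive's behavior - the class and its methods are invented for illustration) of a volatile cache that acknowledges writes immediately and persists them in arbitrary order:

```python
import random

class VolatileCache:
    """Toy model of a battery-less drive write cache: writes are
    acknowledged immediately but only reach stable storage when the
    cache flushes them, in whatever order it likes."""

    def __init__(self):
        self.pending = []   # acknowledged but still volatile
        self.disk = {}      # what actually survives a power loss

    def write(self, key, value):
        self.pending.append((key, value))  # ack'd, not yet persisted

    def flush_some(self, n):
        random.shuffle(self.pending)       # aggressive reordering
        for key, value in self.pending[:n]:
            self.disk[key] = value
        self.pending = self.pending[n:]

    def power_loss(self):
        self.pending.clear()               # volatile contents are gone

random.seed(0)
cache = VolatileCache()
# A journaled update is two writes: the journal entry, then the data.
cache.write("journal", "commit txn 1: block A := new")
cache.write("block_A", "new")
cache.flush_some(1)   # only one of the two reaches the platter...
cache.power_loss()    # ...before the outage

# Depending on the reorder, the disk now holds a journal entry whose
# data never landed, or data whose journal entry is missing.
print(cache.disk)
```

Either outcome leaves the journal inconsistent with the data, which is exactly the state a plain journal replay cannot repair.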

Recovering such a system from a worst-case outage can mean running a "slow" fsck/repair that actually examines all the filesystem structures as they stand, which could indeed take a day or two for 30 TB... and it is not unlikely that you will have to run multiple repair cycles. Add to that that personnel may not always be available to monitor the process, and you could easily be down to one fsck pass per week. They probably gave up and forgot about it.

Solution 3:

For most filesystems it will be much faster, even when there are errors, as normally only the metadata is checked.

In the worst case, it may read the whole disk (e.g. something like fsck.ext4 -cc /dev/sda, which runs a non-destructive read-write test on every block), and that could take a few days for 30 TB. If you know the speed of the drives, you can estimate size/speed. For a consumer hard drive at about 100 MB/s, even copying a few TB takes more hours than most people expect.
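The size/speed estimate above is simple arithmetic; here is a quick sketch (the 100 MB/s sustained throughput is this answer's assumption, and the pass count for the read-write test is a rough guess - real arrays and badblocks patterns vary):

```python
def full_scan_hours(volume_bytes, throughput_bytes_per_s, passes=1):
    """Hours needed to touch every block `passes` times at a sustained rate."""
    return passes * volume_bytes / throughput_bytes_per_s / 3600

TB = 10**12
MB = 10**6

# One full sequential read of 30 TB at a consumer drive's ~100 MB/s:
print(round(full_scan_hours(30 * TB, 100 * MB)))  # ~83 hours, i.e. 3-4 days

# A non-destructive read-write test touches every block several times
# (read, write pattern, read back, restore), so multiply accordingly:
print(round(full_scan_hours(30 * TB, 100 * MB, passes=4)))
```

Even the single-pass best case is the better part of a week on one slow spindle, which is why "a few days" is plausible - and six months is not.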

If it were your own server, you could hit the problem that it boots and then hangs while fsck asks whether you want to fix an error. But a datacenter admin won't leave an fsck hanging for six months while all the VPSes are offline.

So they are either lying to you, or there is a huge misunderstanding. Or they were running fsck some time ago and did not update you about the new problem after it finished.