How to allow SSH logins even when a forced filesystem check fails

Intro

I'm having a problem with a few production servers running CentOS 6.4. They crash from time to time, and we have to reboot them before we can use them again.

The problem

The problem is that after a reboot the servers sometimes perform a forced filesystem check, and when it fails somebody has to physically go to the server and run fsck manually.

The question

Is there a way to boot the system even when this forced check fails, so that we can still reach the servers through SSH? Or is there another solution that gives us both a regular filesystem check and SSH access to the server?

Thanks in advance!


Solution 1:

Firstly, use some remote console connectivity that is not OS dependent. For Dell it's iDRAC, for HP it's iLO, for IBM it's RSA2, etc. This is standard practice, because many boot errors other than a failed fsck can leave a server unreachable.

Secondly, see the automated fsck question. But if you are doing this "automation", make sure you have tested your backups. With automatic repair in place, fsck completes without waiting for input, the boot continues, and you can connect via SSH.

Solution 2:

I would concentrate on finding the cause of the initial problem. Either the filesystem corruption is another symptom along with the machines becoming unresponsive, or you are performing an unsafe reboot (a power cycle), or both of the above.

You don't say how your filesystems are arranged or which ones are getting corrupted. Suppose you have a very small root filesystem with almost everything else on separate mounts (/sbin, /etc, and a few other things generally need to stay on the root fs), and the problems fsck picks up are on the non-root filesystems. In that case, if you are familiar with shell scripting, you could adjust the boot process such that:

  • only problems on / cause it to block
  • ssh is brought up as soon as possible after / is checked and ready
  • the other filesystems are mounted read-only (and you are alerted by mail perhaps, and other public facing services are not started) if problems are found

That way you can ssh in to fix the other filesystems and kick off a clean reboot to put things back in order.
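The boot-time logic described above could be sketched roughly as follows. This is only an illustration, not a drop-in replacement for the CentOS init scripts: the function names and the "boot-fsck" log tag are hypothetical, and it assumes an ext-family filesystem whose checker accepts -n (check only, change nothing).

```shell
#!/bin/sh
# Sketch only: function names and the "boot-fsck" log tag are hypothetical.

# Map an fsck exit status to a mount mode.
# Per fsck(8), status 0 means no errors were found. Any other value is
# treated here as "not safe to write", so the filesystem goes read-only.
mount_mode_for_fsck_status() {
    if [ "$1" -eq 0 ]; then
        echo rw
    else
        echo ro
    fi
}

# Check one non-root filesystem without modifying it, then mount it
# read-write if clean, or read-only (with a syslog warning) if not.
check_and_mount() {
    dev=$1
    mnt=$2
    fsck -n "$dev"
    status=$?
    mode=$(mount_mode_for_fsck_status "$status")
    if [ "$mode" = ro ]; then
        logger -t boot-fsck "fsck status $status on $dev; mounting $mnt read-only"
    fi
    mount -o "$mode" "$dev" "$mnt"
}
```

A boot script would start sshd once / is ready and then call something like `check_and_mount /dev/sdb1 /data` for each non-root filesystem (the device and mount point here are made up), holding back public-facing services if any call ended up read-only.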

There are options to make fsck automatically try to fix issues (most are usually not serious if they are caused by an unsafe reboot, especially with journalled filesystems), but this is usually not recommended in production environments as it can hide a growing problem. Under Debian/Ubuntu/similar, look for the FSCKFIX option in /etc/default/rcS; the results get logged in /var/log/fsck/checkfs if /var was on a filesystem that was successfully mounted read-write. Something similar will exist in CentOS too.
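For reference, on a Debian-style sysvinit system the setting is a single shell-style variable. A minimal fragment, assuming the stock /etc/default/rcS (the CentOS equivalent is not shown here and would need to be looked up separately):

```shell
# /etc/default/rcS (Debian/Ubuntu sysvinit)
# "no" (the default) runs the boot-time fsck with -a and stops the boot
# on errors it will not repair automatically.
# "yes" runs it with -y instead, so every repair is attempted unprompted.
FSCKFIX=yes
```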

If you really want to fly by the seat of your pants, set passno (the sixth and final column) to 0 for everything in /etc/fstab and nothing will get checked at boot. This is of course very much not recommended... If you do go with this approach, I suggest you set only minimal services to start automatically on boot, SSH in immediately after reboot, run fsck manually over everything while it is mounted read-only, remount everything read-write, and then start your services. This way you have access to the machine, but your user-facing services do not start until you are sure the filesystems on the machine are clean.
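To illustrate the fstab change, here is a small sketch that prints a copy of an fstab with the sixth field set to 0. The function name disable_boot_fsck is hypothetical. It writes to standard output rather than editing the file in place, so the result can be reviewed before anything is replaced. Note that awk rebuilds each modified line with single spaces between fields.

```shell
#!/bin/sh
# Sketch only: disable_boot_fsck is a hypothetical helper name.
# Prints the fstab named in $1 with the sixth field (passno) forced to 0.
disable_boot_fsck() {
    awk '
        # Pass through comments, blank lines, and entries with fewer than
        # six fields (a missing sixth field already means "do not check").
        $1 ~ /^#/ || NF < 6 { print; next }
        # Force passno to 0 so the boot-time fsck skips this entry.
        { $6 = 0; print }
    ' "$1"
}
```

One cautious way to use it would be `disable_boot_fsck /etc/fstab > /tmp/fstab.nofsck`, followed by a diff against the original before copying anything over /etc/fstab.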

But really finding the root cause should be your priority here IMO, and remote KVM options are a better idea than risking booting into an OS with potentially corrupt filesystems.