Linux console is unusable when LDAP server is down

When our OpenLDAP server lost power, the consoles on our CentOS machines became nearly unusable.

We were trying to log in with a local account, but each command took minutes to return. Even simple commands like ls would just sit there.

This does not seem to be a problem with the same configuration under Ubuntu. It takes a while for the initial login to succeed for the local account, but once you are in everything works.

I am looking for a way to mitigate the problem, and came up with a couple of ideas:

  • Set a timeout value (if one exists) for the pam_ldap module
  • Run a local LDAP db on each machine and authenticate against that (it would be a slave of the main one)
  • Create a cron job to enable/disable LDAP lookups if we lose connectivity to the LDAP server
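The cron idea would be something roughly like the following sketch. The hostname, the file paths, and the idea of keeping two prebuilt nsswitch.conf variants are all my own assumptions, not tested config:

```shell
#!/bin/sh
# check_ldap: return 0 if the LDAP server answers an anonymous base query
check_ldap() {
    ldapsearch -x -H "ldap://$1" -s base -b "" >/dev/null 2>&1
}

# pick_nsswitch: echo which prebuilt nsswitch.conf variant to install,
# depending on whether the server is reachable
pick_nsswitch() {
    if check_ldap "$1"; then
        echo /etc/nsswitch.conf.ldap
    else
        echo /etc/nsswitch.conf.files
    fi
}

# cron would then do something like:
#   cp "$(pick_nsswitch ldap.example.com)" /etc/nsswitch.conf
```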

Are there any better solutions to managing some sort of redundancy/failover with LDAP?


You have several choices.

We use replication to have several LDAP servers on the network, hidden behind a load balancer, so if one goes down, we still have one available. We use keepalived for our load balancing. You can also use keepalived in a failover setup, where you have a hot backup slave.
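As a rough illustration, a keepalived VRRP instance that floats a virtual IP for the LDAP service looks something like this (the interface name, router id, and addresses are examples, not our actual config):

```
vrrp_instance VI_LDAP {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        10.0.0.50
    }
}
```

Clients then point at the virtual IP, and keepalived moves it to the backup node if the master stops answering.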

Secondly, you can run a local LDAP server on each workstation, but this creates a real maintenance headache, since you need to administer all of them and monitor them to make sure they are keeping up with replication. You don't want them to fall out of sync.

When you have a slave server, make sure you set your updateref option so that any attempted updates are sent to the master server.
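In slapd.conf on each slave, that is a one-line directive (the master URL here is an example):

```
# Refer any write attempts on this replica to the master
updateref ldap://ldap-master.example.com
```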

There are several settings in /etc/ldap.conf you can use to make the situation better. The most important one is:

bind_policy soft

The default is "hard", which keeps retrying to contact the server, waiting between attempts. If you set it to soft, a failed connection returns immediately. You can also use the timeout options to reduce how long it waits:

# Search timelimit
timelimit 30

# Bind/connect timelimit
bind_timelimit 30

I know you have an accepted answer from David, but I'd like to propose a different approach here and share some of my experiences.

I've found the problem with using bind_policy soft is that if you don't get a response from the server right away, say because it is busy or you have high network load, you get an LDAP failure immediately. For nss_ldap this means the NSS lookup fails, and whatever process was using it simply reports that it couldn't find the user or group it was looking up, and fails. This can happen during normal operation while your LDAP server is up, which IMO is worse than problems when your server is down.

I've found a more acceptable solution by using the following settings:

bind_policy hard
nss_reconnect_tries 3
nss_reconnect_sleeptime 1
nss_reconnect_maxsleeptime 8
nss_reconnect_maxconntries 2

This way you'll still have a hard connect policy, but the nss_reconnect_* settings will drastically reduce the amount of time your LDAP client will spend trying to get an LDAP result. It also means that during normal usage, if it fails to get an LDAP result on the first try, it will try again and usually get it the second time around. This means fewer failures during normal use.

As far as running a local LDAP server on each workstation, I don't recommend that. What I can point you to instead is nsscache. It was written by some engineers at Google, and it solves this problem by creating a local cache of the LDAP database and incrementally updating it through a cron job. You then set up your nsswitch source to use their library instead of nss_ldap, and all lookups are local. This greatly reduces the load on your LDAP server and keeps all lookups working when the connection to the server is down. It doesn't have the greatest documentation right now and isn't in widespread use, but it works well, and the mailing lists are quite responsive.
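A hypothetical setup with nsscache's companion NSS module might look like this; the "cache" service name and the cron schedule are assumptions you should check against the nsscache documentation:

```
# /etc/nsswitch.conf -- consult the local cache first, then flat files
passwd: cache files
group:  cache files
shadow: cache files

# /etc/cron.d/nsscache -- pull incremental updates every 15 minutes
*/15 * * * * root /usr/bin/nsscache update
```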