LDAP dying for unknown reasons

We're using LDAP as a backend user database for our drupal sites (drupal reads from/and writes to it). LDAP has been running fine for weeks but for some reason it justs stopped running today. I was looking at syslog using the below command:

$ grep -i "error\|panic\|unable\|warning\|fatal\|issue" /var/log/syslog | grep slapd

The first errors that appear are these ones:

Sep 20 09:59:38 server slapd[18158]: bdb(dc=example,dc=org): PANIC: fatal region error detected; run recovery
Sep 20 09:59:38 server slapd[18158]: conn=1010 op=7 RESULT tag=107 err=80 text=internal error

Q1: Is there a way to find out more as to what caused the fatal region error? (maybe the ldap module of drupal is issuing commands that openldap doesn't like).

Following the advise from the error message I ran the recovery tool:

$ sudo /etc/init.d/slapd stop
$ sudo /usr/bin/db4.7_recover -v -h /var/lib/ldap/

Finding last valid log LSN: file: 3916 offset 298418
Recovery starting from [3916][298273]
Recovery complete at Tue Sep 20 12:17:42 2011
Maximum transaction ID 80000017 Recovery checkpoint [3916][298418]

$ sudo /etc/init.d/slapd start

What I see next in syslog is this:

Sep 20 12:17:47 server slapd[10286]: @(#) $OpenLDAP: slapd 2.4.21 (Jun  2 2011 19:36:19) $#012#011buildd@allspice:/build/buildd/openldap-2.4.21/debian/build/servers/slapd47
Sep 20 12:17:47 server slapd[10286]: PROXIED attributeDescription "DC" inserted.
Sep 20 12:17:48 server slapd[10287]: bdb(dc=example,dc=org): /var/lib/ldap/log.0000003916: log file unreadable: Permission denied  <----- HERE
Sep 20 12:17:48 server slapd[10287]: bdb(dc=example,dc=org): PANIC: Permission denied
Sep 20 12:17:48 server slapd[10287]: bdb(dc=example,dc=org): Invalid log file: log.0000003916: DB_RUNRECOVERY: Fatal error, run database recovery
Sep 20 12:17:48 server slapd[10287]: bdb(dc=example,dc=org): PANIC: DB_RUNRECOVERY: Fatal error, run database recovery

What is happening there is that one of the log files isn't readable by slapd as indeed the owner has changed from openldap to root and permissions are set to 600:

/var/lib/ldap/log.0000003916: log file unreadable: Permission denied

$ sudo ls -lh /var/lib/ldap/
-rw------- 1 openldap openldap  10M 2011-09-09 11:09 log.0000003914
-rw------- 1 openldap openldap  10M 2011-09-20 09:54 log.0000003915
-rw------- 1 root     root      10M 2011-09-20 12:10 log.0000003916 <----- HERE
-rwxr-xr-x 1 openldap openldap  20M 2011-09-20 12:51 mail.bdb
-rwxr-xr-x 1 openldap openldap 1.7M 2011-09-20 12:51 objectClass.bdb
-rwxr-xr-x 1 openldap openldap  12M 2011-09-20 12:51 sn.bdb
-rwxr-xr-x 1 openldap openldap 1.4M 2011-09-20 12:51 uid.bdb

I fixed this by changing the owner from root to openldap:

$ sudo chown openldap:openldap /var/lib/ldap/log.0000003916

Then slapd started properly:

Sep 20 13:01:40 evergreen slapd[18901]: @(#) $OpenLDAP: slapd 2.4.21 (Jun  2 2011 19:36:19) $#012#011buildd@allspice:/build/buildd/openldap-2.4.21/debian/build/servers/slapd
Sep 20 13:01:40 evergreen slapd[18901]: PROXIED attributeDescription "DC" inserted.
Sep 20 13:01:41 evergreen slapd[18902]: slapd starting

I didn't check the permissions of that log file before running db4.7_recover so I don't know what set the owner to root.

Q2: Is it recommended to run db4.7_recover as openldap to avoid permission issues on log files (e.g. do sudo -u openldap /usr/bin/db4.7_recover -v -h /var/lib/ldap)?

Q3: As a more generic question: is there a way to monitor which process last changed the permission on a file? (that way if the issue is not with running db4.7_recover as sudo I can find out what it is).

Q4: Have you got more advises as to how troubleshooting this type of LDAP issues?

Additional info:
-server=ubuntu 10.04
-openldap=https://help.ubuntu.com/community/OpenLDAPServer


Solution 1:

On busy servers I've seen OpenLDAP having problems with default 1024 max file handles. Increase that limit in /etc/security/limits.conf for openldap user and see if that fixes the problem. Usually if that is the reason slapd will complain about it to logs.

Under certain circumstances slapd can also leak memory over time and as a result Linux kernel OOM will kill it. If that happened, dmesg should tell you a tale how out of memory killer bravely killed slapd.

Q2: should not matter as long as you fix the permissions before starting slapd.

Q3: Linux audit subsystem can help you with that. Read man auditd and auditctl; answering to this is out of scope of your original Q :)

Q4: Increase slapd log level to maximum, read the logs. Google.