How can I track down why the rpm DB on my servers keeps getting corrupted?

There is a long list of bugs in which the BDB environment gets corrupted. Some of them were BDB bugs (several found in just the last couple of years) that have been patched in the Fedora/RHEL libdb but not in upstream BDB 5.x. I don't know about 6.x, but there you run into licensing issues. This is a well-known issue that has no permanent solution.

Root Cause:

If rpm or yum does not exit cleanly, the lock files (__db.001 - __db.005) are left behind in /var/lib/rpm. We can see the PID that left the files behind. The problem tends to be that we have no logs or auditing configured to show what actually killed the process. The most common reason is that an automation tool timed out and abruptly ended the process without letting rpm clear the lock files.
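As a rough illustration of the check described above, the following sketch lists the __db.* files in an rpm database directory and asks fuser (from psmisc) whether any running process still has them open. The function name report_region_files and the RPM_DB_DIR variable are made up for this example. A file that no process holds open is only a hint that something exited uncleanly, not proof of corruption.

```shell
#!/bin/sh
# Sketch: list BDB region/lock files and show whether a live process uses them.
# report_region_files and RPM_DB_DIR are illustrative names, not part of rpm.

report_region_files() {
    found=0
    for f in "$1"/__db.*; do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [ -f "$f" ] || continue
        found=1
        if command -v fuser >/dev/null 2>&1; then
            # fuser prints the PIDs on stdout and exits non-zero if none exist.
            if pids=$(fuser "$f" 2>/dev/null) && [ -n "$pids" ]; then
                echo "$f: open by PID(s)$pids"
            else
                echo "$f: not open by any running process"
            fi
        else
            echo "$f"
        fi
    done
    [ "$found" -eq 1 ] || echo "no region files in $1"
}

report_region_files "${RPM_DB_DIR:-/var/lib/rpm}"
```

Run it as root so that fuser can see processes owned by other users.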

One possible workaround is to force the use of a private environment. That also means practically no locking, but at least queries will not corrupt anything (although the queries themselves could return garbage if run in the middle of a write operation). That is what happens when you run queries as a non-privileged user, but since you can control permissions with sandboxing, you can achieve the same by disallowing the opening of /var/lib/rpm/.dbenv.lock, which causes rpm to fall back to a private environment, meaning it won't open, much less write to, those __db.* files.

The developers' statement is that it won't be fixed completely:

"Making BDB more reliable would require using transactions there, but this would be an incompatible change, which is the last thing we want to do at this point when we're basically just about to deprecate BDB. Which means we cannot do anything about this, on Berkeley DB backend, unfortunately."

They suggest using the dcrpm utility.

dcrpm ("detect and correct rpm") is a tool to detect and correct common issues around RPM database corruption. It attempts a query against your RPM database and runs db4's db_recover if it's hung or otherwise seems broken. It then kills any jobs which had the RPM db open previously since they will be stuck in infinite loops within libdb and can't recover cleanly.
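The check-then-recover idea described above can be approximated by hand. This is only a sketch of the concept, not dcrpm's actual code: it assumes coreutils `timeout` and the `db_recover` utility are installed, the function names and the 60-second limit are invented for the example, and it skips the step of killing processes that still hold the database open.

```shell
#!/bin/sh
# Sketch of "query, then recover if the query hangs or fails".
# db_query_ok and check_and_recover are illustrative names.

# Run a command under a time limit; succeed only if it exits 0 in time.
db_query_ok() {
    limit="$1"
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
}

check_and_recover() {
    db_dir="$1"
    if db_query_ok 60 rpm --dbpath "$db_dir" -qa; then
        echo "rpm query succeeded; nothing to do"
    else
        echo "rpm query hung or failed; running db_recover in $db_dir"
        # -h points db_recover at the database environment directory.
        db_recover -h "$db_dir"
    fi
}

# Example (as root):
#   check_and_recover /var/lib/rpm
```

Make sure no rpm or yum process is running before attempting recovery.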

You can download it from its Git repository. The official guide is available in the same place.

Here is what you need to do for installation:

# git clone https://github.com/facebookincubator/dcrpm.git
# cd dcrpm
# python setup.py install

After the installation you can run the tool and add it to cron:

# dcrpm
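For the cron part, one option is a file under /etc/cron.d. The sketch below only builds the entry. The hourly schedule, the file name /etc/cron.d/dcrpm and the helper name cron_line are assumptions made for this example, so adjust them to your environment.

```shell
#!/bin/sh
# Sketch: print an /etc/cron.d style entry that runs dcrpm hourly as root.
# The schedule and the helper name cron_line are assumptions.

cron_line() {
    # $1 = absolute path to the dcrpm executable
    # Fields: minute hour day-of-month month day-of-week user command
    printf '0 * * * * root %s\n' "$1"
}

# Example (as root), resolving the real install path first:
#   cron_line "$(command -v dcrpm)" > /etc/cron.d/dcrpm
```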

Unfortunately, the installation always failed for me on CentOS 7 because the Python dependencies never installed properly:

error: Setup script exited with error in psutil setup command: 'extras_require' must be a dictionary whose values are strings or lists of strings containing valid project/version requirement specifiers.

This happened even though psutil itself had been installed successfully. Other people have reported that dcrpm worked well for them, though, so give it a try.

I used another official solution from Red Hat instead, rpm-deathwatch (for RHEL 7):

# curl https://people.redhat.com/kwalker/repos/rpm-deathwatch/rhel7/rpm-deathwatch-rhel-7.repo -o /etc/yum.repos.d/rpm-deathwatch.repo
# yum install -y kernel-{devel,headers}-$(uname -r) systemtap && debuginfo-install -y kernel
# yum install rpm-deathwatch
# systemctl start rpm-deathwatch
# systemctl status rpm-deathwatch