Why is kswapd using high CPU on an idle system?

AFAICS this is related to neither free RAM nor swap. We have the same problem here; it sometimes hits production machines while plenty of RAM is free (quite often more than 700 MB), there are no dirty buffers to sync, and 0 bytes of swap are used. It definitely looks like a severe kernel bug caused by some unknown race condition.

We currently run CentOS kernel 2.6.18-194.el5 and will try replacing it with a newer kernel, because we think this might help.

Update:

Red Hat has confirmed that this is a kernel issue in 2.6.18-194.el5.

Solutions:

Minimum: kernel-2.6.18-194.32.1.el5 contains the immediate bugfix
Better: kernel-2.6.18-238.el5 contains additional kswapd-related bugfixes
Best: kernel-2.6.18-348.4.1.el5, the latest kernel that runs with RHEL 5.5 without changes
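To see whether a given machine already runs a kernel with the immediate fix, something like the following could be used. This is only a rough sketch: the helper name `kernel_at_least` is made up for illustration, and it assumes release strings of the form shown above, comparing only the numeric fields before `.el5`.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if kernel release $1 is at least release $2.
# Everything from ".el" onward (e.g. ".el5", ".el5PAE") is ignored.
kernel_at_least() {
    awk -v a="${1%%.el*}" -v b="${2%%.el*}" 'BEGIN {
        na = split(a, x, "[.-]")
        nb = split(b, y, "[.-]")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            # Missing fields count as 0; "+ 0" forces a numeric comparison.
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

if kernel_at_least "$(uname -r)" "2.6.18-194.32.1.el5"; then
    echo "running kernel is at or above 2.6.18-194.32.1.el5"
else
    echo "running kernel predates 2.6.18-194.32.1.el5"
fi
```

On RPM-based systems the package manager's own version comparison is the authoritative one; the field-by-field comparison here is just a quick approximation.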

In the meantime, there is a script that detects the 100% CPU situation quite well. Our monitoring calls it every minute to inform us about the situation. If the situation persists for too long, affected machines lock up completely: more and more unkillable processes use 100% CPU until the machine becomes completely unmanageable.
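For illustration only, here is a minimal sketch of how such a check could look. It is not the script mentioned above. It assumes `pgrep` and `getconf` are available, and the function names, the 90% threshold and the 5 second sampling interval are all made-up values. It samples the CPU ticks of each kswapd thread from /proc/PID/stat twice and compares the difference to the elapsed time.

```shell
#!/bin/sh
# Assumed defaults, not taken from the original script.
THRESHOLD=${THRESHOLD:-90}   # percent of one CPU
INTERVAL=${INTERVAL:-5}      # seconds between the two samples

# Print utime + stime (clock ticks) of PID $1 from /proc/PID/stat.
# Fields 14 and 15 are only reliable when the process name has no spaces,
# which holds for kswapd0, kswapd1, ...
cpu_ticks() {
    awk '{ print $14 + $15 }' "/proc/$1/stat" 2>/dev/null
}

# $1 = ticks before, $2 = ticks after, $3 = interval (s), $4 = ticks per second
cpu_percent() {
    echo $(( ($2 - $1) * 100 / ($3 * $4) ))
}

# Return 2 and print a message if any kswapd thread exceeds THRESHOLD.
# Threads are sampled one after another, so N threads take N * INTERVAL.
check_kswapd() {
    hz=$(getconf CLK_TCK)
    status=0
    for pid in $(pgrep kswapd); do
        t1=$(cpu_ticks "$pid") || continue
        sleep "$INTERVAL"
        t2=$(cpu_ticks "$pid") || continue
        pct=$(cpu_percent "$t1" "$t2" "$INTERVAL" "$hz")
        if [ "$pct" -ge "$THRESHOLD" ]; then
            echo "kswapd pid $pid at ${pct}% CPU"
            status=2
        fi
    done
    return $status
}
```

A monitoring job would call `check_kswapd` and alert on a non-zero exit status. Sampling /proc twice is used here instead of `ps -o pcpu`, because ps reports CPU usage averaged over the whole lifetime of the process, which hides a recent spike on a long-running kernel thread.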

Currently the only known way to resolve the problem is to manually hard-reboot the affected machine. /sbin/reboot is not reliable, because the machine hangs on shutdown far too often.

To hard-reboot a machine from any root shell command line, without direct access to the console, do:

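echo 10 > /proc/sys/kernel/panic   # reboot automatically 10 seconds after a kernel panic
echo 1 > /proc/sys/kernel/sysrq    # enable all SysRq functions
echo s > /proc/sysrq-trigger       # sync all mounted filesystems
sleep 5
echo s > /proc/sysrq-trigger       # sync a second time
sleep 1
echo b > /proc/sysrq-trigger       # reboot immediately, without unmounting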

Keep in mind to do this only after quiescing the machine, so that no process is still writing to the disks. This should keep fsck from running into severe trouble after the reboot.
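One rough way to judge whether writes have settled is to look at the Dirty and Writeback counters in /proc/meminfo. The sketch below is only a hint, not a guarantee that nothing is writing; the function name `pending_writes_kb` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the amount of data (kB) still waiting to be
# written to disk. $1 is the meminfo file to read; defaults to /proc/meminfo.
pending_writes_kb() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { sum += $2 }
         END { print sum + 0 }' "${1:-/proc/meminfo}"
}

echo "pending writes: $(pending_writes_kb) kB"
```

If the value stays near zero across a few calls after a `sync`, the machine is probably quiet enough for the hard reboot.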

Sorry, no real solution, but HTH. Also keep in mind that things other than the bug described here might cause kswapd to use 100% CPU, so automating a reboot in this case is probably a bad idea.