Linux server is only using 60% of memory, then swapping

Bacula performance is highly database dependent. Most likely it's PostgreSQL that's killing your server. The high load average and the fairly large percentage of CPU time spent in wait state clearly show it's waiting for disk I/O... and that's PostgreSQL's doing. For every file in your backup set it's doing at least an UPDATE statement. Don't worry about the swapping.

Do tune the PostgreSQL install. Possibly give individual databases (or even tables) their own disks/RAID sets to spread the I/O around. You can force PostgreSQL to use asynchronous writes if it isn't already... although that's trading database integrity for write performance. Boost the hell out of the shared memory available to PostgreSQL. That will alleviate at least a lot of the reads on the database. If you've never done it, run VACUUM ANALYZE on the Bacula database as well to give the query optimizer something to work with.
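
A rough sketch of the knobs involved (the database name "bacula" and the values below are assumptions; check the parameter names against your PostgreSQL version):

shared_buffers = 512MB        # postgresql.conf: boost the shared memory cache
synchronous_commit = off      # asynchronous commits; you can lose the last few transactions in a crash
# fsync = off                 # faster still, but this one genuinely risks corruption

vacuumdb --analyze bacula     # then refresh the planner statistics on the catalog database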

By far, Bacula's weakest point is its database dependency (and the brain-deadness of some of it...). Run a purge of a recent large backup and notice how long it takes (often hours) to run a couple dozen million queries... Bacula likes comparatively few large files; otherwise it's a dog.


You are I/O-bound. Your system is a little life raft, battered in a stormy sea of buffer/cache/VM paging swells that are 100 feet tall.

Wow. Just... wow. You're moving about 100 MB/sec of I/O, you're deep past 50% CPU time in I/O wait, and you have 4 GB of RAM. The backpressure on this server's VM must be enormous. Under "normal" circumstances, as the system starts buffering and caching, any free RAM you had is going to be eaten alive in less than 40 seconds.

Would it be possible to post the settings from /proc/sys/vm? This would provide some insight as to what your kernel thinks is "normal".
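
Something like this will dump the whole lot in one go (run as root):

grep . /proc/sys/vm/*         # filename: value for every tunable

or

sysctl -a | grep '^vm\.'      # same information via sysctl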

Those postmaster processes also indicate you're running PostgreSQL in the background. Is this normal for your setup? PostgreSQL in a default config will use very little RAM, but once it's re-tuned for speed, it can chew up 25%-40% of your available RAM quickly. So I can only guess, given the number of them in the output, that you're running some kind of production database while you run backups. This doesn't bode well. Can you give some more info on why it's running? What is the size of the shared memory parameter for all the postmaster processes? Would it be possible to shut the service down, or temporarily reconfigure the database to use fewer connections/buffers while the backups are running? This will help take some of the pressure off the already strained I/O and free RAM. Keep in mind that each postmaster process consumes RAM above and beyond what the database uses for internal caching, so when you make adjustments to memory settings, be careful about which are "shared" and which are "per-process".
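
If you're not sure what those values currently are, something like this will show them (assuming you can connect as the postgres superuser; parameter names vary a bit between PostgreSQL versions):

psql -U postgres -c "SHOW shared_buffers;"      # the big shared cache
psql -U postgres -c "SHOW max_connections;"     # how many postmaster backends you can end up with
psql -U postgres -c "SHOW work_mem;"            # per-sort memory, multiplied across connections
ps -o pid,vsz,rss,args -C postmaster            # per-process footprint of each backend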

If you're using PostgreSQL as part of your backup process, re-tune it to accept just the minimum number of connections, and be sure to shrink your per-process parameters down to something reasonable (only a few megs each). The downside is that PostgreSQL will spill to disk when it can't work with the dataset in RAM the way it wants to, which will actually increase your disk I/O, so tune carefully.
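
Purely as an illustration (these numbers are assumptions, not recommendations; the right values depend on your catalog size and PostgreSQL version), a "stay out of the backup's way" postgresql.conf might look like:

max_connections = 20          # just enough for Bacula plus a couple of admin sessions
shared_buffers = 128MB        # a modest shared cache
work_mem = 4MB                # per-sort memory; keep it small, it multiplies per connection
maintenance_work_mem = 64MB   # used by VACUUM and index builds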

X11 in and of itself doesn't take much memory, but a full desktop session can consume several megs. Log out any active sessions you have and run your connection from the console or through SSH.

Still, I don't think this is entirely a memory issue. If you're running better than 50% I/O wait for extended periods of time (and you're posting figures that touch the 70s), the resulting bottleneck will eventually crush the rest of the system. Much like Darth Vader crushes necks.

(image caption: someone on the business end of Darth Vader's death grip)
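
To watch whether I/O wait really is the choke point while a backup runs, iostat (from the sysstat package, assuming you have it installed) is handy:

iostat -x 5       # %iowait per interval, plus per-device utilization and queue depths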

How many flush threads are you configured for? Use

cat /proc/sys/vm/nr_pdflush_threads

to find out and

echo "vm.nr_pdflush_threads = 1" >> /etc/sysctl.conf

to set it to a single thread. Note that the last command only makes the setting load at boot; run sysctl -p to apply it without a reboot. Seeing 1 or 2 in there is not unusual. If you have several cores or lots of spindle/bus capacity for I/O, you'll want to bump these (slightly). More flush threads = more I/O activity, but also more CPU time spent in I/O wait.

Is it the default value, or have you bumped it? If you've bumped it, have you considered decreasing the number to reduce the pressure on I/O ops? Or do you have a huge number of spindles and channels to work with, in which case have you considered increasing the number of flush threads?

P.S. You want to set swappiness to the lower values, not the higher values, to prevent swap-out. The highest value (100) means swap like crazy whenever it feels right; the lowest value (0) means try not to swap at all.
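
For example (the value 10 here is purely illustrative):

cat /proc/sys/vm/swappiness                       # see the current value
sysctl -w vm.swappiness=10                        # apply a lower value immediately
echo "vm.swappiness = 10" >> /etc/sysctl.conf     # and make it stick across reboots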


If you look at the blocks read in per second (the bi column under io in your vmstat output), it dwarfs the swap activity by multiple orders of magnitude. I don't think the swap usage is what's causing your disk thrashing; I think you have something running on the box that is simply causing a lot of disk activity (reads).

I'd investigate the applications running, and see if you can find the culprit.
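
If you have them available (both are assumptions about what your distro ships), iotop or pidstat make that hunt much quicker:

iotop -o          # only show processes actually doing I/O right now
pidstat -d 5      # per-process read/write rates every 5 seconds (part of sysstat)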


See if this link answers some of your questions. I regularly see Linux paging (not swapping) out memory long before 60% utilization. This is an expected piece of its memory tuning:

http://www.sheepguardingllama.com/?p=2252

But your lack of buffers/cache worries me. That looks very unusual. So I am thinking that something more is amiss.
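
For what it's worth, the "-/+ buffers/cache" line is the one to watch here (assuming the older procps free output that still prints it):

free -m           # the "-/+ buffers/cache" row shows memory use with cache discounted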


Can you try disabling swap entirely?

swapoff /dev/hdb2

or some such. At least that will confirm whether it's swapping that's your issue, and not something else.
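
Before and after the test, these will tell you what swap is actually in play (/dev/hdb2 above is just whatever your layout happens to be):

swapon -s         # list active swap devices and how much of each is in use
swapoff -a        # or disable all of them at once
swapon -a         # re-enable everything in /etc/fstab once you're done testing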