Web site kills hard disk I/O, how can I prevent it?
The situation: I have a server on which we host 2-3 projects. Not long ago, the server started hanging up (we could not connect to it over ssh, and already-connected clients had to wait 20 minutes for top to produce any output).
Early today I managed to execute gstat while it was in this state and saw that it stays at 100% on da0, da0s1 and da0s1f. I don't quite know what those IDs mean, but I understand that some process is simply killing the disk by bombarding it with requests.
I'm asking for suggestions. I don't know how to find the culprit and can't prevent this.
The server runs FreeBSD.
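For reference, an invocation like the one below is the kind of gstat run that shows those per-device busy percentages (the -a flag only displays devices that are actually busy; I give it just as an illustration):
# gstat -a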
If your version of FreeBSD is relatively modern, top has a -m option that shows the top I/O talkers if you supply it with the "io" parameter:
top -m io
In this case, I'd also use the -S option (to show system processes, in case one of them is the culprit). To make it behave better under load, I would use -q (to renice it so it runs at a higher priority) and -u (to skip reading /etc/passwd, which should help it load faster).
Since it's taking so long to run top, I'd tell it to display just two passes of its output (-d 2) and then run in batch mode (-b), so it will exit automatically.
When you first run top in this way, the first section of its output will show cumulative I/O counts for a number of processes going quite a ways back (maybe since boot time? I'm actually not sure about this). In the first display, you can see who your top talkers have been over time. In the second display, you can see your top talkers in the past two seconds.
So, putting it all together, and running a find so that some actual I/O is happening:
# top -S -m io -qu -b -d 2 10
last pid: 39560; load averages: 0.28, 0.19, 0.08 up 6+04:02:29 11:28:28
125 processes: 2 running, 104 sleeping, 19 waiting
Mem: 96M Active, 668M Inact, 122M Wired, 25M Cache, 104M Buf, 17M Free
Swap: 2048M Total, 96K Used, 2048M Free
PID UID VCSW IVCSW READ WRITE FAULT TOTAL PERCENT COMMAND
11 0 0 81032823 0 0 0 0 0.00% idle: cpu0
39554 105 129857 556534 74894 0 0 74894 13.62% find
39533 105 443603 614796 0 0 0 0 0.00% sshd
36 0 1793393 0 0 0 0 0 0.00% irq23: vr0
24 0 2377710 2680 0 0 0 0 0.00% irq20: atapci0
50 0 533513 3415672 66 345350 0 345416 62.81% syncer
13 0 78651569 7230 0 0 0 0 0.00% swi4: clock sio
5 0 1911601 20905 0 0 0 0 0.00% g_down
4 0 2368511 12100 0 0 0 0 0.00% g_up
37 0 53308 313 0 0 0 0 0.00% acpi_thermal
last pid: 39560; load averages: 0.28, 0.19, 0.08 up 6+04:02:31 11:28:30
125 processes: 2 running, 104 sleeping, 19 waiting
CPU: 1.9% user, 0.0% nice, 6.0% system, 2.2% interrupt, 89.9% idle
Mem: 96M Active, 671M Inact, 123M Wired, 25M Cache, 104M Buf, 14M Free
Swap: 2048M Total, 96K Used, 2048M Free
PID UID VCSW IVCSW READ WRITE FAULT TOTAL PERCENT COMMAND
11 0 0 1115 0 0 0 0 0.00% idle: cpu0
39554 105 606 651 501 0 0 501 100.00% find
39533 105 616 695 0 0 0 0 0.00% sshd
36 0 1251 0 0 0 0 0 0.00% irq23: vr0
24 0 501 20 0 0 0 0 0.00% irq20: atapci0
50 0 2 2 0 0 0 0 0.00% syncer
13 0 313 3 0 0 0 0 0.00% swi4: clock sio
5 0 501 26 0 0 0 0 0.00% g_down
4 0 501 8 0 0 0 0 0.00% g_up
37 0 0 0 0 0 0 0 0.00% acpi_thermal
Once you narrow down which process is doing all of the I/O, you can use truss or the devel/strace or sysutils/lsof ports to see what your disk-hungry processes are doing (if your system is very busy, of course, you won't be able to install the ports easily).
For example, to see what files and other resources my ntpd process is using:
# lsof -p 890
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
ntpd 890 root cwd VDIR 0,93 1024 2 /
ntpd 890 root rtd VDIR 0,93 1024 2 /
ntpd 890 root txt VREG 0,98 340940 894988 /usr/sbin/ntpd
ntpd 890 root txt VREG 0,93 189184 37058 /libexec/ld-elf.so.1
ntpd 890 root txt VREG 0,93 92788 25126 /lib/libm.so.5
ntpd 890 root txt VREG 0,93 60060 25130 /lib/libmd.so.4
ntpd 890 root txt VREG 0,98 16604 730227 /usr/lib/librt.so.1
ntpd 890 root txt VREG 0,93 1423460 25098 /lib/libcrypto.so.5
ntpd 890 root txt VREG 0,93 1068216 24811 /lib/libc.so.7
ntpd 890 root 0u VCHR 0,29 0t0 29 /dev/null
ntpd 890 root 1u VCHR 0,29 0t0 29 /dev/null
ntpd 890 root 2u VCHR 0,29 0t0 29 /dev/null
ntpd 890 root 3u unix 0xc46da680 0t0 ->0xc4595820
ntpd 890 root 5u PIPE 0xc4465244 0 ->0xc446518c
ntpd 890 root 20u IPv4 0xc4599190 0t0 UDP *:ntp
ntpd 890 root 21u IPv6 0xc4599180 0t0 UDP *:ntp
ntpd 890 root 22u IPv4 0xc4599400 0t0 UDP heffalump.prv.tycho.org:ntp
ntpd 890 root 23u IPv4 0xc4599220 0t0 UDP ns0.prv.tycho.org:ntp
ntpd 890 root 24u IPv4 0xc45995c0 0t0 UDP imap.prv.tycho.org:ntp
ntpd 890 root 25u IPv6 0xc4599530 0t0 UDP [fe80:4::1]:ntp
ntpd 890 root 26u IPv6 0xc45993b0 0t0 UDP localhost:ntp
ntpd 890 root 27u IPv4 0xc4599160 0t0 UDP localhost:ntp
ntpd 890 root 28u rte 0xc42939b0 0t0
... and what system calls it's making (note that this can be resource-intensive):
# truss -p 890
SIGNAL 17 (SIGSTOP)
select(29,{20 21 22 23 24 25 26 27 28},0x0,0x0,0x0) ERR#4 'Interrupted system call'
SIGNAL 14 (SIGALRM)
sigreturn(0xbfbfea10,0xe,0x10003,0xbfbfea10,0x0,0x806aed0) ERR#4 'Interrupted system call'
select(29,{20 21 22 23 24 25 26 27 28},0x0,0x0,0x0) ERR#4 'Interrupted system call'
SIGNAL 14 (SIGALRM)
sigreturn(0xbfbfea10,0xe,0x10003,0xbfbfea10,0x0,0x806aed0) ERR#4 'Interrupted system call'
select(29,{20 21 22 23 24 25 26 27 28},0x0,0x0,0x0) ERR#4 'Interrupted system call'
SIGNAL 14 (SIGALRM)
sigreturn(0xbfbfea10,0xe,0x10003,0xbfbfea10,0x0,0x806aed0) ERR#4 'Interrupted system call'
^C
sysutils/strace is similar to truss, but you'll need to have the /proc filesystem mounted:
# strace -p 890
strace: open("/proc/...", ...): No such file or directory
trouble opening proc file
# grep ^proc /etc/fstab
proc /proc procfs rw,noauto 0 0
# mount /proc
# mount | grep /proc
procfs on /proc (procfs, local)
... and then it will work:
# strace -p 890
Process 890 attached - interrupt to quit
--- SIGALRM (Alarm clock: 14) ---
--- SIGALRM (Alarm clock: 14) ---
syscall_417(0xbfbfea10) = -1 (errno 4)
select(29, [?], NULL, NULL, NULL) = -1 EINTR (Interrupted system call)
--- SIGALRM (Alarm clock: 14) ---
--- SIGALRM (Alarm clock: 14) ---
syscall_417(0xbfbfea10) = -1 (errno 4)
select(29, [?], NULL, NULL, NULL^C <unfinished ...>
Process 890 detached
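One more trick: since gstat already points at a specific partition (da0s1f), you can also come at it from the other direction and ask lsof which processes have files open on the busy filesystem. This is just a sketch: first confirm where that partition is mounted, then hand lsof the mount point (I'm using /usr purely as a placeholder for whatever da0s1f actually maps to on your system):
# mount | grep da0s1f
# lsof /usr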
Good luck - let us know what you discover! Once you have the process(es) identified, I may be able to assist further.
EDIT: Note that running lsof, truss and strace can themselves be resource-intensive. I've made some minor updates to try to reduce their impact. Also, if a process is spawning many children quickly, you may have to tell truss or strace to follow child processes with the -f argument.
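For example, reusing PID 890 from the examples above, a trace that follows forked children might look like this (with truss, -o just sends the output to a file so it doesn't flood your terminal):
# truss -f -o /tmp/truss.out -p 890
# strace -f -p 890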
After some time I found the real problem. As I suspected in my last comment, it was a lack-of-memory issue.
The culprit was the ZEO server for ZODB. It relied heavily on the system's disk I/O cache, and that backfired: when free memory dropped below 500 MB it started to slow down, and at 300 MB it hit the disk so hard that the system stopped responding and some services even started to crash (such as sshd).
After changing the cache structure and freeing up 2 GB of memory, the issue was resolved.
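To catch this kind of memory pressure before it gets that bad again, something along these lines can be run from cron. It is only a sketch: the 500 MB threshold is simply where my system started degrading, and the "memwatch" tag is just a name I made up, so adjust both to taste.
#!/bin/sh
# Warn via syslog when free memory drops below a threshold (in MB).
THRESHOLD_MB=500
PAGESIZE=$(sysctl -n hw.pagesize)
FREE_PAGES=$(sysctl -n vm.stats.vm.v_free_count)
FREE_MB=$((FREE_PAGES * PAGESIZE / 1024 / 1024))
if [ "$FREE_MB" -lt "$THRESHOLD_MB" ]; then
    logger -t memwatch "only ${FREE_MB} MB free (threshold ${THRESHOLD_MB} MB)"
fi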