How to diagnose frequent segfaults
My server is logging frequent segmentation faults to /var/log/kern.log in different tools. So far I've seen them in Perl, PHP and rsync. All installed software is up-to-date Debian packages. Here's an exerpt from the log file:
Mar 2 01:07:54 gaz kernel: [ 5316.246303] imapsync[4533]: segfault at 8b ip 00007fb448c98fe6 sp 00007ffff571dd68 error 4 in libperl.so.5.10.1[7fb448bd7000+164000]
Mar 2 01:17:42 gaz kernel: [ 5904.354307] php5-cgi[4441]: segfault at 2bb3dc8 ip 0000000002bb3dc8 sp 00007fffbeeaae48 error 15
Mar 2 02:54:05 gaz kernel: [11687.922316] php5-cgi[4495]: segfault at 2d7acf9 ip 0000000002d7acf9 sp 00007fff60c6eb18 error 15
Mar 2 10:50:08 gaz kernel: [40250.390322] BUG: unable to handle kernel paging request at 00000000024b03f0
Mar 2 10:50:08 gaz kernel: [40250.390341] IP: [<00000000024b03f0>] 0x24b03f0
Mar 2 10:50:08 gaz kernel: [40250.390353] PGD 208c71067 PUD 21c811067 PMD 209329067 PTE 8000000211c88067
Mar 2 10:50:08 gaz kernel: [40250.390365] Oops: 0011 [#1] SMP
Mar 2 10:50:08 gaz kernel: [40250.390373] last sysfs file: /sys/devices/pci0000:00/0000:00:12.0/host4/target4:0:0/4:0:0:0/block/sdb/stat
Mar 2 10:50:08 gaz kernel: [40250.390386] CPU 1
Mar 2 10:50:08 gaz kernel: [40250.390392] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative xt_recent xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_
ipv4 ip6table_filter ip6_tables xt_DSCP xt_TCPMSS ipt_LOG ipt_REJECT iptable_mangle iptable_filter xt_multiport xt_state xt_limit xt_conntrack nf_conntrack_ftp nf_conntrack ip_tables x_tables loop snd
_hda_codec_atihdmi snd_hda_intel snd_hda_codec snd_hwdep snd_pcm radeon snd_timer ttm snd drm_kms_helper soundcore drm snd_page_alloc i2c_algo_bit shpchp i2c_piix4 edac_core pcspkr k8temp evdev edac_m
ce_amd pci_hotplug i2c_core button ext3 jbd mbcache dm_mod powernow_k8 aacraid 3w_9xxx 3w_xxxx raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 md_mod
sata_nv sata_sil sata_via sd_mod crc_t10dif ata_generic ahci pata_atiixp ohci_hcd libata r8169 mii thermal ehci_hcd processor thermal_sys scsi_mod usbcore nls_base [last unloaded: scsi_wait_scan]
Mar 2 10:50:08 gaz kernel: [40250.390566] Pid: 11482, comm: munin-limits Not tainted 2.6.32-5-amd64 #1 MS-7368
Mar 2 10:50:08 gaz kernel: [40250.390576] RIP: 0010:[<00000000024b03f0>] [<00000000024b03f0>] 0x24b03f0
Mar 2 10:50:08 gaz kernel: [40250.390586] RSP: 0018:ffff88021cc8dec0 EFLAGS: 00010286
Mar 2 10:50:08 gaz kernel: [40250.390593] RAX: 000000001ddc1000 RBX: 0000000000000010 RCX: ffffffff810f9904
Mar 2 10:50:08 gaz kernel: [40250.390600] RDX: 0000000000000000 RSI: ffffea0007688200 RDI: 0000000000000286
Mar 2 10:50:08 gaz kernel: [40250.390608] RBP: 00000000ffffffea R08: 0000000000000025 R09: 7865542f30312e35
Mar 2 10:50:08 gaz kernel: [40250.390615] R10: 000000d01cc8ddf8 R11: 0000000000000246 R12: ffff88021cc8def8
Mar 2 10:50:08 gaz kernel: [40250.390622] R13: 0000000002295010 R14: 00000000022c9db0 R15: 0000000002488d78
Mar 2 10:50:08 gaz kernel: [40250.390630] FS: 00007f3b3c8b2700(0000) GS:ffff880008d00000(0000) knlGS:0000000000000000
Mar 2 10:50:08 gaz kernel: [40250.390641] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 2 10:50:08 gaz kernel: [40250.390648] CR2: 00000000024b03f0 CR3: 000000021c5d1000 CR4: 00000000000006e0
Mar 2 10:50:08 gaz kernel: [40250.390656] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 2 10:50:08 gaz kernel: [40250.390663] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar 2 10:50:08 gaz kernel: [40250.390671] Process munin-limits (pid: 11482, threadinfo ffff88021cc8c000, task ffff88021bf59530)
Mar 2 10:50:08 gaz kernel: [40250.390681] Stack:
Mar 2 10:50:08 gaz kernel: [40250.390687] ffffffff810f1d4a ffff880208c63228 0000000000000000 00007fffc2dcecc0
Mar 2 10:50:08 gaz kernel: [40250.390697] <0> 00000000024ba2b0 0000000002295010 ffffffff810f1e3d 0000000000000004
Mar 2 10:50:08 gaz kernel: [40250.390712] <0> ffff88021bf59530 ffff88021c4edc00 ffffffff812fe0b6 ffff88021c4edc60
Mar 2 10:50:08 gaz kernel: [40250.390732] Call Trace:
Mar 2 10:50:08 gaz kernel: [40250.390742] [<ffffffff810f1d4a>] ? vfs_fstatat+0x2c/0x57
Mar 2 10:50:08 gaz kernel: [40250.390750] [<ffffffff810f1e3d>] ? sys_newstat+0x11/0x30
Mar 2 10:50:08 gaz kernel: [40250.390760] [<ffffffff812fe0b6>] ? do_page_fault+0x2e0/0x2fc
Mar 2 10:50:08 gaz kernel: [40250.390768] [<ffffffff812fbf55>] ? page_fault+0x25/0x30
Mar 2 10:50:08 gaz kernel: [40250.390777] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
Mar 2 10:50:08 gaz kernel: [40250.390783] Code: Bad RIP value.
Mar 2 10:50:08 gaz kernel: [40250.390791] RIP [<00000000024b03f0>] 0x24b03f0
Mar 2 10:50:08 gaz kernel: [40250.390799] RSP <ffff88021cc8dec0>
Mar 2 10:50:08 gaz kernel: [40250.390805] CR2: 00000000024b03f0
Mar 2 10:50:08 gaz kernel: [40250.391051] ---[ end trace 1cc1473b539c7f6e ]---
Mar 2 11:42:20 gaz kernel: [43382.242301] php5-cgi[10963]: segfault at d81160 ip 0000000000d81160 sp 00007fff3adcb058 error 15
Mar 2 21:51:14 gaz kernel: [79916.418302] php5-cgi[20089]: segfault at 1c59dc8 ip 0000000001c59dc8 sp 00007fff9b877fb8 error 15
Mar 3 03:45:01 gaz kernel: [101143.334305] munin-update[22519] general protection ip:7f516dce204c sp:7fff6049a978 error:0 in libperl.so.5.10.1[7f516dc7d000+164000]
Mar 3 11:22:37 gaz kernel: [128599.570307] php5-cgi[22888]: segfault at 36485a8 ip 00000000036485a8 sp 00007fff2d56e1c8 error 15
Mar 4 08:32:17 gaz kernel: [204779.842304] php5-cgi[22090]: segfault at 18 ip 0000000000689e5e sp 00007fff677a6a48 error 6 in php5-cgi[400000+6f9000]
Mar 4 10:01:02 gaz kernel: [210104.434706] rsync[22236] general protection ip:7f14a07137f9 sp:7fff88f940b8 error:0 in libc-2.11.2.so[7f14a069d000+158000]
Mar 4 11:32:22 gaz kernel: [215584.262316] BUG: unable to handle kernel paging request at 00000000ffffff9c
Mar 4 11:32:22 gaz kernel: [215584.262331] IP: [<00000000ffffff9c>] 0xffffff9c
Mar 4 11:32:22 gaz kernel: [215584.262343] PGD 0
Mar 4 11:32:22 gaz kernel: [215584.262350] Oops: 0010 [#2] SMP
Mar 4 11:32:22 gaz kernel: [215584.262359] last sysfs file: /sys/devices/pci0000:00/0000:00:12.0/host4/target4:0:0/4:0:0:0/block/sdb/stat
Mar 4 11:32:22 gaz kernel: [215584.262371] CPU 1
Mar 4 11:32:22 gaz kernel: [215584.262378] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative xt_recent xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 ip6table_filter ip6_tables xt_DSCP xt_TCPMSS ipt_LOG ipt_REJECT iptable_mangle iptable_filter xt_multiport xt_state xt_limit xt_conntrack nf_conntrack_ftp nf_conntrack ip_tables x_tables loop snd_hda_codec_atihdmi snd_hda_intel snd_hda_codec snd_hwdep snd_pcm radeon snd_timer ttm snd drm_kms_helper soundcore drm snd_page_alloc i2c_algo_bit shpchp i2c_piix4 edac_core pcspkr k8temp evdev edac_mce_amd pci_hotplug i2c_core button ext3 jbd mbcache dm_mod powernow_k8 aacraid 3w_9xxx 3w_xxxx raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 md_mod sata_nv sata_sil sata_via sd_mod crc_t10dif ata_generic ahci pata_atiixp ohci_hcd libata r8169 mii thermal ehci_hcd processor thermal_sys scsi_mod usbcore nls_base [last unloaded: scsi_wait_scan]
Mar 4 11:32:22 gaz kernel: [215584.262552] Pid: 1960, comm: proxymap Tainted: G D 2.6.32-5-amd64 #1 MS-7368
Mar 4 11:32:22 gaz kernel: [215584.262563] RIP: 0010:[<00000000ffffff9c>] [<00000000ffffff9c>] 0xffffff9c
Mar 4 11:32:22 gaz kernel: [215584.262573] RSP: 0018:ffff880209257e00 EFLAGS: 00010212
Mar 4 11:32:22 gaz kernel: [215584.262580] RAX: ffff8801514eb780 RBX: ffffffff810efb2d RCX: 0000000000000000
Mar 4 11:32:22 gaz kernel: [215584.262590] RDX: 0000000000000020 RSI: 0000000000000001 RDI: ffff8801514eb780
Mar 4 11:32:22 gaz kernel: [215584.262600] RBP: 00000000ffffffe9 R08: 0000000000000000 R09: 0000000000000000
Mar 4 11:32:22 gaz kernel: [215584.262611] R10: ffff880209257e78 R11: ffffffff81152c7c R12: 0000000000000001
Mar 4 11:32:22 gaz kernel: [215584.262622] R13: 0000000000008001 R14: 0000000000000024 R15: 00000000ffffff9c
Mar 4 11:32:22 gaz kernel: [215584.262633] FS: 00007fca4de35700(0000) GS:ffff880008d00000(0000) knlGS:0000000000000000
Mar 4 11:32:22 gaz kernel: [215584.262644] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 4 11:32:22 gaz kernel: [215584.262650] CR2: 00000000ffffff9c CR3: 00000001c9cbb000 CR4: 00000000000006e0
Mar 4 11:32:22 gaz kernel: [215584.262661] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 4 11:32:22 gaz kernel: [215584.262671] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar 4 11:32:22 gaz kernel: [215584.262682] Process proxymap (pid: 1960, threadinfo ffff880209256000, task ffff88021c4b1c40)
Mar 4 11:32:22 gaz kernel: [215584.262693] Stack:
Mar 4 11:32:22 gaz kernel: [215584.262698] ffffffff810f8566 ffff880209257e78 ffff88021c7bf000 ffff88021c7bf0c8
Mar 4 11:32:22 gaz kernel: [215584.262709] <0> 0000800000000000 ffff88021fc0f000 ffff880209257e78 00000000fffffffe
Mar 4 11:32:22 gaz kernel: [215584.262724] <0> ffffffff810e5881 ffff880209257f48 0000000000000286 ffff88021fc0f000
Mar 4 11:32:22 gaz kernel: [215584.262743] Call Trace:
Mar 4 11:32:22 gaz kernel: [215584.262753] [<ffffffff810f8566>] ? do_filp_open+0xa7/0x94b
Mar 4 11:32:22 gaz kernel: [215584.262763] [<ffffffff810e5881>] ? virt_to_head_page+0x9/0x2a
Mar 4 11:32:22 gaz kernel: [215584.262771] [<ffffffff810f9904>] ? user_path_at+0x52/0x79
Mar 4 11:32:22 gaz kernel: [215584.262779] [<ffffffff810cfec1>] ? get_unmapped_area+0xd7/0x139
Mar 4 11:32:22 gaz kernel: [215584.262787] [<ffffffff811019d5>] ? alloc_fd+0x67/0x10c
Mar 4 11:32:22 gaz kernel: [215584.262795] [<ffffffff810eceaf>] ? do_sys_open+0x55/0xfc
Mar 4 11:32:22 gaz kernel: [215584.262804] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
Mar 4 11:32:22 gaz kernel: [215584.262811] Code: Bad RIP value.
Mar 4 11:32:22 gaz kernel: [215584.262819] RIP [<00000000ffffff9c>] 0xffffff9c
Mar 4 11:32:22 gaz kernel: [215584.262828] RSP <ffff880209257e00>
Mar 4 11:32:22 gaz kernel: [215584.262833] CR2: 00000000ffffff9c
Mar 4 11:32:22 gaz kernel: [215584.263077] ---[ end trace 1cc1473b539c7f6f ]---
As you can see there are segfaults, a general protection fault and a Kernel Oops. My first guess was that there's a Hardware problem of some sort and I asked my Hoster (it's a rented root server) to do a full hardwarecheck - they did, but couldn't find any problem.
I don't know what and how they checked but their support team is usually quite good. I ran memtester and cpuburn myself and couldn't find any error either.
Unfortunately I have no reliable way to reproduce these segfaults, they seem to be more or less random. On a hunch I disabled the firewall of the system and ran one of the programs that segfaulted regularily (imapsync) and it seemed to take longer to segfault than before, so the problem might be related to the network stack. Or could just be a random thing.
Here are the kernel specs:
# uname -a
Linux gaz 2.6.32-5-amd64 #1 SMP Wed Jan 12 03:40:32 UTC 2011 x86_64 GNU/Linux
# cat /etc/debian_version
6.0
# lsmod
Module Size Used by
cpufreq_userspace 1992 0
cpufreq_stats 2659 0
cpufreq_powersave 902 0
cpufreq_conservative 5162 0
xt_recent 5977 0
xt_tcpudp 2319 0
iptable_nat 4299 0
nf_nat 13388 1 iptable_nat
nf_conntrack_ipv4 9833 3 iptable_nat,nf_nat
nf_defrag_ipv4 1139 1 nf_conntrack_ipv4
ip6table_filter 2384 0
ip6_tables 15075 1 ip6table_filter
xt_DSCP 1995 0
xt_TCPMSS 2919 0
ipt_LOG 4518 0
ipt_REJECT 1953 0
iptable_mangle 2817 0
iptable_filter 2258 0
xt_multiport 2267 0
xt_state 1303 0
xt_limit 1782 0
xt_conntrack 2407 0
nf_conntrack_ftp 5537 0
nf_conntrack 46535 6 iptable_nat,nf_nat,nf_conntrack_ipv4,xt_state,xt_conntrack,nf_conntrack_ftp
ip_tables 13899 3 iptable_nat,iptable_mangle,iptable_filter
x_tables 12845 13 xt_recent,xt_tcpudp,iptable_nat,ip6_tables,xt_DSCP,xt_TCPMSS,ipt_LOG,ipt_REJECT,xt_multiport,xt_state,xt_limit,xt_conntrack,ip_tables
loop 11799 0
radeon 573996 0
ttm 39986 1 radeon
drm_kms_helper 20065 1 radeon
snd_hda_codec_atihdmi 2251 1
drm 142359 3 radeon,ttm,drm_kms_helper
snd_hda_intel 20019 0
i2c_algo_bit 4225 1 radeon
pcspkr 1699 0
i2c_piix4 8328 0
snd_hda_codec 54244 2 snd_hda_codec_atihdmi,snd_hda_intel
i2c_core 15712 5 radeon,drm_kms_helper,drm,i2c_algo_bit,i2c_piix4
snd_hwdep 5380 1 snd_hda_codec
snd_pcm 60503 2 snd_hda_intel,snd_hda_codec
snd_timer 15582 1 snd_pcm
snd 46446 5 snd_hda_intel,snd_hda_codec,snd_hwdep,snd_pcm,snd_timer
soundcore 4598 1 snd
evdev 7352 3
snd_page_alloc 6249 2 snd_hda_intel,snd_pcm
k8temp 3283 0
edac_core 29261 0
edac_mce_amd 6433 0
shpchp 26264 0
pci_hotplug 21203 1 shpchp
button 4650 0
ext3 106518 2
jbd 37085 1 ext3
mbcache 5050 1 ext3
dm_mod 53754 0
powernow_k8 10978 1
aacraid 59779 0
3w_9xxx 28684 0
3w_xxxx 20569 0
raid10 17809 0
raid456 44500 0
async_raid6_recov 5170 1 raid456
async_pq 3479 2 raid456,async_raid6_recov
raid6_pq 77179 2 async_raid6_recov,async_pq
async_xor 2478 3 raid456,async_raid6_recov,async_pq
xor 4380 1 async_xor
async_memcpy 1198 2 raid456,async_raid6_recov
async_tx 1734 5 raid456,async_raid6_recov,async_pq,async_xor,async_memcpy
raid1 18431 3
raid0 5517 0
md_mod 73824 7 raid10,raid456,raid1,raid0
sata_nv 19166 0
sata_sil 7412 0
sata_via 7928 0
sd_mod 29889 8
crc_t10dif 1276 1 sd_mod
ata_generic 3047 0
ahci 32374 6
r8169 29229 0
mii 3210 1 r8169
thermal 11674 0
pata_atiixp 3489 0
libata 133632 6 sata_nv,sata_sil,sata_via,ata_generic,ahci,pata_atiixp
ohci_hcd 19212 0
ehci_hcd 31151 0
processor 29935 1 powernow_k8
thermal_sys 11942 2 thermal,processor
scsi_mod 122149 5 aacraid,3w_9xxx,3w_xxxx,sd_mod,libata
usbcore 122034 3 ohci_hcd,ehci_hcd
nls_base 6377 1 usbcore
# free
total used free shared buffers cached
Mem: 8166128 1228036 6938092 0 140412 782060
-/+ buffers/cache: 305564 7860564
Swap: 2102456 0 2102456
So, basically my questions are:
- How can I diagnose this further?
- Is there any data in the log above that could help me to isolate the troublemaker?
- Are there any known problems with the above hardware/software I overlooked when googling for it?
- Is there a way to prevent the kernel from autoloading modules (I probably don't need all these modules and one of them might be the culprit)
Solution 1:
Check your memory!
The most frequent cause of random segfaults like this is bad memory. Grab a memory checker (such as memtest86+) and test it.
Solution 2:
Things to start with... Check how much memory does the server have. Check the size of your swap partition. Check other log files for potential sources of information (syslog.) Check if there are known problems with the kernel version and your current hardware (or virtualisation system.) I'm running Debian 6 with this kernel in a small (vmware) vm, with no problems.