Server randomly freezes

I'm facing a very strange issue: my Debian Squeeze server always freezes at night (Berlin time). Below is what I see in the logs from time to time; after this has happened a few times, the server freezes completely and must be hard-reset.

From /var/log/messages

Dec 11 01:36:11 srv156 kernel: [125983.204251] CPU 1:
Dec 11 01:36:11 srv156 kernel: [125983.204251] Modules linked in: xt_multiport     nf_conntrack_ipv4 nf_defrag_ipv4 xt_recent xt_state nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables hwmon_vid snd_hda_codec_atihdmi snd_hda_intel snd_hda_codec snd_hwdep snd_pcm radeon snd_timer ttm drm_kms_helper snd k10temp i2c_piix4 soundcore snd_page_alloc edac_core parport_pc drm i2c_algo_bit i2c_core shpchp pci_hotplug pcspkr edac_mce_amd parport wmi evdev processor button ext3 jbd mbcache raid1 md_mod sd_mod crc_t10dif ata_generic ahci ohci_hcd pata_atiixp e100 mii libata xhci floppy ehci_hcd thermal thermal_sys usbcore scsi_mod nls_base [last unloaded: i2c_dev]
Dec 11 01:36:11 srv156 kernel: [125983.204251] Pid: 758, comm: flush-9:0 Tainted: G    B      2.6.32-5-amd64 #1 GA-78LMT-USB3
Dec 11 01:36:11 srv156 kernel: [125983.204251] RIP: 0010:[<ffffffff810b3506>]  [<ffffffff810b3506>] find_get_pages_tag+0x66/0xdd
Dec 11 01:36:11 srv156 kernel: [125983.204251] RSP: 0018:ffff8804235e7b30  EFLAGS: 00000286
Dec 11 01:36:11 srv156 kernel: [125983.204251] RAX: ffffffffffffffff RBX: ffff8804235e7c00 RCX: 0000000000000000
Dec 11 01:36:11 srv156 kernel: [125983.204251] RDX: 0000000000040000 RSI: ffffea000496b2a8 RDI: ffffea000496b2a0
Dec 11 01:36:11 srv156 kernel: [125983.204251] RBP: ffffffff8101166e R08: ffff8804235e7af0 R09: 0000000000000000
Dec 11 01:36:11 srv156 kernel: [125983.204251] R10: 0000000000000000 R11: 0000000000040000 R12: ffff8804235e7c08
Dec 11 01:36:11 srv156 kernel: [125983.204251] R13: 0000000d22678a20 R14: ffff8804235e7af0 R15: 00000000091b9060
Dec 11 01:36:11 srv156 kernel: [125983.204251] FS:  0000000000000000(0000) GS:ffff880010440000(0000) knlGS:000000007ebf7b70
Dec 11 01:36:11 srv156 kernel: [125983.204522] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Dec 11 01:36:11 srv156 kernel: [125983.204522] CR2: 00000000dec86000 CR3: 0000000001001000 CR4: 00000000000006e0
Dec 11 01:36:11 srv156 kernel: [125983.204522] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 11 01:36:11 srv156 kernel: [125983.204522] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 11 01:36:11 srv156 kernel: [125983.204522] Call Trace:
Dec 11 01:36:11 srv156 kernel: [125983.204522]  [<ffffffff810bb792>] ? pagevec_lookup_tag+0x1a/0x21
Dec 11 01:36:11 srv156 kernel: [125983.204522]  [<ffffffff810ba330>] ? write_cache_pages+0x162/0x327
Dec 11 01:36:11 srv156 kernel: [125983.204522]  [<ffffffff810b9d48>] ? __writepage+0x0/0x25
Dec 11 01:36:11 srv156 kernel: [125983.204522]  [<ffffffff8110758a>] ? writeback_single_inode+0xe7/0x2da
Dec 11 01:36:11 srv156 kernel: [125983.204522]  [<ffffffff81108290>] ? writeback_inodes_wb+0x424/0x4ff
Dec 11 01:36:11 srv156 kernel: [125983.204522]  [<ffffffff81108497>] ? wb_writeback+0x12c/0x1ab
Dec 11 01:36:11 srv156 kernel: [125983.204522]  [<ffffffff8110870d>] ? wb_do_writeback+0x14f/0x165
Dec 11 01:36:11 srv156 kernel: [125983.204522]  [<ffffffff81108754>] ? bdi_writeback_task+0x31/0xaa
Dec 11 01:36:11 srv156 kernel: [125983.204522]  [<ffffffff810c8664>] ? bdi_start_fn+0x0/0xd0
Dec 11 01:36:11 srv156 kernel: [125983.204522]  [<ffffffff810c86d4>] ? bdi_start_fn+0x70/0xd0
Dec 11 01:36:11 srv156 kernel: [125983.204522]  [<ffffffff810c8664>] ? bdi_start_fn+0x0/0xd0
Dec 11 01:36:11 srv156 kernel: [125983.204522]  [<ffffffff81064ac1>] ? kthread+0x79/0x81
Dec 11 01:36:11 srv156 kernel: [125983.204522]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
Dec 11 01:36:11 srv156 kernel: [125983.204522]  [<ffffffff81064a48>] ? kthread+0x0/0x81
Dec 11 01:36:11 srv156 kernel: [125983.204522]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20

From /var/log/syslog

Dec 10 21:20:29 srv156 kernel: [110625.162930] BUG: Bad page map in process java  pte:14fa4f067 pmd:424b54067
Dec 10 21:20:29 srv156 kernel: [110625.162937] page:ffffea000496c148 flags:0200000000000878 count:2 mapcount:-1 mapping:ffff88014f8d7de8 index:2f4
Dec 10 21:20:29 srv156 kernel: [110625.162946] addr:0000000009096000 vm_flags:00100077 anon_vma:ffff880422410d40 mapping:(null) index:9096
Dec 10 21:20:29 srv156 kernel: [110625.162955] Pid: 21356, comm: java Tainted: G    B      2.6.32-5-amd64 #1
Dec 10 21:20:29 srv156 kernel: [110625.162961] Call Trace:
Dec 10 21:20:29 srv156 kernel: [110625.162966]  [<ffffffff810ca4bf>] ? print_bad_pte+0x232/0x24a
Dec 10 21:20:29 srv156 kernel: [110625.162973]  [<ffffffff810cb56f>] ? unmap_vmas+0x62d/0x931
Dec 10 21:20:29 srv156 kernel: [110625.162980]  [<ffffffff810cfc74>] ? exit_mmap+0xc4/0x148
Dec 10 21:20:29 srv156 kernel: [110625.162986]  [<ffffffff8104bbc1>] ? mmput+0x3c/0xdf
Dec 10 21:20:29 srv156 kernel: [110625.162992]  [<ffffffff8104f81e>] ? exit_mm+0x102/0x10d
Dec 10 21:20:29 srv156 kernel: [110625.162998]  [<ffffffff81051243>] ? do_exit+0x1f8/0x6c9
Dec 10 21:20:29 srv156 kernel: [110625.163004]  [<ffffffff81071abb>] ? futex_wake+0xd6/0xe7
Dec 10 21:20:29 srv156 kernel: [110625.163010]  [<ffffffff8105178a>] ? do_group_exit+0x76/0x9d
Dec 10 21:20:29 srv156 kernel: [110625.163016]  [<ffffffff8105df9f>] ? get_signal_to_deliver+0x310/0x339
Dec 10 21:20:29 srv156 kernel: [110625.163023]  [<ffffffff81010037>] ? do_notify_resume+0x87/0x73f
Dec 10 21:20:29 srv156 kernel: [110625.163029]  [<ffffffff810cc664>] ? handle_mm_fault+0x7aa/0x80f
Dec 10 21:20:29 srv156 kernel: [110625.163036]  [<ffffffff81073f14>] ? compat_sys_futex+0x10d/0x12b
Dec 10 21:20:29 srv156 kernel: [110625.163043]  [<ffffffff812fb546>] ? do_page_fault+0x2e0/0x2fc
Dec 10 21:20:29 srv156 kernel: [110625.163049]  [<ffffffff81010e0e>] ? int_signal+0x12/0x17
Dec 10 21:20:29 srv156 kernel: [110625.163114] BUG: Bad page state in process java  pfn:14fa0c
Dec 10 21:20:29 srv156 kernel: [110625.163120] page:ffffea000496b2a0 flags:020000000002001c count:0 mapcount:-1 mapping:ffff88039dc0db30 index:11e3
Dec 10 21:20:29 srv156 kernel: [110625.164563] Pid: 21356, comm: java Tainted: G    B      2.6.32-5-amd64 #1
Dec 10 21:20:29 srv156 kernel: [110625.164570] Call Trace:
Dec 10 21:20:29 srv156 kernel: [110625.164578]  [<ffffffff810b71a9>] ? bad_page+0x116/0x129
Dec 10 21:20:29 srv156 kernel: [110625.164586]  [<ffffffff810b7692>] ? free_pages_check+0x38/0x57
Dec 10 21:20:29 srv156 kernel: [110625.164595]  [<ffffffff810b89cf>] ? free_hot_cold_page+0x46/0x190
Dec 10 21:20:29 srv156 kernel: [110625.164603]  [<ffffffff810b8b82>] ? __pagevec_free+0x69/0x7f
Dec 10 21:20:29 srv156 kernel: [110625.164611]  [<ffffffff810bba3f>] ? release_pages+0x137/0x18d
Dec 10 21:20:29 srv156 kernel: [110625.164620]  [<ffffffff810d8559>] ? free_pages_and_swap_cache+0x57/0x73
Dec 10 21:20:29 srv156 kernel: [110625.164629]  [<ffffffff810cb5ed>] ? unmap_vmas+0x6ab/0x931
Dec 10 21:20:29 srv156 kernel: [110625.164637]  [<ffffffff810cfc74>] ? exit_mmap+0xc4/0x148
Dec 10 21:20:29 srv156 kernel: [110625.164644]  [<ffffffff8104bbc1>] ? mmput+0x3c/0xdf
Dec 10 21:20:29 srv156 kernel: [110625.164652]  [<ffffffff8104f81e>] ? exit_mm+0x102/0x10d
Dec 10 21:20:29 srv156 kernel: [110625.164660]  [<ffffffff81051243>] ? do_exit+0x1f8/0x6c9
Dec 10 21:20:29 srv156 kernel: [110625.164667]  [<ffffffff81071abb>] ? futex_wake+0xd6/0xe7
Dec 10 21:20:29 srv156 kernel: [110625.164675]  [<ffffffff8105178a>] ? do_group_exit+0x76/0x9d
Dec 10 21:20:29 srv156 kernel: [110625.164683]  [<ffffffff8105df9f>] ? get_signal_to_deliver+0x310/0x339
Dec 10 21:20:29 srv156 kernel: [110625.164692]  [<ffffffff81010037>] ? do_notify_resume+0x87/0x73f
Dec 10 21:20:29 srv156 kernel: [110625.164700]  [<ffffffff810cc664>] ? handle_mm_fault+0x7aa/0x80f

I only just found this last piece of the log, which is why I'm posting it now. It seems the Java process does something and then slowly begins to eat all of the server's resources. I don't know whether this could be the root cause.

I'm using Debian Squeeze. Output of uname -a:

Linux srv156 2.6.32-5-amd64 #1 SMP Sun Sep 23 11:00:33 UTC 2012 x86_64 GNU/Linux

I would really appreciate your help; I don't know what else to try.


Solution 1:

This looks like a memory problem, judging by the bad pages. Do you have mcelog configured, and has it produced any log files? http://mcelog.org/
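
To see how often and where the bad pages occur, the "BUG: Bad page" lines can be pulled out of the syslog. The following is a minimal Python sketch, not a definitive tool: the function name and regular expression are my own, and the physical address calculation assumes 4 KiB pages (the usual base page size on x86_64).

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB base pages, as is usual on x86_64

# Matches lines such as:
#   BUG: Bad page state in process java  pfn:14fa0c
#   BUG: Bad page map in process java  pte:14fa4f067 pmd:424b54067
BAD_PAGE_RE = re.compile(
    r"BUG: Bad page (map|state) in process (\S+)\s+(?:pfn:([0-9a-f]+))?"
)


def find_bad_pages(lines):
    """Return one dict per 'BUG: Bad page' line found in the given log lines."""
    events = []
    for line in lines:
        match = BAD_PAGE_RE.search(line)
        if not match:
            continue
        kind, process, pfn_hex = match.groups()
        # Only 'Bad page state' lines carry a page frame number (pfn).
        phys_addr = int(pfn_hex, 16) * PAGE_SIZE if pfn_hex else None
        events.append({"kind": kind, "process": process, "phys_addr": phys_addr})
    return events


if __name__ == "__main__":
    # Two lines copied from the syslog excerpt in the question.
    sample = [
        "Dec 10 21:20:29 srv156 kernel: [110625.162930] BUG: Bad page map "
        "in process java  pte:14fa4f067 pmd:424b54067",
        "Dec 10 21:20:29 srv156 kernel: [110625.163114] BUG: Bad page state "
        "in process java  pfn:14fa0c",
    ]
    for event in find_bad_pages(sample):
        addr = event["phys_addr"]
        where = hex(addr) if addr is not None else "n/a"
        print(event["kind"], event["process"], where)
```

If the reported physical addresses keep clustering in the same region across several freezes, that would be consistent with a faulty memory module, though it does not prove it on its own.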

Solution 2:

Your kernel is tainted with the B flag, which means:

B: A process has been found in a Bad page state, indicating a corruption of the virtual memory subsystem, possibly caused by malfunctioning RAM or cache memory.
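
To check whether a running kernel currently carries this flag, the taint bitmask exposed in /proc/sys/kernel/tainted can be decoded. Below is a minimal Python sketch, assuming the bit assignments from the kernel's tainted-kernels documentation (B is bit 5, i.e. value 32); the function names are my own, and only the first ten flags are listed.

```python
# Taint bits as assigned in the kernel's tainted-kernels documentation.
# Only bits 0-9 are listed here; newer kernels define more.
TAINT_FLAGS = {
    0: ("P", "proprietary module was loaded"),
    1: ("F", "module was force loaded"),
    2: ("S", "kernel running on an out-of-specification system"),
    3: ("R", "module was force unloaded"),
    4: ("M", "processor reported a machine check exception"),
    5: ("B", "bad page referenced or unexpected page flags"),
    6: ("U", "taint requested by userspace"),
    7: ("D", "kernel died recently (OOPS or BUG)"),
    8: ("A", "ACPI table overridden by user"),
    9: ("W", "kernel issued a warning"),
}


def decode_taint(mask):
    """Return (letter, meaning) pairs for every known bit set in mask."""
    return [
        (letter, meaning)
        for bit, (letter, meaning) in sorted(TAINT_FLAGS.items())
        if mask & (1 << bit)
    ]


def read_taint_mask(path="/proc/sys/kernel/tainted"):
    """Read the current taint bitmask from procfs (Linux only)."""
    with open(path) as handle:
        return int(handle.read().strip())


if __name__ == "__main__":
    # 32 == 1 << 5, the value expected when only the B flag is set.
    for letter, meaning in decode_taint(32):
        print(f"{letter}: {meaning}")
```

On the affected machine, decode_taint(read_taint_mask()) would show every flag that is set, not just B.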

The backtraces in /var/log/syslog also report a Bad page map, so, as @Jim has already mentioned, you should probably check your RAM.

Solution 3:

After 30 days, the server stopped freezing up.

It turned out to be a problem with the memory. Thanks, everyone!