watchdog: BUG: soft lockup - CPU#6 stuck for 23s

I tried every solution I found on Google, but I can't find out why my server is crashing.

Aug  5 17:11:08  kernel: [ 2300.084576] watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [VM Thread:4054]
Aug  5 17:11:08  kernel: [ 2300.084578] Modules linked in: veth nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo br_netfilter bridge stp llc rpcsec_gss_krb5 auth_rpcgss aufs nfsv4 nfs lockd grace fscache overlay isofs xt_nat xt_MASQUERADE xt_addrtype iptable_nat nf_nat xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ppdev kvm_intel kvm ipmi_si input_leds joydev ipmi_devintf ipmi_msghandler video parport_pc parport acpi_pad sch_fq_codel drm sunrpc ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid1 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel hid_generic aesni_intel crypto_simd cryptd glue_helper usbhid igb hid nvme dca ahci i2c_algo_bit nvme_core libahci
Aug  5 17:11:08  kernel: [ 2300.084616] CPU: 6 PID: 4054 Comm: VM Thread Not tainted 5.4.0-42-generic #46-Ubuntu
Aug  5 17:11:08  kernel: [ 2300.084616] Hardware name: Intel Corporation S1200SP/S1200SP, BIOS S1200SP.86B.03.01.0042.013020190050 01/30/2019
Aug  5 17:11:08  kernel: [ 2300.084620] RIP: 0010:_raw_spin_lock+0x10/0x30
Aug  5 17:11:08  kernel: [ 2300.084621] Code: ff 01 00 00 75 07 4c 89 e0 41 5c 5d c3 e8 f8 f9 62 ff 4c 89 e0 41 5c 5d c3 90 0f 1f 44 00 00 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 01 c3 55 89 c6 48 89 e5 e8 c2 e1 62 ff 66 90 5d c3 66 66 2e 0f
Aug  5 17:11:08  kernel: [ 2300.084621] RSP: 0000:ffffa592c1bef760 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Aug  5 17:11:08  kernel: [ 2300.084622] RAX: 0000000000000000 RBX: 0000000000000100 RCX: ffff95314b79bc00
Aug  5 17:11:08  kernel: [ 2300.084622] RDX: 0000000000000001 RSI: 0000000000000588 RDI: ffff953145c1aeac
Aug  5 17:11:08  kernel: [ 2300.084623] RBP: ffffa592c1bef7b8 R08: ffff95314a5520f0 R09: 0000000000000000
Aug  5 17:11:08  kernel: [ 2300.084623] R10: 0000000000000000 R11: ffffffffffffffb8 R12: 0000000000000000
Aug  5 17:11:08  kernel: [ 2300.084623] R13: ffff953145c1ae00 R14: ffff95314b79bc00 R15: ffff953145c1aeac
Aug  5 17:11:08  kernel: [ 2300.084624] FS:  00007fa0e4151700(0000) GS:ffff953151580000(0000) knlGS:0000000000000000
Aug  5 17:11:08  kernel: [ 2300.084624] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug  5 17:11:08  kernel: [ 2300.084625] CR2: 0000000594832008 CR3: 000000045bf00003 CR4: 00000000003606e0
Aug  5 17:11:08  kernel: [ 2300.084625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug  5 17:11:08  kernel: [ 2300.084625] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug  5 17:11:08  kernel: [ 2300.084626] Call Trace:
Aug  5 17:11:08  kernel: [ 2300.084628]  ? scan_swap_map_slots+0x3cd/0x510
Aug  5 17:11:08  kernel: [ 2300.084629]  get_swap_pages+0x207/0x380
Aug  5 17:11:08  kernel: [ 2300.084630]  ? rmap_walk_anon+0x16f/0x260
Aug  5 17:11:08  kernel: [ 2300.084632]  get_swap_page+0xe3/0x210
Aug  5 17:11:08  kernel: [ 2300.084633]  add_to_swap+0x1a/0x70
Aug  5 17:11:08  kernel: [ 2300.084634]  shrink_page_list+0x4b3/0xbb0
Aug  5 17:11:08  kernel: [ 2300.084648]  shrink_inactive_list+0x201/0x3e0
Aug  5 17:11:08  kernel: [ 2300.084649]  shrink_node_memcg+0x137/0x370
Aug  5 17:11:08  kernel: [ 2300.084650]  shrink_node+0xbd/0x400
Aug  5 17:11:08  kernel: [ 2300.084650]  do_try_to_free_pages+0xd7/0x3a0
Aug  5 17:11:08  kernel: [ 2300.084651]  try_to_free_mem_cgroup_pages+0xf4/0x210
Aug  5 17:11:08  kernel: [ 2300.084653]  try_charge+0x2eb/0x810
Aug  5 17:11:08  kernel: [ 2300.084654]  ? find_get_entry+0xaf/0x170
Aug  5 17:11:08  kernel: [ 2300.084655]  mem_cgroup_try_charge+0x71/0x190
Aug  5 17:11:08  kernel: [ 2300.084656]  ? pagecache_get_page+0x2d/0x300
Aug  5 17:11:08  kernel: [ 2300.084657]  mem_cgroup_try_charge_delay+0x22/0x50
Aug  5 17:11:08  kernel: [ 2300.084658]  do_swap_page+0x220/0x9f0
Aug  5 17:11:08  kernel: [ 2300.084659]  __handle_mm_fault+0x73b/0x7a0
Aug  5 17:11:08  kernel: [ 2300.084659]  handle_mm_fault+0xca/0x200
Aug  5 17:11:08  kernel: [ 2300.084661]  do_user_addr_fault+0x1f9/0x450
Aug  5 17:11:08  kernel: [ 2300.084662]  __do_page_fault+0x58/0x90
Aug  5 17:11:08  kernel: [ 2300.084663]  do_page_fault+0x2c/0xe0
Aug  5 17:11:08  kernel: [ 2300.084664]  page_fault+0x34/0x40
Aug  5 17:11:08  kernel: [ 2300.084665] RIP: 0033:0x7fa168646be3
Aug  5 17:11:08  kernel: [ 2300.084666] Code: 4c 89 6d b8 49 89 5d 00 49 c7 45 08 00 00 00 00 4c 3b 6d b0 0f 83 1d 01 00 00 4c 89 6d b0 49 89 dd 4d 39 fd 0f 83 bd 00 00 00 <49> 8b 45 00 4c 89 eb 83 e0 03 48 83 f8 03 0f 84 09 01 00 00 42 0f
Aug  5 17:11:08  kernel: [ 2300.084666] RSP: 002b:00007fa0e41501b0 EFLAGS: 00010283
Aug  5 17:11:08  kernel: [ 2300.084667] RAX: 00000005237c2908 RBX: 0000000000000004 RCX: 00007fa0e41504b0
Aug  5 17:11:08  kernel: [ 2300.084667] RDX: 0000000000000004 RSI: 0000000594831fe8 RDI: 00007fa160745850
Aug  5 17:11:08  kernel: [ 2300.084668] RBP: 00007fa0e4150230 R08: 00000005237c28e8 R09: 00007fa1607458f0
Aug  5 17:11:08  kernel: [ 2300.084668] R10: 00007fa168f52d99 R11: 000000014b7bf600 R12: 00007fa1609924d0
Aug  5 17:11:08  kernel: [ 2300.084668] R13: 0000000594832008 R14: 0000000000000240 R15: 0000000595000000

I replaced all of my hardware except my disks.

When I start a Docker container (Pterodactyl) with Minecraft, it sometimes freezes with the error above. I can't find any relevant logs.

uname -a : Linux X-X-X 5.4.0-42-generic #46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

free -h :

              total        used        free      shared  buff/cache   available
Mem:           31Gi       594Mi        29Gi       4.0Mi       1.1Gi        30Gi
Swap:         1.0Gi          0B       1.0Gi

sysctl vm.swappiness : vm.swappiness = 60

sudo lshw -C memory :

  *-firmware
       description: BIOS
       vendor: Intel Corporation
       physical id: 6
       version: S1200SP.86B.03.01.0042.013020190050
       date: 01/30/2019
       size: 64KiB
       capacity: 16MiB
       capabilities: pci pnp upgrade shadowing cdboot bootselect edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer int10video acpi usb ls120boot zipboot biosbootspecification netboot uefi
  *-cache:0
       description: L1 cache
       physical id: 1a
       slot: L1 Cache
       size: 128KiB
       capacity: 128KiB
       capabilities: synchronous internal write-through instruction
       configuration: level=1
  *-cache:1
       description: L2 cache
       physical id: 1b
       slot: L2 Cache
       size: 1MiB
       capacity: 1MiB
       capabilities: synchronous internal write-through unified
       configuration: level=2
  *-cache:2
       description: L3 cache
       physical id: 1c
       slot: L3 Cache
       size: 8MiB
       capacity: 8MiB
       capabilities: synchronous internal write-back unified
       configuration: level=3
  *-cache
       description: L1 cache
       physical id: 19
       slot: L1 Cache
       size: 128KiB
       capacity: 128KiB
       capabilities: synchronous internal write-through data
       configuration: level=1
  *-memory
       description: System Memory
       physical id: 1e
       slot: System board or motherboard
       size: 32GiB
     *-bank:0
          description: [empty]
          vendor: Empty/NO DIMM
          physical id: 0
          slot: DIMM_A1
     *-bank:1
          description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
          product: KHX2400C15/16G
          vendor: Kingston
          physical id: 1
          serial: A800F9241
          slot: DIMM_A2
          size: 16GiB
          width: 64 bits
          clock: 2400MHz (0.4ns)
     *-bank:2
          description: [empty]
          vendor: Empty/NO DIMM
          physical id: 2
          slot: DIMM_B1
     *-bank:3
          description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
          product: KHX2400C15/16G
          vendor: Kingston
          physical id: 3
          serial: BE305496
          slot: DIMM_B2
          size: 16GiB
          width: 64 bits
          clock: 2400MHz (0.4ns)
  *-memory UNCLAIMED
       description: Memory controller
       product: 100 Series/C230 Series Chipset Family Power Management Controller
       vendor: Intel Corporation
       physical id: 1f.2
       bus info: pci@0000:00:1f.2
       version: 31
       width: 32 bits
       clock: 33MHz (30.3ns)
       capabilities: bus_master
       configuration: latency=0
       resources: memory:a2f10000-a2f13fff

grep -i swap /etc/fstab :

UUID="X-X-X-X-X" swap swap defaults 0 0
UUID="X-X-X-X-X" swap swap defaults 0 0
/swapfile swap swap defaults 0 0

Any ideas?


Solution 1:

Possible swap/memory problem.

BIOS

You have BIOS version S1200SP.86B.03.01.0042.013020190050 dated 01/30/2019.

There's a newer BIOS available, dated June 2020, and it can be downloaded here.

Note: Have good backups before updating the BIOS.
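
To confirm which BIOS is currently installed without rebooting, the DMI fields that the kernel exposes under /sys/class/dmi/id can be read directly. This is only a rough sketch: the function name show_bios_info and its optional directory argument are made up for illustration, and the sysfs path is assumed to be present (it normally is on x86 Linux).

```shell
# Sketch: print the BIOS fields the kernel exposes through sysfs.
# The optional directory argument exists only so the function can be
# pointed at test data; on a real system the default is used.
show_bios_info() {
    dmi_dir="${1:-/sys/class/dmi/id}"
    for field in bios_vendor bios_version bios_date; do
        if [ -r "$dmi_dir/$field" ]; then
            printf '%s: %s\n' "$field" "$(cat "$dmi_dir/$field")"
        else
            printf '%s: (not available)\n' "$field"
        fi
    done
}

show_bios_info
```

Comparing the printed version and date before and after the update is a quick way to check that the new BIOS was actually applied.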

Memtest

Go to https://www.memtest86.com/ and download/run their free memtest to test your memory. Get at least one complete pass of all the 4/4 tests to confirm good memory. This may take many hours to complete.

Update #1:

As I previously thought... you have swap problems.

You have THREE swap locations, as seen in /etc/fstab!

UUID="X-X-X-X-X" swap swap defaults 0 0
UUID="X-X-X-X-X" swap swap defaults 0 0
/swapfile swap swap defaults 0 0
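
To compare what /etc/fstab configures with what the kernel is actually using, something like the following may help. The helper name list_fstab_swap is hypothetical, and the sketch assumes the usual whitespace-separated fstab layout with the filesystem type in the third field.

```shell
# Hypothetical helper: print the active (non-commented) swap entries
# of an fstab file. Defaults to /etc/fstab when no path is given.
list_fstab_swap() {
    fstab="${1:-/etc/fstab}"
    # Skip comment lines, keep entries whose third field (type) is "swap".
    awk '$1 !~ /^#/ && $3 == "swap"' "$fstab"
}

# Entries configured at boot time.
if [ -r /etc/fstab ]; then
    list_fstab_swap
fi

# Swap areas the kernel is using right now.
if [ -r /proc/swaps ]; then
    cat /proc/swaps
fi
```

If the first listing shows more entries than you expect, that matches the situation described above.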

Do sudo swapoff -a # turn off swap

Then comment out ALL three of the above lines in /etc/fstab.
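
If you would rather not edit the file by hand, the commenting-out step could be scripted roughly as below. This is a sketch under assumptions: comment_out_swap is a made-up helper name, it treats any non-comment line whose third field is "swap" as a swap entry, and it is meant to be run on a copy of /etc/fstab that you review before copying it back.

```shell
# Hypothetical helper: prefix every active swap entry in the given
# fstab-style file with "#", keeping a backup next to it as <file>.bak.
comment_out_swap() {
    fstab="$1"
    cp "$fstab" "$fstab.bak" &&
    awk '$1 !~ /^#/ && $3 == "swap" { print "#" $0; next } { print }' \
        "$fstab.bak" > "$fstab"
}

# Suggested usage (left as comments on purpose, review before running):
#   cp /etc/fstab /tmp/fstab.new
#   comment_out_swap /tmp/fstab.new
#   diff /etc/fstab /tmp/fstab.new
#   sudo cp /tmp/fstab.new /etc/fstab
```

Working on a copy keeps a typo from leaving the system with an unbootable /etc/fstab.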

It's never ok to completely disable swap. It's not appropriate to have too small of a swap. You have both problems.

Let's create an appropriate /swapfile for your system.

Note: Incorrect use of the dd command can cause data loss. I suggest copying and pasting the commands rather than typing them.

sudo swapoff -a           # turn off swap
sudo rm -i /swapfile      # remove old /swapfile

sudo dd if=/dev/zero of=/swapfile bs=1M count=4096   # create a 4G /swapfile (4096 x 1M blocks)

sudo chmod 600 /swapfile  # set proper file protections
sudo mkswap /swapfile     # init /swapfile
sudo swapon /swapfile     # turn on swap
free -h                   # confirm 32G RAM and 4G swap
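
As an optional sanity check on the file created above, the size and permissions can be verified with a small function. This is a sketch only: check_swapfile is a hypothetical name, and because /swapfile is readable by root alone, it has to run in a root shell to read the real file.

```shell
# Hypothetical helper: check that a swap file has the expected size
# (in MiB) and is readable/writable by its owner only (mode 600).
check_swapfile() {
    file="$1"
    want_mib="$2"
    [ -f "$file" ] || { echo "missing: $file"; return 1; }
    # wc -c reports the size in bytes; $(( )) strips any padding.
    size_bytes=$(( $(wc -c < "$file") ))
    if [ "$size_bytes" -ne $(( want_mib * 1024 * 1024 )) ]; then
        echo "unexpected size: $size_bytes bytes"
        return 1
    fi
    # find prints the path only when the mode is exactly 600.
    if [ -z "$(find "$file" -prune -perm 600)" ]; then
        echo "mode is not 600"
        return 1
    fi
    echo "ok: $file is $want_mib MiB with mode 600"
}

# Usage, from a root shell:
#   check_swapfile /swapfile 4096
```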

Add this line to /etc/fstab...

/swapfile    none    swap    sw      0   0

Then reboot the system and verify operation.
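
One way to verify after the reboot is to sum the Size column of /proc/swaps, which the kernel reports in KiB. The function name swap_total_kib is an assumption made for this sketch; with only the new 4G /swapfile active, the total should come out close to 4194304 KiB.

```shell
# Hypothetical helper: total configured swap in KiB, read from a
# /proc/swaps-style file (defaults to /proc/swaps itself).
swap_total_kib() {
    swaps="${1:-/proc/swaps}"
    # Skip the header line and sum the Size column (third field, in KiB).
    awk 'NR > 1 { total += $3 } END { print total + 0 }' "$swaps"
}

if [ -r /proc/swaps ]; then
    echo "total swap: $(swap_total_kib) KiB"
fi
```

If more than one line appears in /proc/swaps after the reboot, one of the old entries in /etc/fstab is probably still active.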

If it all works, you can use gparted to delete the two disk partitions with the UUIDs shown in the commented-out lines in /etc/fstab. Be careful here, and make sure that you've got the correct partitions to delete. Then delete those three commented-out lines in /etc/fstab.

Solution 2:

Although the question seems answered, for anyone who finds themselves here with the same CPU error: in addition to heynnema's answer, check the PCIe power cable connections to any graphics cards you have installed.

I had the same errors and problems stopped after disconnecting a graphics card which I later realised had a faulty (and charred) 6-Pin connection. Replacing the cable returned system functions to normal.

I would also recommend checking that the CPU/memory timings are not too aggressive and that the CPU cooler is attached correctly (tightly).

Solution 3:

I had this error on a VM in a locally run VM farm whose disks were full. The hypervisor was not able to allocate more space to "thin" disk partitions (these have physical space allocated on demand, and the farm was oversubscribed). Note that the hypervisor requires a certain overhead to run (perhaps 10%), and will reserve that space.

It turned out that one of the physical machines had had a problem and wasn't reporting freed-up disk space, which led to the VM farm hallucinating that the disks were full. When that machine was rebooted, the problem went away. We're doing an OS and hypervisor update; hopefully that will prevent the issue in the future.