How to troubleshoot CPU HW crash in Ubuntu 18.04

I bought a new computer a few months ago. I installed Ubuntu 18.04 and it works fine except when I compile C++ code: it freezes hard as soon as there is a spike of high CPU usage (10+ cores).

The only working workaround is to compile with -j8. Going -j10 or above will make the system crash most of the time. -j16 crashes 100% of the time with big projects (and no ccache).

Details about my setup:

  • Asus gaming computer: Asus Strix GT15 - Best Buy link. You've guessed it, I bought it for the GPU... otherwise I would have built it myself with good quality components (especially PSU and heatsink).
  • MB: Asus Strix B460-G Gaming
  • CPU: Intel Core i7-10700KF
  • Power supply: Unknown OEM 500W 80 PLUS
  • The crash occurs when the GPU is idle (desktop).
  • I can't install a more recent Ubuntu version due to the required work environment.

What I tried that did not resolve the issue (it's a little less frequent, but still happening):

BIOS:

  • Reduced the Turbo time to the minimum (1s instead of 60s), since the CPU heatsink seems very inefficient for this furnace of a CPU.
  • Reduced the number of Amps AND maximum Wattage the CPU/motherboard is allowed to use, in case the PSU is too weak.
  • Made the fans ramp up sooner, when the CPU temp hits 50C (temps are not much better, but now it's very loud when compiling)
  • Replaced the OEM "thermal paste" with a high quality paste (reduced temps by 2-3C)

Crash notes:

  • journalctl -b -1 doesn't have any trace about a crash, so I think it's a HW CPU crash...
  • Ctrl-Alt-F* keys do not work
  • Can't connect via ssh after the crash
  • Audio crashes too when it happens
  • I don't think the PSU is the problem because I can use stress -c 16 and ./gpu_burn 300 at the same time and the system doesn't crash. Stress only uses sqrt()...
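
Since journalctl keeps nothing from the freeze, one way to capture the last readings before it happens is to log temperatures with an fsync after every sample. The following is only a rough sketch: it assumes the standard Linux sysfs layout `/sys/class/thermal/thermal_zone*/temp` (millidegrees), and the function names and the `freeze-temps.log` path are made up for illustration. Which zone corresponds to the CPU package varies by machine.

```python
import glob
import os
import time


def read_zone_temps(pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Return {path: degrees C} for every readable thermal zone."""
    temps = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                temps[path] = int(f.read().strip()) / 1000.0  # sysfs reports millidegrees
        except (OSError, ValueError):
            continue  # some zones refuse reads; skip them
    return temps


def log_sample(log_file, temps):
    """Append one timestamped line and force it to disk before returning."""
    fields = " ".join(f"{t:.1f}" for t in temps.values())
    log_file.write(f"{time.time():.3f} {fields}\n")
    log_file.flush()
    os.fsync(log_file.fileno())  # so the line survives a hard freeze


def run(log_path="freeze-temps.log", interval=0.1, samples=None,
        pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Sample every `interval` seconds; samples=None means loop until killed."""
    count = 0
    with open(log_path, "a") as log_file:
        while samples is None or count < samples:
            log_sample(log_file, read_zone_temps(pattern))
            count += 1
            time.sleep(interval)
```

Calling `run()` in one terminal and starting the compile in another should leave the last few samples before the freeze at the end of the log file after the reboot.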

Thanks in advance!

Update #1

Temps:

  • without these BIOS setting changes, temps would easily go up to 90C after sustained 100% CPU usage. At those temps, I did not let it run for long.
  • after the modifications, temps rarely go above 80C.
  • The freeze seems to be related to a sudden spike in CPU usage, not to high CPU temps.
  • room temp is 20-22C
  • idle CPU temp is 27-28C

Current kernel:

uname -a
Linux rog 5.4.0-87-generic #98~18.04.1-Ubuntu SMP Wed Sep 22 10:45:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Solution 1:

Everyone should understand the thermal characteristics of their computer, and provide adequate protection. Often users are not aware of how extremely rapidly the processor package temperature can increase with a step function load. An example from my 20.04 test server:

doug@s19:~$ sudo turbostat --quiet --Summary --show PkgWatt,PkgTmp --interval 0.1
PkgTmp  PkgWatt
33  1.88    
33  1.69    
33  1.56    
33  1.74    
49  24.99   800 degrees per second
57  133.28  80 degrees per second
61  133.66  40 degrees per second
61  132.58  0 degrees per second
63  133.57  
64  134.12

The load was applied about 4/5ths of the way through the 0.1 second sample that reads 49 degrees: judging by the package power, the load was active for only about 25 / (133.5 - 1.7) ~= 20% of that sample. In that time the temperature had already gone up 16 degrees, which works out to 800 degrees per second. The load here was the prime95 torture test, the maximum heat sub-test. The example computer is water cooled, with the water pump always on at maximum rate. The processor is an i5-10600K.
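
The arithmetic behind those rates can be reproduced with a few lines of Python. This is just an illustrative sketch: the function name and the `first_load_fraction` parameter are my own, and the temperatures are the PkgTmp values from the turbostat output above.

```python
def heating_rates(temps_c, interval_s, first_load_fraction=1.0):
    """Degrees per second between consecutive samples.

    first_load_fraction shortens the first interval, for a load that was
    only active during that fraction of it.
    """
    rates = []
    for i in range(1, len(temps_c)):
        dt = interval_s * (first_load_fraction if i == 1 else 1.0)
        rates.append((temps_c[i] - temps_c[i - 1]) / dt)
    return rates


samples = [33, 49, 57, 61, 61]      # PkgTmp column around the load step
load_fraction = 25 / (133.5 - 1.7)  # about 0.19, rounded to 0.2 in the text
print([round(r) for r in heating_rates(samples, 0.1, 0.2)])  # [800, 80, 40, 0]
```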

For ASUS motherboards, please know that the CPU fan sensor is actually an external thermistor that will lag the actual processor package temperature both in time and value. On my ASUS motherboard, under heavy load, the CPU fan sensor lags the actual processor temperature by 12 degrees.

In the end, it is possible for the processor package temperature to hit the shutdown limit so fast that various monitoring programs or daemons don't even notice. Sometimes thermal protection needs to react sooner to have time to take effect before any overshoot temperature triggers a shutdown.

Method 1: Thermald

For `/etc/thermald/thermal-conf.xml`, use the very basic and simple configuration, as per the `man thermal-conf.xml` page:

<?xml version="1.0"?>

<!--
use "man thermal-conf.xml" for details
-->

<!-- BEGIN -->
<ThermalConfiguration>
        <Platform>
                <Name>Overide CPU default passive</Name>
                <ProductName>*</ProductName>
                <Preference>QUIET</Preference>
                <ThermalZones>
                        <ThermalZone>
                                <Type>cpu</Type>
                                <TripPoints>
                                        <TripPoint>
                                                <Temperature>41000</Temperature>
                                                <type>passive</type>
                                        </TripPoint>
                                </TripPoints>
                        </ThermalZone>
                </ThermalZones>
        </Platform>
</ThermalConfiguration>
<!-- END -->
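
Because the trip point is given in millidegrees, it is easy to mistype. A small sanity check like the sketch below prints what a configuration actually says in degrees C. It uses only Python's standard `xml.etree.ElementTree`; the function name and the shortened `SAMPLE` string are assumptions for illustration, not part of thermald.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the configuration above, used only as sample input.
SAMPLE = """
<ThermalConfiguration>
  <Platform>
    <Name>Overide CPU default passive</Name>
    <ProductName>*</ProductName>
    <Preference>QUIET</Preference>
    <ThermalZones>
      <ThermalZone>
        <Type>cpu</Type>
        <TripPoints>
          <TripPoint>
            <Temperature>41000</Temperature>
            <type>passive</type>
          </TripPoint>
        </TripPoints>
      </ThermalZone>
    </ThermalZones>
  </Platform>
</ThermalConfiguration>
"""


def trip_points_celsius(xml_text):
    """Return (zone type, trip type, degrees C) for every trip point found."""
    root = ET.fromstring(xml_text.strip())
    result = []
    for zone in root.iter("ThermalZone"):
        zone_type = zone.findtext("Type", default="").strip()
        for trip in zone.iter("TripPoint"):
            millideg = int(trip.findtext("Temperature").strip())
            kind = trip.findtext("type", default="").strip()
            result.append((zone_type, kind, millideg / 1000.0))
    return result


print(trip_points_celsius(SAMPLE))  # [('cpu', 'passive', 41.0)]
```

To check the real file, read `/etc/thermald/thermal-conf.xml` and pass its contents to the same function.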

Note: I am using a ridiculously low trip point of 41 degrees because my system is water cooled and cannot reach the temperatures that would make a realistic example.

doug@s19:~$ sudo systemctl start thermald
doug@s19:~$ sudo systemctl status thermald
● thermald.service - Thermal Daemon Service
     Loaded: loaded (/lib/systemd/system/thermald.service; disabled; vendor preset: enabled)
     Active: active (running) since Fri 2021-11-05 07:41:45 PDT; 17s ago
   Main PID: 3461 (thermald)
      Tasks: 2 (limit: 38214)
     Memory: 2.2M
     CGroup: /system.slice/thermald.service
             └─3461 /usr/sbin/thermald --systemd --dbus-enable --adaptive

Nov 05 07:41:45 s19 systemd[1]: Starting Thermal Daemon Service...
Nov 05 07:41:45 s19 systemd[1]: Started Thermal Daemon Service.
Nov 05 07:41:45 s19 thermald[3461]: 22 CPUID levels; family:model:stepping 0x6:a5:5 (6:165:5)
Nov 05 07:41:45 s19 thermald[3461]: 22 CPUID levels; family:model:stepping 0x6:a5:5 (6:165:5)
Nov 05 07:41:45 s19 thermald[3461]: Polling mode is enabled: 4
Nov 05 07:41:45 s19 thermald[3461]: sensor id 5 : No temp sysfs for reading raw temp
Nov 05 07:41:45 s19 thermald[3461]: sensor id 5 : No temp sysfs for reading raw temp
Nov 05 07:41:45 s19 thermald[3461]: sensor id 5 : No temp sysfs for reading raw temp
Nov 05 07:41:45 s19 thermald[3461]: XML zone: invalid sensor type []

Although the thermald status output shows some complaints, it actually works properly, if a little slow to respond:

doug@s19:~$ sudo turbostat --quiet --Summary --show PkgWatt,PkgTmp --interval 1
PkgTmp  PkgWatt
33      1.44
33      1.34
33      1.33
58      63.26
61      114.43
61      114.68
48      86.59
47      55.48
47      55.53
41      42.77
43      33.43
41      34.30
41      28.04
43      33.63
40      34.45
44      33.57
41      34.40
44      33.85
34      14.50
34      1.33
34      1.33

Adjust the trip point as needed to get the most out of your system while still preventing the overshoot peak from causing a shutdown. Too low a trip point might reduce system performance to undesirable levels.
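
When tuning the trip point, it helps to put a number on the overshoot. The sketch below parses two-column `PkgTmp PkgWatt` output like the run above and reports the peak temperature and how far it went past the trip point. The function names are hypothetical, and the sample text is the first part of the turbostat output shown earlier.

```python
SAMPLE_OUTPUT = """\
PkgTmp  PkgWatt
33      1.44
33      1.34
33      1.33
58      63.26
61      114.43
61      114.68
48      86.59
47      55.48
47      55.53
41      42.77
"""


def parse_summary(text):
    """Parse 'PkgTmp PkgWatt' rows into (temp_c, watts) tuples, skipping the header."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        try:
            rows.append((int(parts[0]), float(parts[1])))
        except ValueError:
            continue  # header or other non-numeric line
    return rows


def overshoot(rows, trip_c):
    """Return (peak temperature, degrees above the trip point)."""
    peak = max(temp for temp, _ in rows)
    return peak, peak - trip_c


rows = parse_summary(SAMPLE_OUTPUT)
print(overshoot(rows, 41))  # (61, 20)
```

On a system with a realistic trip point, the peak plus some margin needs to stay below the shutdown limit.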

Method 2: TCC Offset

If your kernel is new enough and your processor is supported, TCC offset can be used to have the processor itself do the thermal throttling. Depending on the timing window parameters, the response time can be much faster. For this example, the timing window was set in BIOS to the fastest response time.

First, find which cooling device is the TCC offset:

doug@s19:~$ grep . /sys/devices/virtual/thermal/cooling_device*/type
/sys/devices/virtual/thermal/cooling_device0/type:Fan
/sys/devices/virtual/thermal/cooling_device10/type:Processor
/sys/devices/virtual/thermal/cooling_device11/type:Processor
/sys/devices/virtual/thermal/cooling_device12/type:Processor
/sys/devices/virtual/thermal/cooling_device13/type:Processor
/sys/devices/virtual/thermal/cooling_device14/type:Processor
/sys/devices/virtual/thermal/cooling_device15/type:Processor
/sys/devices/virtual/thermal/cooling_device16/type:Processor
/sys/devices/virtual/thermal/cooling_device17/type:intel_powerclamp
/sys/devices/virtual/thermal/cooling_device18/type:TCC Offset
/sys/devices/virtual/thermal/cooling_device1/type:Fan
/sys/devices/virtual/thermal/cooling_device2/type:Fan
/sys/devices/virtual/thermal/cooling_device3/type:Fan
/sys/devices/virtual/thermal/cooling_device4/type:Fan
/sys/devices/virtual/thermal/cooling_device5/type:Processor
/sys/devices/virtual/thermal/cooling_device6/type:Processor
/sys/devices/virtual/thermal/cooling_device7/type:Processor
/sys/devices/virtual/thermal/cooling_device8/type:Processor
/sys/devices/virtual/thermal/cooling_device9/type:Processor
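
The device number differs between machines, so the same lookup can be scripted rather than read off the list by eye. A minimal sketch, assuming the same sysfs directory as the grep above and the usual `cur_state` and `max_state` files of a cooling device; the function names are my own:

```python
import glob
import os


def find_cooling_devices(wanted_type="TCC Offset", base="/sys/devices/virtual/thermal"):
    """Return the cooling_device directories whose 'type' file matches wanted_type."""
    matches = []
    for type_path in sorted(glob.glob(os.path.join(base, "cooling_device*", "type"))):
        try:
            with open(type_path) as f:
                dev_type = f.read().strip()
        except OSError:
            continue  # unreadable entry; skip it
        if dev_type == wanted_type:
            matches.append(os.path.dirname(type_path))
    return matches


def read_state(device_dir):
    """Return (cur_state, max_state) of a cooling device as integers."""
    values = []
    for name in ("cur_state", "max_state"):
        with open(os.path.join(device_dir, name)) as f:
            values.append(int(f.read().strip()))
    return tuple(values)


for device in find_cooling_devices():
    print(device, read_state(device))
```

If nothing is printed, the kernel or processor most likely does not expose a TCC offset cooling device.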

It is device 18. Set the offset and then check it via turbostat without the --quiet option:

doug@s19:~$ echo 59 | sudo tee /sys/devices/virtual/thermal/cooling_device18/cur_state
59
doug@s19:~$ sudo /home/doug/temp-k-git/linux/tools/power/x86/turbostat/turbostat --Summary --show Bzy_MHz,PkgWatt,PkgTmp --interval 0.1
turbostat version 21.05.04 - Len Brown <[email protected]>
CPUID(0): GenuineIntel 0x16 CPUID levels
CPUID(1): family:model:stepping 0x6:a5:5 (6:165:5) microcode 0xec
...
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x3b641422 (41 C) (100 default - 59 offset)
cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x883f0800 (37 C)
...
Bzy_MHz PkgTmp  PkgWatt
800     33      1.35
800     33      1.34
800     34      1.40
4187    49      86.23
4100    52      91.72
4100    53      91.29
...

Notice that the throttling is virtually immediate; 4.8 GHz would have been the un-throttled CPU frequency. Note that the throttling limit for my processor (not all processors) is the non-turbo maximum clock frequency of 4.1 GHz, so it cannot actually hold the ridiculously low limit of 41 degrees.
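
For anyone wondering how the "(100 default - 59 offset)" part of the turbostat line relates to the raw register value, here is a small decoding sketch. It assumes the default target sits in bits 23:16 and the offset in the low bits starting at bit 24, with a 6-bit offset field as on this processor; other processors may use a narrower field, so treat `offset_bits` as an assumption to verify for yours.

```python
def decode_temperature_target(msr_value, offset_bits=6):
    """Split MSR_IA32_TEMPERATURE_TARGET into (default, offset, effective) degrees C."""
    default_c = (msr_value >> 16) & 0xFF                      # bits 23:16
    offset_c = (msr_value >> 24) & ((1 << offset_bits) - 1)   # bits 24 and up
    return default_c, offset_c, default_c - offset_c


print(decode_temperature_target(0x3B641422))  # (100, 59, 41)
```

The result matches the 59 written to cur_state and the 41 C that turbostat reports.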