How to determine why the processor frequency scales down to ~200 MHz due to ThermStatus
I am trying to determine why an embedded industrial computer (ARK-1550-S9A1E) with a dual-core 4th Gen Intel Core i5-4300U scales all of its cores down from 1.90 GHz to roughly 200 MHz.
Several tools (turbostat, rdmsr) indicate that the reason for the scale-down is ThermStatus, and the "Digital Readout" field shows 65 C/149 F.
The device is running Ubuntu 18.04 LTS server (no GUI, headless application), and the applications running on it take at most 20% of the CPU. Nothing on it should cause a spike in CPU utilization, so it is very surprising that it would be overheating. It is an industrial fanless PC, so it does have a lot of hardware to dissipate heat.
Below is the output from rdmsr and turbostat with the details of the register readings.
[email protected]_64:~$ cat /proc/cpuinfo | grep "MHz"
cpu MHz : 230.404
cpu MHz : 227.324
cpu MHz : 217.117
cpu MHz : 174.135
[email protected]_64:~$
[email protected]_64:~$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
performance
performance
performance
[email protected]_64:~$
[email protected]_64:~$ sudo rdmsr 0x770 -f 63:0
rdmsr: CPU 0 cannot read MSR 0x00000770
[email protected]_64:~$ sudo rdmsr 0x771 -f 63:0
rdmsr: CPU 0 cannot read MSR 0x00000771
[email protected]_64:~$ sudo rdmsr 0x772 -f 63:0
rdmsr: CPU 0 cannot read MSR 0x00000772
[email protected]_64:~$ sudo rdmsr 0x773 -f 63:0
rdmsr: CPU 0 cannot read MSR 0x00000773
[email protected]_64:~$ sudo rdmsr 0x775 -f 63:0
rdmsr: CPU 0 cannot read MSR 0x00000775
[email protected]_64:~$ sudo rdmsr 0x777 -f 63:0
rdmsr: CPU 0 cannot read MSR 0x00000777
[email protected]_64:~$ sudo rdmsr 0x19C -f 63:0
88410800
[email protected]_64:~$ sudo rdmsr 0x64E -f 63:0
rdmsr: CPU 0 cannot read MSR 0x0000064e
[email protected]_64:~$ sudo rdmsr 0x64F -f 63:0
rdmsr: CPU 0 cannot read MSR 0x0000064f
[email protected]_64:~$ sudo rdmsr 0x19B -f 63:0
13
[email protected]_64:~$
[email protected]_64$ ./intel-reg-pp.out
hello from intel_reg_pp!
[19CH] IA32_THERM_STATUS Register With HWP Feedback
Command to read: sudo rdmsr 0x19c - f 63:0
Value of register is: 88410800
64 60 50 40 30 20 10
43210987654321098765432109876543210987654321098765432109876543210
0b00000000000000000000000000000000010001000010000010000100000000000
└───────────────┬───────────────┘│└─┬┘└─┬┘└──┬──┘││││││││││││││││
Reserved │ │ │ │ ││││││││││││││││
Reading Valid ─────────────────────┘ │ │ │ ││││││││││││││││
Reading in Deg. Celcius ──────────────┘ │ │ ││││││││││││││││
Reserved ─────────────────────────────────┘ │ ││││││││││││││││
Digital Readout ───────────────────────────────┘ ││││││││││││││││ 65 C -> 149 F
Cross-domain Limit Log ────────────────────────────┘│││││││││││││││
Cross-domain Limit Status ──────────────────────────┘││││││││││││││
Current Limit Log ───────────────────────────────────┘│││││││││││││
Current Limit Status ─────────────────────────────────┘││││││││││││
Power Limit Notification Log ──────────────────────────┘│││││││││││
Power Limit Notification Status ────────────────────────┘││││││││││
Thermal Threshold #2 Log ────────────────────────────────┘│││││││││
Thermal Threshold #2 Status ──────────────────────────────┘││││││││
Thermal Threshold #1 Log ──────────────────────────────────┘│││││││
Thermal Threshold #1 Status ────────────────────────────────┘││││││
Critical Temperature Log ────────────────────────────────────┘│││││
Critical Temperature Status ──────────────────────────────────┘││││
PROCHOT# or FORCEPR# Log ──────────────────────────────────────┘│││
PROCHOT# or FORCEPR# Event ─────────────────────────────────────┘││
Thermal Status Log ──────────────────────────────────────────────┘│
Thermal Status ───────────────────────────────────────────────────┘
[64FH] MSR_CORE_PERF_LIMIT_REASONS
Command to read: sudo rdmsr 0x64f - f 63:0
Value of register is: 1c220002
64 60 50 40 30 20 10
43210987654321098765432109876543210987654321098765432109876543210
0b00000000000000000000000000000000000011100001000100000000000000010
└───────────────┬───────────────┘││││││└─┬─┘│││││││││││└─┬─┘││││
Reserved ││││││ │ │││││││││││ │ ││││
Maximum Efficiency Frequency Log ───┘│││││ │ │││││││││││ │ ││││
Turbo Transistion Attenuation Log ───┘││││ │ │││││││││││ │ ││││
Electical Design Point Log ───────────┘│││ │ │││││││││││ │ ││││
Max Turbo Limit Log ───────────────────┘││ │ │││││││││││ │ ││││
VR Them Alert Log ──────────────────────┘│ │ │││││││││││ │ ││││
Core Power Limiting Log ─────────────────┘ │ │││││││││││ │ ││││
Reserved ───────────────────────────────────┘ │││││││││││ │ ││││
Package-Level PL2 Power Limiting Log ──────────┘││││││││││ │ ││││
Package-Level PL1 Power Limiting Log ───────────┘│││││││││ │ ││││
Thermal Log ─────────────────────────────────────┘││││││││ │ ││││
PROCHOT Log ──────────────────────────────────────┘│││││││ │ ││││
Reserved ──────────────────────────────────────────┘││││││ │ ││││
Maximum Efficiency Frequency Status (R0)────────────┘│││││ │ ││││
Turbo Transition Attenuation Status (R0)─────────────┘││││ │ ││││
Electrical Design Point Status (R0)───────────────────┘│││ │ ││││
Max Turbo Limit Status (R0) ───────────────────────────┘││ │ ││││
VR Therm Alert Status (R0)──────────────────────────────┘│ │ ││││
Core Power Limiting Status (R0)──────────────────────────┘ │ ││││
Reserved ───────────────────────────────────────────────────┘ ││││
Package-Level PL2 Power Limiting Status (R0) ──────────────────┘│││
Package-Level Power Limiting PL1 Status (R0)────────────────────┘││
Thermal Status (R0) ─────────────────────────────────────────────┘│
PROCHOT Status (R0) ──────────────────────────────────────────────┘
[19BH] IA32_THERM_INTERRUPT
Command to read: sudo rdmsr 0x64f - f 63:0
Value of register is: 00000013
64 60 50 40 30 20 10
43210987654321098765432109876543210987654321098765432109876543210
0b10000000000000000000000000000000000000000000000000000000000010011
└───────────────┬──────────────────────┘│└──┬──┘│└──┬──┘└┬┘│││││
Reserved │ │ │ │ │ │││││
Threshold #2 INT Enable ───────────────────┘ │ │ │ │ │││││
Threshold #2 Value ────────────────────────────┘ │ │ │ │││││
Threshold #1 INT Enable ───────────────────────────┘ │ │ │││││
Threshold #1 Value ────────────────────────────────────┘ │ │││││
Reserved ───────────────────────────────────────────────────┘ │││││
Critical Temperature Enable ──────────────────────────────────┘││││
FORCEPR# INT Enable ───────────────────────────────────────────┘│││
PROCHOT# INT enable ────────────────────────────────────────────┘││
Low-Temperature INT enable ──────────────────────────────────────┘│
High-Temperature INT Enable ──────────────────────────────────────┘
decs@ubuntu:~/projects/intel-reg-pp/bin/x86/Debug$
[email protected]_64:~$ sudo turbostat
turbostat version 17.06.23 - Len Brown <[email protected]>
CPUID(0): GenuineIntel 13 CPUID levels; family:model:stepping 0x6:45:1 (6:69:1)
CPUID(1): SSE3 MONITOR SMX EIST TM2 TSC MSR ACPI-TM TM
CPUID(6): APERF, TURBO, DTS, PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu3: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST No-MWAIT PREFETCH TURBO)
CPUID(7): No-SGX
cpu3: MSR_MISC_PWR_MGMT: 0x00400000 (ENable-EIST_Coordination DISable-EPB DISable-OOB)
RAPL: 17476 sec. Joule Counter Range, at 15 Watts
cpu3: MSR_PLATFORM_INFO: 0x8083df3011900
8 * 100.0 = 800.0 MHz max efficiency frequency
25 * 100.0 = 2500.0 MHz base frequency
cpu3: MSR_IA32_POWER_CTL: 0x0004005d (C1E auto-promotion: DISabled)
cpu3: MSR_TURBO_RATIO_LIMIT: 0x1a1a1a1d
26 * 100.0 = 2600.0 MHz max turbo 4 active cores
26 * 100.0 = 2600.0 MHz max turbo 3 active cores
26 * 100.0 = 2600.0 MHz max turbo 2 active cores
29 * 100.0 = 2900.0 MHz max turbo 1 active cores
cpu3: MSR_CONFIG_TDP_NOMINAL: 0x00000013 (base_ratio=19)
cpu3: MSR_CONFIG_TDP_LEVEL_1: 0x0008005c (PKG_MIN_PWR_LVL1=0 PKG_MAX_PWR_LVL1=0 LVL1_RATIO=8 PKG_TDP_LVL1=92)
cpu3: MSR_CONFIG_TDP_LEVEL_2: 0x001900c8 (PKG_MIN_PWR_LVL2=0 PKG_MAX_PWR_LVL2=0 LVL2_RATIO=25 PKG_TDP_LVL2=200)
cpu3: MSR_CONFIG_TDP_CONTROL: 0x00000000 ( lock=0)
cpu3: MSR_TURBO_ACTIVATION_RATIO: 0x00000012 (MAX_NON_TURBO_RATIO=18 lock=0)
cpu3: MSR_PKG_CST_CONFIG_CONTROL: 0x1e008408 (UNdemote-C3, UNdemote-C1, demote-C3, demote-C1, locked: pkg-cstate-limit=8: unlimited)
cpu3: POLL: CPUIDLE CORE POLL IDLE
cpu3: C1: MWAIT 0x00
cpu3: C1E: MWAIT 0x01
cpu3: C3: MWAIT 0x10
cpu3: C6: MWAIT 0x20
cpu3: C7s: MWAIT 0x32
cpu3: C8: MWAIT 0x40
cpu3: C9: MWAIT 0x50
cpu3: C10: MWAIT 0x60
cpu3: cpufreq driver: intel_pstate
cpu3: cpufreq governor: performance
cpufreq intel_pstate no_turbo: 0
cpu3: MSR_MISC_FEATURE_CONTROL: 0x00000000 (L2-Prefetch L2-Prefetch-pair L1-Prefetch L1-IP-Prefetch)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_CORE_PERF_LIMIT_REASONS, 0x1c220002 (Active: ThermStatus, ) (Logged: MultiCoreTurbo, PkgPwrL2, PkgPwrL1, Auto-HWP, ThermStatus, )
cpu0: MSR_GFX_PERF_LIMIT_REASONS, 0x14020002 (Active: ThermStatus, ) (Logged: ThermStatus, PkgPwrL1, )
cpu0: MSR_RING_PERF_LIMIT_REASONS, 0x0c020000 (Active: ) (Logged: ThermStatus, PkgPwrL1, PkgPwrL2, )
cpu0: MSR_RAPL_POWER_UNIT: 0x000a0e03 (0.125000 Watts, 0.000061 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_INFO: 0x00000078 (15 W TDP, RAPL 0 - 0 W, 0.000000 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x804280c800dd80c8 (locked)
cpu0: PKG Limit #1: ENabled (25.000000 Watts, 28.000000 sec, clamp ENabled)
cpu0: PKG Limit #2: ENabled (25.000000 Watts, 0.002441* sec, clamp DISabled)
cpu0: MSR_PP0_POLICY: 0
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_PP1_POLICY: 0
cpu0: MSR_PP1_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: GFX Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00640000 (100 C)
cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x88400800 (36 C)
cpu0: MSR_IA32_PACKAGE_THERM_INTERRUPT: 0x00000003 (100 C, 100 C)
cpu3: MSR_PKGC3_IRTL: 0x00008842 (valid, 67584 ns)
cpu3: MSR_PKGC6_IRTL: 0x00008873 (valid, 117760 ns)
cpu3: MSR_PKGC7_IRTL: 0x00008891 (valid, 148480 ns)
cpu3: MSR_PKGC8_IRTL: 0x000088e4 (valid, 233472 ns)
cpu3: MSR_PKGC9_IRTL: 0x00008945 (valid, 332800 ns)
cpu3: MSR_PKGC10_IRTL: 0x000089ef (valid, 506880 ns)
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI C1 C1E C3 C6 C7s C8 C9 C10 C1% C1E% C3% C6% C7s% C8% C9% C10% CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp GFX%rc6 Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 Pkg%pc8 Pkg%pc9 Pk%pc10 PkgWatt CorWattGFXWatt
- - 157 69.94 225 2494 22821 0 447 1810 8751 389 1496 971 329 5 0.09 0.73 11.99 1.14 6.28 7.17 3.16 0.00 20.58 6.78 0.25 2.46 35 36 99.38 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.67 0.22 0.00
0 0 151 64.78 233 2494 6150 0 139 547 2166 145 501 335 80 0 0.11 0.94 11.59 1.74 8.75 9.61 3.02 0.00 22.16 9.01 0.30 3.75 35 36 99.38 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.67 0.22 0.00
0 2 146 68.06 216 2494 6206 0 120 418 2532 82 362 229 96 2 0.09 0.66 13.98 0.88 5.84 7.01 4.02 0.00 18.88
1 1 202 87.77 231 2494 3457 0 68 206 876 35 153 104 34 2 0.07 0.34 4.57 0.41 2.46 3.30 1.27 0.00 6.32 4.55 0.19 1.17 35
1 3 128 59.14 217 2494 7008 0 120 639 3177 127 480 303 119 1 0.09 1.00 17.82 1.52 8.09 8.76 4.33 0.00 34.95
^C
[email protected]_64:~$
What would be a good way to determine what is causing the frequency to scale down from 1.9 GHz to ~200 MHz?
The digital readout is relative to the Thermal Control Circuit (TCC) trip point; it is not an absolute temperature. MSR_TEMPERATURE_TARGET (0x00640000) doesn't seem to suggest any activation offset applied to the default 100°C TjMAX, so a readout of 65 should actually mean 65°C below TjMAX, i.e. ~35°C, which is indeed what the subsequent turbostat lines report (CoreTmp 35, PkgTmp 36).
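The subtraction can be checked against the raw values from the question. This is only a minimal sketch: the function names are made up for illustration, and it assumes the usual field layout (TjMAX in bits 23:16 of MSR_TEMPERATURE_TARGET, digital readout in bits 22:16 and "reading valid" in bit 31 of the thermal status registers).

```python
# Raw values copied from the rdmsr/turbostat output in the question.
IA32_THERM_STATUS = 0x88410800          # core 0, MSR 0x19C
IA32_PACKAGE_THERM_STATUS = 0x88400800  # as printed by turbostat
MSR_TEMPERATURE_TARGET = 0x00640000     # as printed by turbostat


def tj_max(temperature_target: int) -> int:
    """TjMAX in deg C (assumed to live in bits 23:16)."""
    return (temperature_target >> 16) & 0xFF


def digital_readout(therm_status: int) -> int:
    """Degrees below TjMAX (assumed bits 22:16); needs bit 31 'reading valid'."""
    if not (therm_status >> 31) & 1:
        raise ValueError("reading-valid bit is clear")
    return (therm_status >> 16) & 0x7F


def absolute_temp_c(therm_status: int, temperature_target: int) -> int:
    """Convert the relative readout into an absolute temperature."""
    return tj_max(temperature_target) - digital_readout(therm_status)


# Every bit above bit 23 is zero in the reported target value,
# which is consistent with no offset being applied.
assert MSR_TEMPERATURE_TARGET >> 24 == 0

print("TjMAX:", tj_max(MSR_TEMPERATURE_TARGET))
print("core readout:", digital_readout(IA32_THERM_STATUS))
print("core temp:", absolute_temp_c(IA32_THERM_STATUS, MSR_TEMPERATURE_TARGET))
print("package temp:",
      absolute_temp_c(IA32_PACKAGE_THERM_STATUS, MSR_TEMPERATURE_TARGET))
```

With the values above this gives 35 for the core and 36 for the package, matching the CoreTmp and PkgTmp columns.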
Anyway, MSR_CORE_PERF_LIMIT_REASONS still has the thermal status bit set to 1, despite no apparent reason. As you may know, that hints quite strongly at one of the Haswell errata (although, since everyone reporting something similar seems to be using Linux, I'm half wondering whether it could be triggered by a buggy scheduler).
Since we have no real information about the underlying cause, I'd just try the usual tricks from the throttling-fix toolkit. The only catch is that I don't know of a Linux equivalent of ThrottleStop.
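To see which bits are involved, here is a rough decoder for the value in the question. The names are the labels turbostat printed; the bit positions are my reading of how turbostat decodes this register (status bits in the low half, matching log bits 16 positions higher) and should be treated as an assumption to verify against the Intel SDM for this CPU model.

```python
# Value copied from the turbostat line for MSR_CORE_PERF_LIMIT_REASONS.
CORE_PERF_LIMIT_REASONS = 0x1C220002

# Assumed mapping of status bit -> turbostat label (not taken from the SDM).
STATUS_BITS = {
    0: "PROCHOT",
    1: "ThermStatus",
    4: "Graphics",
    5: "Auto-HWP",
    6: "VR-Therm",
    8: "Amps",
    9: "CorePwr",
    10: "PkgPwrL1",
    11: "PkgPwrL2",
    12: "MultiCoreTurbo",
    13: "Transitions",
}
LOG_SHIFT = 16  # each log bit is assumed to sit 16 bits above its status bit


def decode(value: int) -> tuple[list[str], list[str]]:
    """Return (active, logged) reason names for a raw register value."""
    active = [name for bit, name in STATUS_BITS.items()
              if (value >> bit) & 1]
    logged = [name for bit, name in STATUS_BITS.items()
              if (value >> (bit + LOG_SHIFT)) & 1]
    return active, logged


active, logged = decode(CORE_PERF_LIMIT_REASONS)
print("Active:", ", ".join(active))
print("Logged:", ", ".join(logged))
```

For 0x1c220002 the only active reason is ThermStatus (bit 1). The logged set is the same five names turbostat listed, in a different order.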