Isolating cause of 0x124 WHEA_UNCORRECTABLE_ERROR consistently at address ntoskrnl.exe+4b314c
I have a Windows 7 64-bit machine that's freezing up roughly once a month. The last five minidumps all indicate "Caused by address" ntoskrnl.exe+4b314c, and I'm trying to figure out who owns (or is triggering failed calls of) the code at that address.
Here's the !analyze -v output from the most recent minidump:
Microsoft (R) Windows Debugger Version 6.3.9600.17029 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.
Loading Dump File [C:\Windows\Minidump\102116-50450-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available
************* Symbol Path validation summary **************
Response Time (ms) Location
Deferred SRV*C:\SymCache*http://msdl.microsoft.com/download/symbols
Symbol search path is: SRV*C:\SymCache*http://msdl.microsoft.com/download/symbols
Executable search path is:
Windows 7 Kernel Version 7601 (Service Pack 1) MP (12 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 7601.19160.amd64fre.win7sp1_gdr.160211-0600
Machine Name:
Kernel base = 0xfffff800`04201000 PsLoadedModuleList = 0xfffff800`04448730
Debug session time: Fri Oct 21 16:47:24.260 2016 (UTC - 7:00)
System Uptime: 0 days 0:00:25.275
Loading Kernel Symbols
.
Press ctrl-c (cdb, kd, ntsd) or ctrl-break (windbg) to abort symbol loads that take too long.
Run !sym noisy before .reload to track down problems loading symbols.
..............................................................
..........
Loading User Symbols
Mini Kernel Dump does not contain unloaded driver list
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
Use !analyze -v to get detailed debugging information.
BugCheck 124, {0, fffffa802d3f77c8, 0, 0}
Probably caused by : GenuineIntel
Followup: MachineOwner
---------
7: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000000, Machine Check Exception
Arg2: fffffa802d3f77c8, Address of the WHEA_ERROR_RECORD structure.
Arg3: 0000000000000000, High order 32-bits of the MCi_STATUS value.
Arg4: 0000000000000000, Low order 32-bits of the MCi_STATUS value.
Debugging Details:
------------------
BUGCHECK_STR: 0x124_GenuineIntel
CUSTOMER_CRASH_COUNT: 1
DEFAULT_BUCKET_ID: WIN7_DRIVER_FAULT
PROCESS_NAME: System
CURRENT_IRQL: 0
ANALYSIS_VERSION: 6.3.9600.17029 (debuggers(dbg).140219-1702) amd64fre
STACK_TEXT:
fffff880`03d1d6f0 fffff800`044c5cb9 : fffffa80`2d3f77a0 fffffa80`24f7eb50 00000000`00000029 00000000`00000000 : nt!WheapCreateLiveTriageDump+0x6c
fffff880`03d1dc10 fffff800`043a4c07 : fffffa80`2d3f77a0 fffff800`0441f2d8 fffffa80`24f7eb50 00000000`00000000 : nt!WheapCreateTriageDumpFromPreviousSession+0x49
fffff880`03d1dc40 fffff800`0430bc55 : fffff800`04481ba0 00000000`00000001 fffffa80`2d456090 fffffa80`24f7eb50 : nt!WheapProcessWorkQueueItem+0x57
fffff880`03d1dc80 fffff800`0427e065 : fffff880`01776e00 fffff800`0430bc30 fffffa80`24f7eb00 00000000`00000000 : nt!WheapWorkQueueWorkerRoutine+0x25
fffff880`03d1dcb0 fffff800`0450fc6a : 00000000`00000000 fffffa80`24f7eb50 00000000`00000080 fffffa80`24eda870 : nt!ExpWorkerThread+0x111
fffff880`03d1dd40 fffff800`04266086 : fffff880`03b31180 fffffa80`24f7eb50 fffff880`03b3c1c0 00000000`00000000 : nt!PspSystemThreadStartup+0x5a
fffff880`03d1dd80 00000000`00000000 : fffff880`03d1e000 fffff880`03d18000 fffff880`03d1d9e0 00000000`00000000 : nt!KxStartSystemThread+0x16
STACK_COMMAND: kb
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: GenuineIntel
IMAGE_NAME: GenuineIntel
DEBUG_FLR_IMAGE_TIMESTAMP: 0
IMAGE_VERSION:
FAILURE_BUCKET_ID: X64_0x124_GenuineIntel_PROCESSOR_MAE_PRV
BUCKET_ID: X64_0x124_GenuineIntel_PROCESSOR_MAE_PRV
ANALYSIS_SOURCE: KM
FAILURE_ID_HASH_STRING: km:x64_0x124_genuineintel_processor_mae_prv
FAILURE_ID_HASH: {435e2195-e498-1e77-0526-f8d7450275e5}
Followup: MachineOwner
And here is the output of !errrec fffffa802d3f77c8
7: kd> !errrec fffffa802d3f77c8
===============================================================================
Common Platform Error Record @ fffffa802d3f77c8
-------------------------------------------------------------------------------
Record Id : 01d22bf56b81ac86
Severity : Fatal (1)
Length : 864
Creator : Microsoft
Notify Type : Machine Check Exception
Timestamp : 10/21/2016 23:47:24 (UTC)
Flags : 0x00000002 PreviousError
===============================================================================
Section 0 : Processor Generic
-------------------------------------------------------------------------------
Descriptor @ fffffa802d3f7848
Section @ fffffa802d3f7920
Offset : 344
Length : 192
Flags : 0x00000001 Primary
Severity : Fatal
Proc. Type : x86/x64
Instr. Set : x64
Error Type : Micro-Architectural Error
Flags : 0x00
CPU Version : 0x00000000000206c0
Processor ID : 0x0000000000000000
===============================================================================
Section 1 : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor @ fffffa802d3f7890
Section @ fffffa802d3f79e0
Offset : 536
Length : 64
Flags : 0x00000000
Severity : Fatal
Local APIC Id : 0x0000000000000000
CPU Id : c0 06 02 00 00 08 20 00 - ff e3 9e 02 ff fb eb bf
00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
===============================================================================
Section 2 : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor @ fffffa802d3f78d8
Section @ fffffa802d3f7a20
Offset : 600
Length : 264
Flags : 0x00000000
Severity : Fatal
Error : Unknown (Proc 0 Bank 2)
Status : 0xb200000000010005
This is a whitebox machine built several years ago (with parts upgraded over time to stay current). Periodically I make sure it still passes all the stress tests I can throw at it (Prime95, Memtest86, etc). I tried some brief retests with no failures, and am going to rerun full cycles overnight.
I thought the freezes originally began after I installed several pieces of software (possibly including drivers) a year or two ago, but didn't have time to investigate or troubleshoot at the time. I can't remember which software it would have been or exactly when (and to be honest, that could be unrelated or a different set of BSOD's already solved). I did go through a while back and cull software / drivers, particularly anything that might have looked suspect or appeared in other, older BSOD's (e.g. stuff like cbfs5.sys).
I've applied the most recent BIOS updates and the latest drivers that work properly for me. (Some of the hardware is old, and in rare cases the newest drivers cause other problems.) Most Windows updates are installed, though some from the last couple of months might not have been applied yet. Since it's a fairly critical workstation, I take a very controlled approach to updates, creating a full backup image beforehand and running a set of regression tests after each update cycle. As a result I'm slow to update, but this machine is generally more stable and predictable than others I maintain that are set to auto-update. That's one reason I'm putting off Win 10 for now.
Temperatures all seem reasonable.
My system is configured to write Kernel memory dumps, but for reasons unknown to me one isn't being written when this issue occurs (it occurred earlier today but my MEMORY.DMP at that path has a modified date of nearly a month ago).
The motherboard is an Asus P6T6 WS Revolution (X58 chipset) and the CPU is a 2.4GHz Hex Core Intel Xeon E5645. I have 48 GB of ECC RAM installed.
I don't have a ton of experience analyzing memory dumps, and would be grateful for any help or suggestions.
Solution 1:
The fault, as hinted at in the error record, comes from the processor’s Machine-Check Architecture.
Some background from MSDN’s Ntdebugging blog: Interpreting a WHEA error for a MCA fault.
You can find all the gory details of MCA in Chapter 15 of the Intel Software Developer’s Manual, Volume 3B.
The useful information in the dump is the last line of the error record, which is the value of the associated IA32_MCi_STATUS model-specific register. That is documented in section 15.3.2.2 of the Intel manual. Your value of 0xb200000000010005 breaks down as:
- Bit 63: Register valid
- Bit 61: Error uncorrected
- Bit 60: Error enabled
- Bit 57: Processor context corrupt
- Bits 31–16: Model-specific error code 1 (which does not appear to be publicly documented for your processor)
- Bits 15–0: MCA error code 5 (which according to Table 15‑8 in section 15.9.1 means Internal parity error)
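If you want to check the breakdown yourself, or decode the status values from your other minidumps, here is a minimal Python sketch of the bit extraction. The function name and dictionary keys are my own labels, not anything from WinDbg or Intel. The bit positions are the ones listed above, plus the overflow (62), MISCV (59) and ADDRV (58) flags from the same section of the Intel manual, which are all clear in your value.

```python
def decode_mci_status(status: int) -> dict:
    """Split an IA32_MCi_STATUS value into its architectural fields.

    Key names are illustrative labels; bit positions follow the
    Intel SDM layout of the register.
    """
    def bit(n: int) -> bool:
        # True if bit n of the 64-bit status value is set
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),            # VAL: register holds a valid error
        "overflow": bit(62),         # OVER: an earlier error was overwritten
        "uncorrected": bit(61),      # UC: hardware could not correct the error
        "enabled": bit(60),          # EN: error reporting was enabled
        "misc_valid": bit(59),       # MISCV: IA32_MCi_MISC has extra info
        "addr_valid": bit(58),       # ADDRV: IA32_MCi_ADDR has an address
        "context_corrupt": bit(57),  # PCC: processor context corrupt
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31-16
        "mca_error_code": status & 0xFFFF,               # bits 15-0
    }


if __name__ == "__main__":
    fields = decode_mci_status(0xB200000000010005)
    for name, value in fields.items():
        print(f"{name:20} {value}")
```

Running this against the Status value from each of the five dumps should show whether the MCA error code and bank are the same every time, which is the kind of consistency that points at one piece of hardware.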
I don’t know whether all that suggests your CPU, or motherboard, or some other hardware might be faulty. It seems unlikely to be a software issue, though, because software should not be able to cause an internal hardware error like this.
You might like to try changing your dump settings from “Small memory dump” to “Kernel memory dump” and waiting for the fault to happen again; perhaps the extra information in the larger dump file will give you some additional clues to what’s going on at the time of the crash.