PSOD caused by removal of a pshare hint from the pshare hash table

After upgrading an ESXi host to 6.5 U1, the host may experience a PSOD. This post helps verify the symptoms and summarizes the cause and the resolution of the issue.

Symptoms:

In the vmkernel logs, the following messages appear prior to the PSOD:

04:40:54.761Z cpu21:14392120)WARNING: UserMem: 14034: vmx-vthread-6: vpn 0xa00bc795 status: "Invalid address" (bad0026)
04:40:54.763Z cpu21:14392120)WARNING: UserMem: 14034: vmx-vthread-6: vpn 0xa00bc7b5 status: "Invalid address" (bad0026)
04:40:54.764Z cpu21:14392120)WARNING: UserMem: 14034: vmx-vthread-6: vpn 0xa00bc7d5 status: "Invalid address" (bad0026)
04:40:54.765Z cpu21:14392120)WARNING: UserMem: 14034: vmx-vthread-6: vpn 0xa00bc7f5 status: "Invalid address" (bad0026)
04:40:54.766Z cpu21:14392120)WARNING: UserMem: 14034: vmx-vthread-6: vpn 0xa00bc815 status: "Invalid address" (bad0026)
04:40:54.768Z cpu21:14392120)WARNING: UserMem: 14034: vmx-vthread-6: vpn 0xa00bc835 status: "Invalid address" (bad0026)

PSOD stack:

VmMemCowPShareRemoveWithCheck@vmkernel#nover+0x10f stack: 0x418011d0
VmMemCow_CopyPageWithMPN@vmkernel#nover+0x19f stack: 0x3fffffffff,0
VmMemPf@vmkernel#nover+0x133 stack: 0x449fd475255d69,
PShareHashTableWalkMatchMPN@vmkernel#nover+0x2d stack: 0x3110dc
PShare_RemoveHint@vmkernel#nover+0xb3 stack: 0x4391ccaa7000
VmMemCow_PShareRemoveHint@vmkernel#nover+0x72 stack: 0x4391ccc1bef8
VmMemCowPFrameRemoveHint@vmkernel#nover+0xc6 stack: 0x304
VmMemCowPShareFn@vmkernel#nover+0x5c3 stack: 0x6422bec
VmAssistantProcessTasks@vmkernel#nover+0x144 stack: 0x0
CpuSched_StartWorld@vmkernel#nover+0x99 stack: 0x0

Multiple PCPUs are reported as possibly locked up:

10:49:52.654Z cpu4:155673)WARNING: Heartbeat: 794: PCPU 51 didn't have a heartbeat for 9 seconds; *may* be locked up.
10:49:52.654Z cpu52:172048)WARNING: Heartbeat: 794: PCPU 27 didn't have a heartbeat for 12 seconds; *may* be locked
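
To check a saved copy of a vmkernel log for the symptoms above, a small script along these lines may help. It is only a sketch: the patterns are derived from the excerpts in this post, the function name `scan_vmkernel_log` and the counter names are made up for illustration, and matching log lines alone does not prove that a host hit this particular issue.

```python
import re

# Signatures derived from the log excerpts in this post (assumed to be
# representative; the exact message text may differ between builds).
USERMEM_RE = re.compile(
    r'WARNING: UserMem: \d+: .*status: "Invalid address" \(bad0026'
)
HEARTBEAT_RE = re.compile(
    r"WARNING: Heartbeat: \d+: PCPU (\d+) didn't have a heartbeat"
)
# Two frames from the PSOD stack that point at pshare hint removal.
STACK_FRAMES = ("VmMemCowPShareRemoveWithCheck", "PShare_RemoveHint")


def scan_vmkernel_log(lines):
    """Count the lines that match each symptom signature."""
    counts = {
        "usermem_invalid_address": 0,
        "heartbeat_lockup": 0,
        "pshare_stack_frame": 0,
    }
    for line in lines:
        if USERMEM_RE.search(line):
            counts["usermem_invalid_address"] += 1
        elif HEARTBEAT_RE.search(line):
            counts["heartbeat_lockup"] += 1
        elif any(frame in line for frame in STACK_FRAMES):
            counts["pshare_stack_frame"] += 1
    return counts


# Sample lines copied from the excerpts above, plus one unrelated line.
sample = [
    '04:40:54.761Z cpu21:14392120)WARNING: UserMem: 14034: vmx-vthread-6: '
    'vpn 0xa00bc795 status: "Invalid address" (bad0026)',
    "10:49:52.654Z cpu4:155673)WARNING: Heartbeat: 794: PCPU 51 didn't have "
    "a heartbeat for 9 seconds; *may* be locked up.",
    "VmMemCowPShareRemoveWithCheck@vmkernel#nover+0x10f stack: 0x418011d0",
    "some unrelated log line",
]
print(scan_vmkernel_log(sample))
```

To scan a real file, pass an open file object instead of the sample list, for example `scan_vmkernel_log(open("vmkernel.log", errors="replace"))`.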

Cause: The PSOD is due to memory corruption caused when an MPN mapped to the VMX process is incorrectly updated.

Resolution: The issue is fixed in ESXi 6.5 P02 (build 7312210), released on 2017-12-19.
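
To check whether a host is already at or past the fixed build, the build number can be compared against the one given in the resolution. The sketch below assumes a version string containing a `build-NNNNNNN` token, as printed by `vmware -v` on the host. The helper names `parse_build` and `has_fix` are hypothetical, the sample strings are illustrative only, and the comparison is only meaningful within the same release line (6.5 here).

```python
import re


def parse_build(version_string):
    """Extract the integer build number from a 'build-NNNNNNN' token."""
    match = re.search(r"build-(\d+)", version_string)
    if match is None:
        raise ValueError("no build number found in: %r" % version_string)
    return int(match.group(1))


def has_fix(version_string, fixed_build):
    """Return True if the reported build is at or above fixed_build."""
    return parse_build(version_string) >= fixed_build


# Fixed build number as stated in this post.
FIXED_BUILD = 7312210

# Illustrative version strings, not taken from a real host.
print(has_fix("VMware ESXi 6.5.0 build-5969303", FIXED_BUILD))
print(has_fix("VMware ESXi 6.5.0 build-9999999", FIXED_BUILD))
```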
