Discussion:
[Beowulf] SIMD exception kernel panic on Skylake-EP triggered by OpenFOAM?
Christopher Samuel
2018-09-10 01:04:37 UTC
Hi folks,

We've had 2 different nodes crash over the past few days with kernel
panics triggered by (what is recorded as) a "simd exception" (console
messages below). In both cases the triggering application is given as
the same binary, a user application built against OpenFOAM v16.06.

This doesn't happen every time; I can see about 28 successful runs of
the application this month (the binary was built at the end of August).

The system in question has 2 x 16C Xeon Gold 6140 Skylake-EP CPUs.

Any ideas?


------------------8< snip snip 8<------------------

2018-09-09 17:14:34 [179203.697285] simd exception: 0000 [#1] SMP
2018-09-09 17:14:34 [179203.701527] Modules linked in: squashfs loop
8021q garp mrp stp llc nvidia_uvm(POE) nvidia(POE) xfs skx_edac
intel_powerclamp coretemp intel_rapl iosf_mbi irqbypass crc32_pclmul
ghash_clmulni_intel iTCO_wdt iTCO_vendor_support rdma_ucm ib_ucm dcdbas
aesni_intel mgag200 lrw gf128mul glue_helper ablk_helper ttm ib_uverbs
cryptd drm_kms_helper dm_mod syscopyarea sysfillrect ib_umad sysimgblt
fb_sys_fops drm mei_me sg ipmi_si mei lpc_ich i2c_i801 shpchp nfit
ipmi_devintf ipmi_msghandler libnvdimm tpm_crb acpi_pad acpi_power_meter
binfmt_misc overlay(OET) osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE)
mdc(OE) fid(OE) lov(OE) ko2iblnd(OE) rdma_cm iw_cm ptlrpc(OE)
obdclass(OE) lnet(OE) libcfs(OE) ib_ipoib ib_cm sr_mod cdrom sd_mod
crc_t10dif crct10dif_generic hfi1 rdmavt i2c_algo_bit i2c_core ahci
crct10dif_pclmul crct10dif_common crc32c_intel libahci ib_core libata
megaraid_sas pps_core libcrc32c [last unloaded: pcspkr]
2018-09-09 17:14:34 [179203.784359] CPU: 2 PID: 159455 Comm:
shuangTwoPhaseE Tainted: P OE ------------ T
3.10.0-862.9.1.el7.x86_64 #1
2018-09-09 17:14:34 [179203.795389] Hardware name: Dell Inc. PowerEdge
R740/06G98X, BIOS 1.4.8 05/21/2018
2018-09-09 17:14:34 [179203.802958] task: ffff995c1aee8fd0 ti:
ffff995c1988c000 task.ti: ffff995c1988c000
2018-09-09 17:14:34 [179203.810539] RIP: 0010:[<ffffffffbe121791>]
[<ffffffffbe121791>] apic_timer_interrupt+0x141/0x170
2018-09-09 17:14:34 [179203.819515] RSP: 0000:ffff995c1da46200 EFLAGS:
00010082
2018-09-09 17:14:34 [179203.824928] RAX: ffff995c1988ff70 RBX:
0000000001a95e00 RCX: 0000000000000090
2018-09-09 17:14:34 [179203.832146] RDX: 0000000000000000 RSI:
ffff995c1da46200 RDI: ffff995c1988ff70
2018-09-09 17:14:34 [179203.839364] RBP: 00007ffd8b8ba848 R08:
0000000000000c40 R09: 0000000000000031
2018-09-09 17:14:34 [179203.846591] R10: 0000000000000000 R11:
0000000000e72148 R12: 0000000001c4e770
2018-09-09 17:14:34 [179203.853827] R13: 0000000000000007 R14:
00000000011935b0 R15: 0000000000000038
2018-09-09 17:14:34 [179203.861040] FS: 00002ad83f7afa00(0000)
GS:ffff995c1da40000(0000) knlGS:0000000000000000
2018-09-09 17:14:34 [179203.869213] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
2018-09-09 17:14:34 [179203.875042] CR2: 0000000002a18000 CR3:
00000017963f8000 CR4: 00000000007607e0
2018-09-09 17:14:34 [179203.882274] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
2018-09-09 17:14:34 [179203.889495] DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
2018-09-09 17:14:34 [179203.896714] PKRU: 55555554
2018-09-09 17:14:34 [179203.899530] Call Trace:
2018-09-09 17:14:34 [179203.902065] Code: 48 39 cc 77 2f 48 8d 81 00 fe
ff ff 48 39 e0 77 23 57 48 29 e1 65 48 8b 3c 25 78 0e 01 00 48 83 c7 28
48 29 cf 48 89 f8 48 89 e6 <f3> a4 48 89 c4 5f 48 89 e6 65 ff 04 25 60
0e 01 00 65 48 0f 44
2018-09-09 17:14:34 [179203.922628] RIP [<ffffffffbe121791>]
apic_timer_interrupt+0x141/0x170
2018-09-09 17:14:34 [179203.929259] RSP <ffff995c1da46200>
2018-09-09 17:14:34 [179203.933970] ---[ end trace 3912e5e8b3b86da4 ]---
2018-09-09 17:14:34 [179203.984039] Kernel panic - not syncing: Fatal
exception
2018-09-09 17:14:34 [179203.989451] Kernel Offset: 0x3ca00000 from
0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

------------------8< snip snip 8<------------------

------------------8< snip snip 8<------------------

2018-09-07 22:37:16 [201527.171417] simd exception: 0000 [#1] SMP
2018-09-07 22:37:16 [201527.176270] Modules linked in: squashfs loop
8021q garp mrp stp llc nvidia_uvm(POE) nvidia(POE) xfs skx_edac
intel_powerclamp coretemp intel_rapl iosf_mbi mgag200 ttm drm_kms_helper
irqbypass syscopyarea sysfillrect crc32_pclmul sysimgblt iTCO_wdt
fb_sys_fops ib_ucm iTCO_vendor_support ghash_clmulni_intel rdma_ucm
dm_mod dcdbas drm ib_uverbs aesni_intel lrw gf128mul glue_helper
ablk_helper cryptd mei_me sg lpc_ich i2c_i801 shpchp ib_umad mei ipmi_si
ipmi_devintf ipmi_msghandler nfit libnvdimm tpm_crb acpi_pad
acpi_power_meter binfmt_misc overlay(OET) osc(OE) mgc(OE) lustre(OE)
lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ko2iblnd(OE) rdma_cm iw_cm
ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) ib_ipoib ib_cm sd_mod sr_mod
cdrom crc_t10dif crct10dif_generic hfi1 rdmavt i2c_algo_bit ahci
i2c_core crct10dif_pclmul libahci crct10dif_common crc32c_intel ib_core
libata megaraid_sas pps_core libcrc32c [last unloaded: pcspkr]
2018-09-07 22:37:16 [201527.264079] CPU: 17 PID: 32227 Comm:
shuangTwoPhaseE Tainted: P W OE ------------ T
3.10.0-862.9.1.el7.x86_64 #1
2018-09-07 22:37:16 [201527.275789] Hardware name: Dell Inc. PowerEdge
R740/06G98X, BIOS 1.4.8 05/21/2018
2018-09-07 22:37:16 [201527.284045] task: ffff9e345a42eeb0 ti:
ffff9e2f88a0c000 task.ti: ffff9e2f88a0c000
2018-09-07 22:37:16 [201527.292302] RIP: 0010:[<ffffffffa6721791>]
[<ffffffffa6721791>] apic_timer_interrupt+0x141/0x170
2018-09-07 22:37:16 [201527.301978] RSP: 0000:ffff9e345c006200 EFLAGS:
00010082
2018-09-07 22:37:16 [201527.308091] RAX: ffff9e2f88a0ff70 RBX:
00007fffe0d21d78 RCX: 0000000000000090
2018-09-07 22:37:16 [201527.316032] RDX: 0000000000000000 RSI:
ffff9e345c006200 RDI: ffff9e2f88a0ff70
2018-09-07 22:37:16 [201527.323969] RBP: 00007fffe0d21d78 R08:
000000000001e800 R09: 00000000000007a0
2018-09-07 22:37:16 [201527.331906] R10: 0000000000000000 R11:
0000000002818868 R12: 0000000002d20790
2018-09-07 22:37:16 [201527.339839] R13: 00007fffe0d159d0 R14:
00007fffe0d15b40 R15: 00007fffe0d15a20
2018-09-07 22:37:16 [201527.347772] FS: 00002b835d26da00(0000)
GS:ffff9e345c000000(0000) knlGS:0000000000000000
2018-09-07 22:37:16 [201527.356659] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
2018-09-07 22:37:16 [201527.363209] CR2: 0000000003a6ff88 CR3:
0000002fdd8f6000 CR4: 00000000007607e0
2018-09-07 22:37:16 [201527.371144] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
2018-09-07 22:37:16 [201527.379079] DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
2018-09-07 22:37:16 [201527.387010] PKRU: 55555554
2018-09-07 22:37:16 [201527.390523] Call Trace:
2018-09-07 22:37:16 [201527.393780] Code: 48 39 cc 77 2f 48 8d 81 00 fe
ff ff 48 39 e0 77 23 57 48 29 e1 65 48 8b 3c 25 78 0e 01 00 48 83 c7 28
48 29 cf 48 89 f8 48 89 e6 <f3> a4 48 89 c4 5f 48 89 e6 65 ff 04 25 60
0e 01 00 65 48 0f 44
2018-09-07 22:37:16 [201527.415810] RIP [<ffffffffa6721791>]
apic_timer_interrupt+0x141/0x170
2018-09-07 22:37:16 [201527.423189] RSP <ffff9e345c006200>
2018-09-07 22:37:16 [201527.428646] ---[ end trace a6a14aed798e889f ]---
2018-09-07 22:37:17 [201527.477875] Kernel panic - not syncing: Fatal
exception
2018-09-07 22:37:17 [201527.484041] Kernel Offset: 0x25000000 from
0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

------------------8< snip snip 8<------------------
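
A quick cross-check of the two traces above, offered as a sketch rather than as part of the original report: each panic prints a KASLR "Kernel Offset", so subtracting it from the reported RIP should give the address of the faulting instruction in the unrelocated kernel image. The dictionary layout and the `unrelocated` helper are illustrative names; the numbers are copied from the logs.

```python
# Values copied from the two console logs above.
panics = {
    "2018-09-09": {"rip": 0xffffffffbe121791, "kernel_offset": 0x3ca00000},
    "2018-09-07": {"rip": 0xffffffffa6721791, "kernel_offset": 0x25000000},
}

def unrelocated(rip, kernel_offset):
    """Undo the KASLR slide reported on the 'Kernel Offset' line."""
    return rip - kernel_offset

# Map each panic date to the address in the unrelocated kernel image.
static_addrs = {day: unrelocated(**p) for day, p in panics.items()}

for day, addr in static_addrs.items():
    print(day, hex(addr))
```

Both dates map to the same address, which is consistent with the matching `apic_timer_interrupt+0x141/0x170` symbol and the identical Code: bytes in the two logs.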


All the best!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
_______________________________________________
Beowulf mailing list, ***@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Joe Landman
2018-09-10 01:16:56 UTC
I've not seen this one, but looking around a bit, I am wondering if the
code path hit a denormal underflow in a SIMD instruction, and didn't
have the appropriate SIMD exception mask.  See
https://software.intel.com/en-us/articles/x87-and-sse-floating-point-assists-in-ia-32-flush-to-zero-ftz-and-denormals-are-zero-daz
for info.

Basically, if there is a SIMD exception, and the exception isn't masked
off with an FTZ or similar, or an exception interrupt handler
registered, it could wind up somewhere like this.

If you have dumps from the crash, you could load them up in the
debugger.  Would be the most accurate route to determine why that was
triggered.
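
To illustrate what "masked" means here, the following is a minimal sketch that decodes an MXCSR value using the bit layout documented in the Intel SDM (sticky flags in bits 0-5, DAZ in bit 6, exception masks in bits 7-12, FTZ in bit 15). The function name `decode_mxcsr` and the sample values are assumptions for illustration; none of them were read from the crashed nodes.

```python
# MXCSR exception order, lowest bit first. The same order is used for
# the sticky flags (bits 0-5) and for the masks (bits 7-12).
EXCEPTIONS = ["invalid", "denormal", "divide_by_zero",
              "overflow", "underflow", "precision"]

def decode_mxcsr(value):
    """Split an MXCSR value into flags, masks and the FTZ/DAZ bits."""
    flags = [n for i, n in enumerate(EXCEPTIONS) if value & (1 << i)]
    masked = [n for i, n in enumerate(EXCEPTIONS) if value & (1 << (7 + i))]
    unmasked = [n for n in EXCEPTIONS if n not in masked]
    return {
        "flags_raised": flags,
        "masked": masked,
        "unmasked": unmasked,
        "daz": bool(value & (1 << 6)),   # denormals-are-zero
        "ftz": bool(value & (1 << 15)),  # flush-to-zero
    }

default_state = decode_mxcsr(0x1F80)               # power-on default
underflow_open = decode_mxcsr(0x1F80 & ~(1 << 11))  # underflow mask cleared
print(default_state["unmasked"], underflow_open["unmasked"])
```

With the power-on default of 0x1F80 every exception is masked and FTZ/DAZ are off. Clearing bit 11 leaves underflow unmasked, which is the kind of state in which an underflowing result would raise a trap instead of being handled silently.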
--
Joe Landman
e: ***@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

Christopher Samuel
2018-09-10 01:40:32 UTC
Post by Joe Landman
If you have dumps from the crash, you could load them up in the
debugger. Would be the most accurate route to determine why that was
triggered.
Thanks Joe! Looking at our nodes I don't think we've got crash dumps
enabled; I'll see if we can get that done.

Looking at the user's code there's no assembler there (all C++), so
I'm starting to think this might be the result of a compiler bug?

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Christopher Samuel
2018-09-26 01:42:09 UTC
Post by Joe Landman
If you have dumps from the crash, you could load them up in the
debugger. Would be the most accurate route to determine why that was
triggered.
Thanks Joe, after a bit of experimentation we've now successfully got a
crash dump. It seems to confirm what I thought was the case, in that the
process is off in kernel space dealing with an APIC interrupt (a timer
in this case) when a SIMD exception gets raised.

crash> bt
PID: 138341 TASK: ffff9fd7eb3c6eb0 CPU: 27 COMMAND: "shuangTwoPhaseE"
#0 [ffff9ff02ee6bc38] machine_kexec at ffffffff938629da
#1 [ffff9ff02ee6bc98] __crash_kexec at ffffffff93916692
#2 [ffff9ff02ee6bd68] crash_kexec at ffffffff93916780
#3 [ffff9ff02ee6bd80] oops_end at ffffffff93f1d738
#4 [ffff9ff02ee6bda8] die at ffffffff9382f96b
#5 [ffff9ff02ee6bdd8] math_error at ffffffff9382cca8
#6 [ffff9ff02ee6be98] do_simd_coprocessor_error at ffffffff9382cec8
#7 [ffff9ff02ee6bec0] simd_coprocessor_error at ffffffff93f28c9e
#8 [ffff9ff02ee6bf48] apic_timer_interrupt at ffffffff93f26791
RIP: 00002b1b5d406828 RSP: 00007fff1f596148 RFLAGS: 00000293
RAX: 00000000000005c8 RBX: 0000000000002bce RCX: 0000000002c979e0
RDX: 00000000000005cb RSI: 0000000002dcedf0 RDI: 00000000000000b9
RBP: 00007fff1f5a25d8 R8: 0000000000002d00 R9: 00000000000000b4
R10: 0000000000000000 R11: 00000000026bcb48 R12: ffff9ff05c1461e8
R13: 0000000000000000 R14: ffff9ff05c146200 R15: 0000000000010082
ORIG_RAX: ffffffffffffff10 CS: 0033 SS: 002b

The kernel code is pretty short for it, basically in the RHEL7 kernel
it comes down to:

Are we in user space?
No? Oh dear.
Is there a fixup registered for this address?
No? OK, goodbye cruel world...
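
The decision described above can be sketched as a small function. This paraphrases the summary in this post, not the actual RHEL7 source; the function name, arguments, and return strings are made up for illustration, and the user-space branch (a SIGFPE for the process) is the usual Linux behaviour rather than something shown in the trace.

```python
def handle_simd_trap(from_user_mode, has_fixup):
    """Return the outcome for a SIMD floating-point trap."""
    if from_user_mode:
        # Trap came from the application itself: signal the process.
        return "deliver SIGFPE to the process"
    if has_fixup:
        # Kernel code that anticipated a fault registered a recovery address.
        return "jump to the registered fixup"
    # Kernel mode and nothing registered: oops, then panic.
    return "die (oops / kernel panic)"

# The case in the backtrace above: the trap arrives while the CPU is in
# apic_timer_interrupt, i.e. kernel mode, with no fixup for that address.
print(handle_simd_trap(from_user_mode=False, has_fixup=False))
```

The backtrace matches the last branch: `do_simd_coprocessor_error` calls `math_error`, which ends in `die`.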

I've reached out to the maintainers of the arch/x86/ part of the tree
in case they had any general ideas on whether this was all the kernel
could be expected to do. The only feedback so far is that yes, this is
odd, plus a query to another developer about whether some additional
checks that are done when the process is in user space might also be
applicable if that process has called into the kernel at that point.

My suspicion is that the process is off doing some AVX stuff when
the timer interrupt occurs and an exception is either generated or just
happens to be delivered from the AVX unit at a bad time.

Going to see if I can persuade EasyBuild to compile OpenFOAM without
AVX-512 optimisations first and, if that doesn't fix it, try turning off
different things until the problem goes away.
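
One rough way to check whether a rebuilt binary still contains 512-bit code is to scan its disassembly for zmm register operands. The sketch below assumes AT&T-syntax text such as `objdump -d` produces; the helper name `zmm_lines` and the two-line sample are hypothetical and were not taken from the user's binary.

```python
import re

# Matches AT&T-syntax register operands such as %zmm0 .. %zmm31.
ZMM_OPERAND = re.compile(r"%zmm\d+")

def zmm_lines(disassembly):
    """Return the disassembly lines that reference a 512-bit zmm register."""
    return [line for line in disassembly.splitlines()
            if ZMM_OPERAND.search(line)]

# Hypothetical excerpt in the style of `objdump -d` output.
sample = (
    "  401000:\t62 f1 7c 48 28 00 \tvmovaps (%rax),%zmm0\n"
    "  401006:\tc5 fc 28 08       \tvmovaps (%rax),%ymm1\n"
)

hits = zmm_lines(sample)
print(len(hits), "line(s) use zmm registers")
```

An empty result is only a hint: AVX-512 instructions can also operate on xmm and ymm registers, so the absence of zmm operands does not prove the build is free of AVX-512.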

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Jonathan Engwall
2018-09-10 04:23:18 UTC
If it is helpful there are a few similar bugs, generally considered unreproducible. One thread calls it bogus xcomp_bv...the kernel clobbers itself writing zeroes when that is not the state. And spectre came up. One suggestion is to disable IBRS; according to other sources IBRS is dangerous to disable and should protect against Spectre. Maybe the OpenFOAM is to blame.

Something interesting about Spectre:
https://wiki.ubuntu.com/SecurityTeam/KnowledgeBase/SpectreAndMeltdown/MitigationControls

And something with a little similarity:
https://www.suse.com/support/kb/doc/?id=7017833

Unable to handle null pointer:
https://groups.google.com/forum/m/#!msg/linux.kernel/NQjqgvrJ18o/4DoP2nggAgAJ

And here is one with nvidia - I see you have NICs and it seems things went wrong with POE. Maybe this can help:
https://devtalk.nvidia.com/default/topic/972567/crash-in-centos-with-driver-319-76/


Chris Samuel
2018-09-12 07:40:13 UTC
Post by Jonathan Engwall
If it is helpful there are a few similar bugs, generally considered
unreproducible. One thread calls it bogus xcomp_bv...the kernel clobbers
itself writing zeroes when that is not the state. And spectre came up. One
suggestion is to disable IBRS; according to other sources IBRS is dangerous
to disable and should protect against Spectre. Maybe the OpenFOAM is to
blame.
Yeah, I suspect what we're seeing is different to that; it looks like
something manages to generate a SIMD exception whilst the kernel is dealing
with an APIC timer interrupt. A colleague has backported this patch that I
found to our CentOS kernel in case it helps.

https://lore.kernel.org/patchwork/patch/953364/

For now we've constrained this user's workload onto a handful of nodes, as
they are trying to get some project work done.

All the best!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

