Christopher Samuel
2018-09-10 01:04:37 UTC
Hi folks,
We've had 2 different nodes crash over the past few days with kernel
panics triggered by (what is recorded as) a "simd exception" (console
messages below). In both cases the triggering application is given as
the same binary, a user application built against OpenFOAM v16.06.
This doesn't happen every time; I can see about 28 successful runs of
the application this month (the binary was built at the end of August).
The system in question has 2 x 16C Xeon Gold 6140 Skylake-EP CPUs.
Any ideas?
------------------8< snip snip 8<------------------
2018-09-09 17:14:34 [179203.697285] simd exception: 0000 [#1] SMP
2018-09-09 17:14:34 [179203.701527] Modules linked in: squashfs loop
8021q garp mrp stp llc nvidia_uvm(POE) nvidia(POE) xfs skx_edac
intel_powerclamp coretemp intel_rapl iosf_mbi irqbypass crc32_pclmul
ghash_clmulni_intel iTCO_wdt iTCO_vendor_support rdma_ucm ib_ucm dcdbas
aesni_intel mgag200 lrw gf128mul glue_helper ablk_helper ttm ib_uverbs
cryptd drm_kms_helper dm_mod syscopyarea sysfillrect ib_umad sysimgblt
fb_sys_fops drm mei_me sg ipmi_si mei lpc_ich i2c_i801 shpchp nfit
ipmi_devintf ipmi_msghandler libnvdimm tpm_crb acpi_pad acpi_power_meter
binfmt_misc overlay(OET) osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE)
mdc(OE) fid(OE) lov(OE) ko2iblnd(OE) rdma_cm iw_cm ptlrpc(OE)
obdclass(OE) lnet(OE) libcfs(OE) ib_ipoib ib_cm sr_mod cdrom sd_mod
crc_t10dif crct10dif_generic hfi1 rdmavt i2c_algo_bit i2c_core ahci
crct10dif_pclmul crct10dif_common crc32c_intel libahci ib_core libata
megaraid_sas pps_core libcrc32c [last unloaded: pcspkr]
2018-09-09 17:14:34 [179203.784359] CPU: 2 PID: 159455 Comm:
shuangTwoPhaseE Tainted: P OE ------------ T
3.10.0-862.9.1.el7.x86_64 #1
2018-09-09 17:14:34 [179203.795389] Hardware name: Dell Inc. PowerEdge
R740/06G98X, BIOS 1.4.8 05/21/2018
2018-09-09 17:14:34 [179203.802958] task: ffff995c1aee8fd0 ti:
ffff995c1988c000 task.ti: ffff995c1988c000
2018-09-09 17:14:34 [179203.810539] RIP: 0010:[<ffffffffbe121791>]
[<ffffffffbe121791>] apic_timer_interrupt+0x141/0x170
2018-09-09 17:14:34 [179203.819515] RSP: 0000:ffff995c1da46200 EFLAGS:
00010082
2018-09-09 17:14:34 [179203.824928] RAX: ffff995c1988ff70 RBX:
0000000001a95e00 RCX: 0000000000000090
2018-09-09 17:14:34 [179203.832146] RDX: 0000000000000000 RSI:
ffff995c1da46200 RDI: ffff995c1988ff70
2018-09-09 17:14:34 [179203.839364] RBP: 00007ffd8b8ba848 R08:
0000000000000c40 R09: 0000000000000031
2018-09-09 17:14:34 [179203.846591] R10: 0000000000000000 R11:
0000000000e72148 R12: 0000000001c4e770
2018-09-09 17:14:34 [179203.853827] R13: 0000000000000007 R14:
00000000011935b0 R15: 0000000000000038
2018-09-09 17:14:34 [179203.861040] FS: 00002ad83f7afa00(0000)
GS:ffff995c1da40000(0000) knlGS:0000000000000000
2018-09-09 17:14:34 [179203.869213] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
2018-09-09 17:14:34 [179203.875042] CR2: 0000000002a18000 CR3:
00000017963f8000 CR4: 00000000007607e0
2018-09-09 17:14:34 [179203.882274] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
2018-09-09 17:14:34 [179203.889495] DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
2018-09-09 17:14:34 [179203.896714] PKRU: 55555554
2018-09-09 17:14:34 [179203.899530] Call Trace:
2018-09-09 17:14:34 [179203.902065] Code: 48 39 cc 77 2f 48 8d 81 00 fe
ff ff 48 39 e0 77 23 57 48 29 e1 65 48 8b 3c 25 78 0e 01 00 48 83 c7 28
48 29 cf 48 89 f8 48 89 e6 <f3> a4 48 89 c4 5f 48 89 e6 65 ff 04 25 60
0e 01 00 65 48 0f 44
2018-09-09 17:14:34 [179203.922628] RIP [<ffffffffbe121791>]
apic_timer_interrupt+0x141/0x170
2018-09-09 17:14:34 [179203.929259] RSP <ffff995c1da46200>
2018-09-09 17:14:34 [179203.933970] ---[ end trace 3912e5e8b3b86da4 ]---
2018-09-09 17:14:34 [179203.984039] Kernel panic - not syncing: Fatal
exception
2018-09-09 17:14:34 [179203.989451] Kernel Offset: 0x3ca00000 from
0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
------------------8< snip snip 8<------------------
------------------8< snip snip 8<------------------
2018-09-07 22:37:16 [201527.171417] simd exception: 0000 [#1] SMP
2018-09-07 22:37:16 [201527.176270] Modules linked in: squashfs loop
8021q garp mrp stp llc nvidia_uvm(POE) nvidia(POE) xfs skx_edac
intel_powerclamp coretemp intel_rapl iosf_mbi mgag200 ttm drm_kms_helper
irqbypass syscopyarea sysfillrect crc32_pclmul sysimgblt iTCO_wdt
fb_sys_fops ib_ucm iTCO_vendor_support ghash_clmulni_intel rdma_ucm
dm_mod dcdbas drm ib_uverbs aesni_intel lrw gf128mul glue_helper
ablk_helper cryptd mei_me sg lpc_ich i2c_i801 shpchp ib_umad mei ipmi_si
ipmi_devintf ipmi_msghandler nfit libnvdimm tpm_crb acpi_pad
acpi_power_meter binfmt_misc overlay(OET) osc(OE) mgc(OE) lustre(OE)
lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ko2iblnd(OE) rdma_cm iw_cm
ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) ib_ipoib ib_cm sd_mod sr_mod
cdrom crc_t10dif crct10dif_generic hfi1 rdmavt i2c_algo_bit ahci
i2c_core crct10dif_pclmul libahci crct10dif_common crc32c_intel ib_core
libata megaraid_sas pps_core libcrc32c [last unloaded: pcspkr]
2018-09-07 22:37:16 [201527.264079] CPU: 17 PID: 32227 Comm:
shuangTwoPhaseE Tainted: P W OE ------------ T
3.10.0-862.9.1.el7.x86_64 #1
2018-09-07 22:37:16 [201527.275789] Hardware name: Dell Inc. PowerEdge
R740/06G98X, BIOS 1.4.8 05/21/2018
2018-09-07 22:37:16 [201527.284045] task: ffff9e345a42eeb0 ti:
ffff9e2f88a0c000 task.ti: ffff9e2f88a0c000
2018-09-07 22:37:16 [201527.292302] RIP: 0010:[<ffffffffa6721791>]
[<ffffffffa6721791>] apic_timer_interrupt+0x141/0x170
2018-09-07 22:37:16 [201527.301978] RSP: 0000:ffff9e345c006200 EFLAGS:
00010082
2018-09-07 22:37:16 [201527.308091] RAX: ffff9e2f88a0ff70 RBX:
00007fffe0d21d78 RCX: 0000000000000090
2018-09-07 22:37:16 [201527.316032] RDX: 0000000000000000 RSI:
ffff9e345c006200 RDI: ffff9e2f88a0ff70
2018-09-07 22:37:16 [201527.323969] RBP: 00007fffe0d21d78 R08:
000000000001e800 R09: 00000000000007a0
2018-09-07 22:37:16 [201527.331906] R10: 0000000000000000 R11:
0000000002818868 R12: 0000000002d20790
2018-09-07 22:37:16 [201527.339839] R13: 00007fffe0d159d0 R14:
00007fffe0d15b40 R15: 00007fffe0d15a20
2018-09-07 22:37:16 [201527.347772] FS: 00002b835d26da00(0000)
GS:ffff9e345c000000(0000) knlGS:0000000000000000
2018-09-07 22:37:16 [201527.356659] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
2018-09-07 22:37:16 [201527.363209] CR2: 0000000003a6ff88 CR3:
0000002fdd8f6000 CR4: 00000000007607e0
2018-09-07 22:37:16 [201527.371144] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
2018-09-07 22:37:16 [201527.379079] DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
2018-09-07 22:37:16 [201527.387010] PKRU: 55555554
2018-09-07 22:37:16 [201527.390523] Call Trace:
2018-09-07 22:37:16 [201527.393780] Code: 48 39 cc 77 2f 48 8d 81 00 fe
ff ff 48 39 e0 77 23 57 48 29 e1 65 48 8b 3c 25 78 0e 01 00 48 83 c7 28
48 29 cf 48 89 f8 48 89 e6 <f3> a4 48 89 c4 5f 48 89 e6 65 ff 04 25 60
0e 01 00 65 48 0f 44
2018-09-07 22:37:16 [201527.415810] RIP [<ffffffffa6721791>]
apic_timer_interrupt+0x141/0x170
2018-09-07 22:37:16 [201527.423189] RSP <ffff9e345c006200>
2018-09-07 22:37:16 [201527.428646] ---[ end trace a6a14aed798e889f ]---
2018-09-07 22:37:17 [201527.477875] Kernel panic - not syncing: Fatal
exception
2018-09-07 22:37:17 [201527.484041] Kernel Offset: 0x25000000 from
0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
------------------8< snip snip 8<------------------
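One thing that can be checked from the logs alone: the two RIP values differ, but each panic also prints its own "Kernel Offset" (the KASLR relocation). A minimal sketch in Python, using only the values printed in the two traces above (the function and variable names are just illustrative), subtracts that offset from each RIP to see whether both panics land on the same unrelocated address:

```python
# Values copied from the two console logs above: (RIP, Kernel Offset).
panics = {
    "2018-09-09": (0xffffffffbe121791, 0x3ca00000),
    "2018-09-07": (0xffffffffa6721791, 0x25000000),
}


def unrelocated(rip: int, kernel_offset: int) -> int:
    """Undo the KASLR relocation by subtracting the reported offset."""
    return rip - kernel_offset


addresses = {day: unrelocated(rip, off) for day, (rip, off) in panics.items()}

for day, addr in addresses.items():
    print(day, hex(addr))
# 2018-09-09 0xffffffff81721791
# 2018-09-07 0xffffffff81721791

# True when every panic maps back to one and the same address.
same_instruction = len(set(addresses.values())) == 1
print("same unrelocated address:", same_instruction)
```

Both come out as 0xffffffff81721791, which matches the identical apic_timer_interrupt+0x141 symbol offset and the identical Code: bytes in the two traces. The marked bytes there, <f3> a4, are the encoding of "rep movsb".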
All the best!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
_______________________________________________
Beowulf mailing list, ***@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf