Chris Samuel
2017-10-26 11:42:51 UTC
Hi folks,
I'm helping another group out and we've found that running an Open MPI
program, even just a singleton, will kill nodes with Mellanox ConnectX-4 and
ConnectX-5 cards using RoCE (the mlx5 driver). The node just locks up hard with
no oops or other diagnostics and has to be power cycled.
Disabling openib/verbs support with:
export OMPI_MCA_btl=tcp,self,vader
stops the crashes, and whilst it's hard to tell, strace seems to imply it hangs
when trying to probe for openib/verbs devices (or shortly after).
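
For anyone wanting to try the same workaround, a minimal sketch follows. Only
the export line comes from this report; the commented-out mpirun forms are
assumed equivalents using Open MPI's --mca option, and ./my_mpi_app is a
hypothetical binary name.

```shell
# Workaround from this report: limit Open MPI to the tcp, self (loopback)
# and vader (shared-memory) BTLs, so the openib BTL is never selected.
export OMPI_MCA_btl=tcp,self,vader

# Assumed per-run equivalent (not from the original report);
# ./my_mpi_app is a hypothetical program name.
# mpirun --mca btl tcp,self,vader ./my_mpi_app

# Assumed alternative: exclude only openib and let Open MPI pick the rest.
# mpirun --mca btl ^openib ./my_mpi_app
```

The environment variable affects every Open MPI job started from that shell,
whereas the --mca form applies to a single run only.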
Nodes with ConnectX-3 cards (mlx4 driver) don't seem to have the issue, and I'm
reasonably convinced this has to be a driver bug, or perhaps a bad interaction
with recent 4.11.x and 4.12.x kernels (they need those for CephFS).
They've got a bug open with Mellanox already but I was wondering if anyone
else had seen anything similar?
cheers!
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545
_______________________________________________
Beowulf mailing list, ***@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/