Discussion: [Beowulf] Killing nodes with Open-MPI?

Chris Samuel
2017-10-26 11:42:51 UTC
Hi folks,

I'm helping another group out and we've found that running an Open-MPI
program, even just a singleton, will kill nodes with Mellanox ConnectX-4 and
ConnectX-5 cards using RoCE (the mlx5 driver). The node just locks up hard with
no Oops or other diagnostics and has to be power cycled.

Disabling openib/verbs support with:

export OMPI_MCA_btl=tcp,self,vader

stops the crashes, and whilst it's hard to tell, strace seems to imply it hangs
when trying to probe for openib/verbs devices (or shortly after).
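
For anyone wanting to apply the same workaround per job rather than via the
environment, the equivalent mpirun form is below (./my_mpi_app is just a
placeholder for the real binary):

# Per-job equivalent of the OMPI_MCA_btl environment variable above:
mpirun --mca btl tcp,self,vader -np 4 ./my_mpi_app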

Nodes with ConnectX-3 cards (mlx4 driver) don't seem to have the issue and I'm
reasonably convinced this has to be a driver bug, or perhaps a bad interaction
with recent 4.11.x and 4.12.x kernels (they need those for CephFS).

They've got a bug open with Mellanox already but I was wondering if anyone
else had seen anything similar?

cheers!
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

Ryan Novosielski
2017-10-26 16:30:29 UTC
Where is this driver from? OS, or OFED, or?

We use primarily MVAPICH2 but I would be curious to try to duplicate this on our mlx5 equipment.

What model cards do you have?
--
____
|| \\UTGERS, |---------------------------*O*---------------------------
||_// the State | Ryan Novosielski - ***@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

Chris Samuel
2017-10-27 04:36:29 UTC
Post by Ryan Novosielski
Where is this driver from? OS, or OFED, or?
OFED 4.1, sorry.
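
In case it helps anyone trying to reproduce this, the usual ways to confirm
which stack a node is actually running are below (standard MLNX_OFED and
kernel tooling, nothing specific to this setup):

# Report the installed Mellanox OFED release (MLNX_OFED installs ofed_info):
ofed_info -s

# Check which mlx5_core module the kernel has loaded:
modinfo mlx5_core | grep -iE '^(filename|version|srcversion)'
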
Post by Ryan Novosielski
We use primarily MVAPICH2 but I would be curious to try to duplicate this on
our mlx5 equipment.
What model cards do you have?
These are MT27710 and MT27800 family cards. I don't have access to the exact
specs, I'm afraid.
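
If someone wants to pin down the exact models on their own kit, something
like the following is usually enough (the board_id/PSID reported by
ibv_devinfo identifies the specific board):

# PCI view of the adapters:
lspci | grep -i mellanox

# Verbs view, including firmware and board ID:
ibv_devinfo | grep -E 'hca_id|fw_ver|board_id'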

thanks!
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

Lance Wilson via Beowulf
2017-10-26 21:58:02 UTC
Hi Chris,
We are running CX4 cards and have had some issues as well. Which version(s)
of Open-MPI are they running?

If you follow the instructions from Mellanox and run with yalla and MXM,
that works(ish) on Open-MPI 1.10.3, including setting the appropriate
environment variables or config file.
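
Roughly what those Mellanox instructions amount to is sketched below; treat
the transport list and the application name as assumptions to adjust for
your particular HPC-X/MXM release:

# Use the MXM-backed yalla PML rather than the openib BTL
# (./my_mpi_app is a placeholder):
mpirun --mca pml yalla -x MXM_TLS=self,shm,ud -np 4 ./my_mpi_app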

If they are running the 2.1 series of Open-MPI, there are some issues with
compiling in the Mellanox drivers.

We haven't seen any hard locks like this but we have seen a whole bundle of
other issues.

Cheers,

Lance
--
Dr Lance Wilson
Characterisation Virtual Laboratory (CVL) Coordinator &
Senior HPC Consultant
Ph: 03 99055942 (+61 3 99055942)
Mobile: 0437414123 (+61 4 3741 4123)
Multi-modal Australian ScienceS Imaging and Visualisation Environment
(www.massive.org.au)
Monash University
Chris Samuel
2017-10-27 04:38:29 UTC
Post by Lance Wilson via Beowulf
We are running CX4 cards and have had some issues as well. Which version/s
of openmpi are they running?
This is with OMPI 1.10.x, 2.0.2 and 3.0.0.

Unfortunately only OMPI 3.0.0 seems compatible with their Slurm install
(17.02.7), as the earlier versions fail with a PMI2-related error message. :-(
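
For reference, the combination that usually matters here is building
Open-MPI against Slurm's PMI libraries and launching through srun's PMI2
plugin; the install prefix below is an assumption to adjust:

# Build against Slurm's PMI headers/libraries (point --with-pmi at
# wherever the Slurm PMI development files live):
./configure --with-slurm --with-pmi=/usr
make && make install

# Launch via Slurm's PMI2 plugin (./my_mpi_app is a placeholder):
srun --mpi=pmi2 -n 4 ./my_mpi_app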

cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

Christopher Samuel
2017-11-05 23:52:06 UTC
Post by Chris Samuel
I'm helping another group out and we've found that running an Open-MPI
program, even just a singleton, will kill nodes with Mellanox ConnectX 4 and 5
cards using RoCE (the mlx5 driver). The node just locks up hard with no OOPS
or other diagnostics and has to be power cycled.
It was indeed a driver bug, and is now fixed in Mellanox OFED 4.2 (which
came out a few days ago).

cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545
