Discussion:
[Beowulf] slow mpi init/finalize
Michael Di Domenico
2017-10-11 14:12:02 UTC
Permalink
i'm seeing issues on a mellanox fdr10 cluster where the mpi setup and
teardown takes longer than i expect it should on larger rank count
jobs. i'm only trying to run ~1000 ranks and the startup time is over
a minute. i tested this with both openmpi and intel mpi; both exhibit
close to the same behavior.

has anyone else seen this or might know how to fix it? i expect ~1000
ranks to take some time to set up, but it seems to be taking longer than
i think it should.
Christopher Samuel
2017-10-15 23:08:49 UTC
Permalink
Post by Michael Di Domenico
i'm seeing issues on a mellanox fdr10 cluster where the mpi setup and
teardown takes longer than i expect it should on larger rank count
jobs. i'm only trying to run ~1000 ranks and the startup time is over
a minute. i tested this with both openmpi and intel mpi; both exhibit
close to the same behavior.
What wire-up protocol are you using for your MPI in your batch system?

With Slurm at least you should be looking at using PMIx or PMI2 (PMIx
needs Slurm to be compiled against it as an external library, PMI2 is a
contrib plugin in the source tree).
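
As a quick sanity check (assuming a reasonably recent Slurm; the
application name below is just a placeholder), you can list the wire-up
plugins your srun knows about and then pick one explicitly:

srun --mpi=list
srun --mpi=pmi2 ./my_mpi_app    # or --mpi=pmix if Slurm was built with PMIx

Note that Open-MPI itself also needs to be built against the matching
PMI/PMIx library for those to work.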

Hope that helps..
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

Peter Kjellström
2017-10-16 11:16:12 UTC
Permalink
On Wed, 11 Oct 2017 10:12:02 -0400
Post by Michael Di Domenico
i'm seeing issues on a mellanox fdr10 cluster where the mpi setup and
teardown takes longer than i expect it should on larger rank count
jobs. i'm only trying to run ~1000 ranks and the startup time is over
a minute. i tested this with both openmpi and intel mpi; both exhibit
close to the same behavior.
First, that performance is neither expected nor good. It should be sub-1s
for 1000 ranks or so, YMMV...

One possibility is that some slow and/or flaky tcp/ip/eth path is
involved somehow.

Another is that your MPIs tried to use rdmacm and that in turn tried to
use ibacm which, if incorrectly set up, times out after ~1m. You can
verify ibacm functionality by running, for example:

***@n1 $ ib_acme -d n2
...
***@n1 $

This should be near instant if ibacm works as it should.
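
To put a number on it you can simply wrap the same call in time
(nothing special assumed here, just the standard shell built-in):

***@n1 $ time ib_acme -d n2

With a working ibacm that should come back in well under a second;
a misconfigured one typically sits there until the ~1m timeout.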

If you use IntelMPI (which uses dapl by default), edit your dat.conf or
manually select the ucm dapl provider. This is fast and does not use
rdmacm.
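
For reference, the ucm entry in dat.conf looks something like this
(a sketch only; the exact library name/version depends on your dapl
packages, and the rdmacm-based providers point at a different library):

ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""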

Good luck,
Peter K
Post by Michael Di Domenico
has anyone else seen this or might know how to fix it? i expect ~1000
ranks to take some time to set up, but it seems to be taking longer than
i think it should.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Michael Di Domenico
2017-10-16 17:11:37 UTC
Permalink
Post by Peter Kjellström
Another is that your MPIs tried to use rdmacm and that in turn tried to
use ibacm which, if incorrectly set up, times out after ~1m. You can
...
This should be near instant if ibacm works as it should.
i didn't specifically tell mpi to use one connection setup vs another,
but i'll see if i can track down what openmpi is doing in that regard.

however, your test above fails on my machines

***@n1# ib_acme -d n3
service: localhost
destination: n3
ib_acm_resolve_ip failed: cannot assign requested address
return status 0x0

the /etc/rdma/ibacm_addr.cfg file just lists the data specific
to each host, which is gathered by ib_acme -A
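
for reference i generated it with something like the following (iirc -A
writes the address file and -O the options file; treat the exact flags
as approximate):

***@n1# ib_acme -A -O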

truthfully i never configured it, i thought it just "worked" on its own,
but perhaps not. i'll have to google some
Peter Kjellström
2017-10-17 12:54:14 UTC
Permalink
On Mon, 16 Oct 2017 13:11:37 -0400
Post by Michael Di Domenico
Post by Peter Kjellström
Another is that your MPIs tried to use rdmacm and that in turn
tried to use ibacm which, if incorrectly set up, times out after
...
This should be near instant if ibacm works as it should.
i didn't specifically tell mpi to use one connection setup vs another,
but i'll see if i can track down what openmpi is doing in that regard.
however, your test above fails on my machines
service: localhost
destination: n3
ib_acm_resolve_ip failed: cannot assign requested address
return status 0x0
Did this fail instantly or with the typical ~1m timeout?
Post by Michael Di Domenico
the /etc/rdma/ibacm_addr.cfg file just lists the data specific
to each host, which is gathered by ib_acme -A
Often you don't need ibacm running, and if you stop it this specific
problem will go away (i.e. nothing can ask ibacm for lookups and hang on
the timeout). The service is typically /etc/init.d/ibacm. What will happen
then if something uses librdmacm for lookups is that it results in a
direct query to the SA (part of the subnet manager). On a larger
cluster and for certain use cases this can quickly become too much
(hence the need for caching).
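
On a sysvinit-style box that would be something like (a sketch; adjust
if your distro manages it with systemd instead):

/etc/init.d/ibacm stop
chkconfig ibacm off    # keep it from coming back at boot

or the systemctl stop/disable equivalents.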

If you have IntelMPI, also try what I suggested and use the ucm dapl
provider. For example, for the first port on an mlx4 hca that's
"ofa-v2-mlx4_0-1u".

You can make sure that it comes first in your dat.conf (/etc/rdma
or /etc/infiniband) or pass it explicitly to IntelMPI:

I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u mpiexec.hydra ...

You may want to set I_MPI_DEBUG=4 or so to see what it does.
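
So a quick test run could look something like this (the rank count and
the IMB binary name are just examples):

I_MPI_DEBUG=4 I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u mpiexec.hydra -n 1024 ./IMB-MPI1 alltoallv

and check the provider lines printed during startup.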

/Peter K
Michael Di Domenico
2017-10-17 13:51:43 UTC
Permalink
Post by Peter Kjellström
Post by Michael Di Domenico
however, your test above fails on my machines
service: localhost
destination: n3
ib_acm_resolve_ip failed: cannot assign requested address
return status 0x0
Did this fail instantly or with the typical ~1m timeout?
it fails instantly.
Post by Peter Kjellström
If you have IntelMPI also try what I suggested and use the ucm dapl.
For example for the first port on an mlx4 hca that's "ofa-v2-mlx4_0-1u".
You can make sure that it comes first in your dat.conf (/etc/rdma
I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u mpiexec.hydra ...
You may want to set I_MPI_DEBUG=4 or so to see what it does.
i'll give this a whirl today hopefully
Peter Kjellström
2017-10-17 14:12:22 UTC
Permalink
On Tue, 17 Oct 2017 09:51:43 -0400
Post by Michael Di Domenico
Post by Peter Kjellström
Post by Michael Di Domenico
however, your test above fails on my machines
service: localhost
destination: n3
ib_acm_resolve_ip failed: cannot assign requested address
return status 0x0
Did this fail instantly or with the typical ~1m timeout?
it fails instantly.
Then this is probably not the problem.

Also, I noted that you provided some contradictory data in a post to
the openfabrics users list. The output there included references to
qib0 (truescale infiniband), while in this thread you started by saying
FDR10 (which is only available on mellanox infiniband).

The two situations are quite different wrt. the MPI protocol stack and,
as such, the debugging.

On truescale IntelMPI may run on tmi, which runs on psm (as opposed to
IntelMPI->dapl->daploucm).

/Peter K
Michael Di Domenico
2017-10-17 14:36:28 UTC
Permalink
Post by Peter Kjellström
Also, I noted that you provided some contradictory data in a post to
the openfabrics users list. The output there included references to
qib0 (truescale infiniband), while in this thread you started by saying
FDR10 (which is only available on mellanox infiniband).
yes, i used another cluster to provide the example for the ibacm
problems to the other ofed list. i see the exact same results in the
ibacm tests whether it's mellanox or truescale.
Michael Di Domenico
2017-10-17 14:59:41 UTC
Permalink
Post by Peter Kjellström
If you have IntelMPI also try what I suggested and use the ucm dapl.
For example for the first port on an mlx4 hca that's "ofa-v2-mlx4_0-1u".
You can make sure that it comes first in your dat.conf (/etc/rdma
I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u mpiexec.hydra ...
You may want to set I_MPI_DEBUG=4 or so to see what it does.
i can confirm that the dapl test with intelmpi is pretty speedy.

when i start up an mpi job without dapl enabled it takes ~60 seconds
before the test actually starts; with dapl enabled it's only a few
seconds. and the t_avg timings in the imb alltoallv i'm running are
vastly different.

i think i can safely say at this point it's probably not hardware
related, but something went wonky with openmpi. i downloaded the new
version 3 that was released, i'll see if that fixes anything. i've
been tracking reports on the openmpi list about issues between slurm
and openmpi in relation to pmi; i'm not sure if it's related or not,
but it might be.
Peter Kjellström
2017-10-17 16:01:04 UTC
Permalink
On Tue, 17 Oct 2017 10:59:41 -0400
Michael Di Domenico <***@gmail.com> wrote:
...
Post by Michael Di Domenico
Post by Peter Kjellström
I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u mpiexec.hydra ...
You may want to set I_MPI_DEBUG=4 or so to see what it does.
i can confirm that the dapl test with intelmpi is pretty speedy.
It may be interesting to see what it picks by default (compare output
with I_MPI_DEBUG)..
Post by Michael Di Domenico
when i start up an mpi job without dapl enabled it takes ~60 seconds
I think you mean using the default dapl provider (vs. the specific ucm
provider I suggested). IntelMPI should default to dapl on Mellanox
regardless of version, I think (unless possibly your IntelMPI is very
new and you have a libfabric version installed...).
Post by Michael Di Domenico
before the test actually starts; with dapl enabled it's only a few
seconds.
That is still very slow. For reference I timed 1024 rank startup on one
of our systems with IntelMPI and dapl on ucm and it's a bit below 0.5s
depending on how you time it (some amount of lazy init is happening).
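
A crude but easy way to time mostly init/finalize (illustrative only;
any tiny MPI program works as the payload):

time mpiexec.hydra -n 1024 ./IMB-MPI1 barrier -iter 1

With only one iteration almost all of the wall time is setup and
teardown rather than the benchmark itself.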

If I force IntelMPI on that system to run using verbs,
I_MPI_FABRICS=ofa, then that startup takes 5 seconds (~10x slower).

I have not tested a dapl provider using rdmacm as that would require me
to change our system dat.conf I think..

Either way, with 60s time scales and ibacm so broken that it fails
instantly, I suspect you have some hostname/dns/tcp-ip-on-eth or other
fundamental problem somewhere.

/Peter K
Michael Di Domenico
2017-10-17 16:17:51 UTC
Permalink
Post by Peter Kjellström
That is still very slow. For reference I timed 1024 rank startup on one
of our systems with IntelMPI and dapl on ucm and it's a bit below 0.5s
depending on how you time it (some amount of lazy init is happening).
i didn't specifically time it, so my "few seconds" might be in line
with your 0.5 seconds.
Post by Peter Kjellström
Either way, with 60s time scales and ibacm so broken that it fails
instantly, I suspect you have some hostname/dns/tcp-ip-on-eth or other
fundamental problem somewhere.
it's certainly possible. unfortunately the documentation is lacking,
no one on the ofa list wants to help, and i don't have time to
trawl through source code to figure out what's going on. at some
point i'll figure it out.

but clearly something is wonky; at least i can set aside the hardware
aspect for now.

thanks
Christopher Samuel
2017-10-17 22:53:04 UTC
Permalink
Post by Michael Di Domenico
i think i can safely say at this point it's probably not hardware
related, but something went wonky with openmpi. i downloaded the new
version 3 that was released, i'll see if that fixes anything.
You are building Open-MPI with the config option:

--with-verbs

to get it to enable IB support?

cheers,
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

Michael Di Domenico
2017-10-18 11:41:55 UTC
Permalink
On Tue, Oct 17, 2017 at 6:53 PM, Christopher Samuel
Post by Christopher Samuel
Post by Michael Di Domenico
i think i can safely say at this point it's probably not hardware
related, but something went wonky with openmpi. i downloaded the new
version 3 that was released, i'll see if that fixes anything.
--with-verbs
to get it to enable IB support?
i'm not specifically adding that line to my configure statement.
Chris Samuel
2017-10-18 13:00:40 UTC
Permalink
Post by Michael Di Domenico
On Tue, Oct 17, 2017 at 6:53 PM, Christopher Samuel
Post by Christopher Samuel
--with-verbs
to get it to enable IB support?
i'm not specifically adding that line to my configure statement.
In that case you're unlikely to be enabling Infiniband support, so you'll need
to pass that through to enable it (and you'll need to have the appropriate IB
development packages installed too).
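
A typical configure invocation would be something along these lines
(the prefix and the Slurm/PMI bits are just examples, adjust for your
site):

./configure --prefix=/apps/openmpi/3.0.0 --with-verbs --with-slurm --with-pmi

and then check config.log afterwards to confirm the openib support was
actually detected and built.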

What does this say?

ompi_info | fgrep openib

Best of luck,
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

Michael Di Domenico
2017-10-18 13:17:53 UTC
Permalink
Post by Chris Samuel
In that case you're unlikely to be enabling Infiniband support, so you'll need
to pass that through to enable it (and you'll need to have the appropriate IB
development packages installed too).
What does this say?
ompi_info | fgrep openib
MCA btl: openib (MCA v2.1.0, API v3.0.0, Component v2.0.2)

I'm pretty sure openib is getting pulled in by default during the
configure steps. i'll have to go back and check the config.log files
to be sure. i usually install all the required devel packages and
it's definitely using RDMA for the message passing.
Peter Kjellström
2017-10-18 13:40:11 UTC
Permalink
On Wed, 18 Oct 2017 09:17:53 -0400
Post by Michael Di Domenico
Post by Chris Samuel
In that case you're unlikely to be enabling Infiniband support, so
you'll need to pass that through to enable it (and you'll need to
have the appropriate IB development packages installed too).
What does this say?
ompi_info | fgrep openib
MCA btl: openib (MCA v2.1.0, API v3.0.0, Component v2.0.2)
I'm pretty sure openib is getting pulled in by default during the
configure steps. i'll have to go back and check the config.log files
to be sure. i usually install all the required devel packages and
it's definitely using RDMA for the message passing.
Yes, not specifying is essentially:

"build if required devel pkgs available else silently skip"

/Peter K