Discussion:
[Beowulf] [OT] HPCWire marketing credit
C Bergström
2017-09-18 21:33:27 UTC
Hi

A few years ago PathScale had excess cash and pre-paid $30k to Tabor for
some marketing. We ended up not executing on the plan as expected, and the
credit still remains. If anyone is interested in all or part of it, message
me ASAP. I'm willing to let it go for 20-33% of its dollar value. SC17 is
just around the corner if you're planning to promote something.

(sorry non-technical)

./C
Faraz Hussain
2017-09-19 15:27:55 UTC
I have never understood what these acronyms mean. I've been involved
with HPC on the applications side for many years and hear these terms
pop up now and then. I've read through the Wikipedia pages but still
do not understand them. Can someone give a very high-level overview of
what they are and how they relate to HPC?

Specifically, how do they relate to a simple hello-world MPI program
like this:

http://mpitutorial.com/tutorials/mpi-hello-world/

Peter St. John
2017-09-19 16:19:56 UTC
These days I just use wiki to look up acronyms, e.g.
RDMA == https://en.wikipedia.org/wiki/Remote_direct_memory_access

For a common English word like "verbs" it would help to see the context.
Beware of vendors making stuff up to sound cool.

Peter
Post by Faraz Hussain
I have never understood what these acronyms mean. I've been involved with
HPC on the applications side for many years and hear these terms pop up now
and then. I've read through the Wikipedia pages but still do not understand
them. Can someone give a very high-level overview of what they are and how
they relate to HPC?
Specifically, how do they relate to a simple hello-world MPI program like
http://mpitutorial.com/tutorials/mpi-hello-world/
Peter Kjellström
2017-09-19 16:24:22 UTC
On Tue, 19 Sep 2017 09:27:55 -0600
Post by Faraz Hussain
I have never understood what these acronyms are. I've been involved
with HPC on the applications side for many years and hear these
terms pop up now and then. I've read through the wikipedia pages but
still do not understand what they mean. Can someone give a very high
level overview of what they are and how they relate to HPC?
rdma: remote direct memory access, a way to move data directly between
the memory of one node and another without involving the remote CPU.
Supported by, for example, Infiniband.

ofed: a software distribution of network drivers, libraries, utilities
typically used by users/applications to run on Infiniband (and other
networks supported by ofed). In short: infiniband drivers

verbs: the low-level API for driving Infiniband hardware at "native
performance" (a small sketch of what it looks like follows below). For
example an MPI library can use verbs to efficiently transmit data over
Infiniband. Other interfaces and libraries layered over Infiniband
these days include UCX, MXM, OFI, DAPL, rsockets, ...

psm: a low-level protocol (Performance Scaled Messaging) that runs
specifically on PathScale/TrueScale-class Infiniband hardware. On this
hardware psm is vastly more efficient than, for example, verbs. On the
newer Omni-Path/OPA fabric, psm2 is used in a similar fashion.
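
To make "verbs" a bit more concrete, here is a minimal sketch (an
illustration only, not production code) of what verbs-level code looks
like; it just enumerates the RDMA-capable devices with libibverbs. An MPI
library makes these same kinds of calls, plus queue-pair setup and memory
registration, under the hood:

/* Minimal sketch: list RDMA devices via the verbs API (libibverbs).
 * Build with something like: gcc list_ib.c -o list_ib -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found (is the driver stack loaded?)\n");
        return 1;
    }
    for (int i = 0; i < num; i++)
        printf("device %d: %s\n", i, ibv_get_device_name(devs[i]));
    ibv_free_device_list(devs);
    return 0;
}
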
Post by Faraz Hussain
Specifically, how does it relate to a simple hello world mpi program
http://mpitutorial.com/tutorials/mpi-hello-world/
On an Infiniband cluster, when using multiple nodes, the MPI library
can use verbs or psm to send data efficiently over the network. The
verbs and psm drivers may come from an installation of ofed on these
nodes, or from the OS's built-in Infiniband support.
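
For reference, the hello-world from that tutorial boils down to roughly
the following (paraphrased from memory). Note that nothing in it mentions
verbs, psm or ofed; the MPI library picks the transport underneath:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* The MPI library decides here (and in the calls below) how to talk
     * to the other ranks: verbs, psm, TCP, shared memory, ... */
    MPI_Init(&argc, &argv);

    int world_size, world_rank, name_len;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}

Compile with mpicc and launch with mpirun (e.g. "mpirun -n 4 ./hello");
the same source runs over whichever network the MPI library supports.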

Different MPI implementations (such as Open MPI, Intel MPI, MVAPICH2,
etc.) make different choices as to which protocols to use, depending on
many things (including which headers and libraries were available at
compile time).

Cheers,
Peter
Christopher Samuel
2017-09-19 23:54:28 UTC
Great explanations Peter.
Post by Peter Kjellström
ofed: a software distribution of network drivers, libraries,
utilities typically used by users/applications to run on Infiniband
(and other networks supported by ofed).
To expand on that slightly, this also includes (to add to the acronym
soup) RoCE - RDMA over Converged Ethernet - in other words using
Ethernet networks (with appropriate switches) to do the sort of RDMA
that you can do over Infiniband.

This is important because, unlike Infiniband, you can't do RoCE out of the
box with something like RHEL7 (at least in the experience of the folks I'm
helping out here).

cheers,
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

Lux, Jim (337C)
2017-10-10 01:09:48 UTC
RDMA - shared memory over a communications link - so it's not *really* dual-ported memory but, rather, a simulation of it, with a (virtual) memory space on each side and some magic in the infrastructure that makes it seem like simultaneous access to the same physical memory. I suppose there's really no need for the physical memory to correspond to the virtual memory, at least not in a one-to-one mapping sense.

There are, of course, all kinds of concurrency issues around ensuring that both sides see the same data - much like ensuring per-processor cache consistency on a multiprocessor machine.

You'll also see the term reflective memory.
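
To put the "memory space on each side" idea in concrete terms, here's a
rough back-of-the-envelope sketch (verbs API, illustration only) of the
registration step that exposes a local buffer to the NIC for RDMA;
queue-pair setup, exchanging the rkey with the peer, and the actual RDMA
read/write work requests are all omitted:

#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);        /* protection domain */
    if (!pd) { fprintf(stderr, "ibv_alloc_pd failed\n"); return 1; }

    size_t len = 4096;
    void *buf = malloc(len);
    /* Pin the buffer and hand it to the NIC; the returned rkey is what a
     * remote peer would use to read/write it without involving our CPU. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    printf("registered %zu bytes: lkey=0x%x rkey=0x%x\n",
           len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}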

Jim Lux
(818)354-2075 (office)
(818)395-2714 (cell)

-----Original Message-----
From: Beowulf [mailto:beowulf-***@beowulf.org] On Behalf Of Christopher Samuel
Sent: Tuesday, September 19, 2017 4:54 PM
To: ***@beowulf.org
Subject: Re: [Beowulf] What is rdma, ofed, verbs, psm etc?

Faraz Hussain
2017-09-20 15:03:28 UTC
Thanks Peter for the high-level overview! A few follow-up questions.
What if I am using a non-Infiniband cluster, i.e. something with
10GigE? Or even slower: at home I have a Raspberry Pi cluster with
100 Mbps Ethernet. Are ofed/psm/verbs all irrelevant there? If so, what
would their equivalents be? I assume RDMA is still applicable, since I
can run Open MPI on these clusters.

Another question: who is typically responsible for tuning
ofed/psm/verbs etc. on an Infiniband cluster? Is it generally the
vendor who builds the cluster or the sysadmin? My role has always been
more user-facing application support, but I am wondering how much time
I should invest in learning the inner workings of ofed/psm/verbs etc.?

Christopher Samuel
2017-09-21 02:09:43 UTC
Post by Faraz Hussain
Thanks Peter for the high-level overview! A few follow-up questions. What
if I am using a non-Infiniband cluster, i.e. something with 10GigE? Or
even slower: at home I have a Raspberry Pi cluster with 100 Mbps
Ethernet. Are ofed/psm/verbs all irrelevant there?
Pretty much, yes, unless you've got fancy switches that can do RoCE.
Post by Faraz Hussain
If so, what would their equivalents be? I assume RDMA is still applicable,
since I can run Open MPI on these clusters.
No, Open-MPI will be using TCP/IP for communications on those, so you'll
pay the extra latency overhead for that.
Post by Faraz Hussain
Another question: who is typically responsible for tuning ofed/psm/verbs
etc. on an Infiniband cluster? Is it generally the vendor who builds the
cluster or the sysadmin?
It depends on the site & the install I suspect. We do all the OS
installs on our systems and so we (the sysadmin team) get to deal with that.
Post by Faraz Hussain
My role has always been more user-facing application support, but I am
wondering how much time I should invest in learning the inner workings of
ofed/psm/verbs etc.?
If you have the gear that can use it, then it is worth understanding the
basics; even just doing some performance comparisons can be educational.
But of course you have to have the gear in the first place!

Best of luck,
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

Jon Tegner
2017-09-21 05:02:58 UTC
What about RoCE? Is this something that is commonly used (I would guess
no since I have not found much)? Are there other protocols that are
worth considering (like "gamma" which doesn't seem to be developed anymore)?

My impression is that with RoCE you have to use specialized hardware
(unlike gamma - where one could use standard hardware, and still get a
noticeable improvement in latency)?

Thoughts?

/jon
Post by Christopher Samuel
Thanks Peter for the high level overview! A few followup questions. What
if I am using a non-Infiniband cluster, i.e something with 10gigE.  Or
even slower like at my home I have a raspbery pi cluster with 100 Mbps
ethernet. Is ofed/psm/verbs all irrelevant?
Pretty much, yes, unless you've got fancy switches that can do RoCE.
Douglas Eadline
2017-09-21 15:28:12 UTC
Post by Jon Tegner
What about RoCE? Is this something that is commonly used (I would guess
no since I have not found much)? Are there other protocols that are
worth considering (like "gamma" which doesn't seem to be developed anymore)?
Gamma has not been around for years. There was also Open-MX:

http://open-mx.gforge.inria.fr/

But the project stopped in 2012.

I think the main reason it stopped was that IB is the choice of most
clusters, and most 10G NICs provide low latency with default TCP/IP
(less than 10 us in most cases).

--
Doug
Post by Jon Tegner
My impression is that with RoCE you have to use specialized hardware
(unlike gamma - where one could use standard hardware, and still get a
noticeable improvement in latency)?
Thoughts?
/jon
Post by Christopher Samuel
Thanks Peter for the high level overview! A few followup questions. What
if I am using a non-Infiniband cluster, i.e something with 10gigE.  Or
even slower like at my home I have a raspbery pi cluster with 100 Mbps
ethernet. Is ofed/psm/verbs all irrelevant?
Pretty much, yes, unless you've got fancy switches that can do RoCE.
Jon Tegner
2017-09-21 16:09:31 UTC
What kind of latency can one expect using RoCE/10G?
Post by Douglas Eadline
I think the main reason it stopped was that IB is the
choice of most clusters and most 10G nics provide
low latency with default tcp/ip (less than 10us in most cases).
Alex Chekholko
2017-09-21 18:14:22 UTC
I don't know about RoCE, but here is a plain 10G ICMP ECHO round trip:

# ping n0030
PING n0030.localdomain (192.168.0.204) 56(84) bytes of data.
64 bytes from n0030.localdomain (192.168.0.204): icmp_seq=1 ttl=64 time=0.141 ms
64 bytes from n0030.localdomain (192.168.0.204): icmp_seq=2 ttl=64 time=0.139 ms
64 bytes from n0030.localdomain (192.168.0.204): icmp_seq=3 ttl=64 time=0.117 ms
64 bytes from n0030.localdomain (192.168.0.204): icmp_seq=4 ttl=64 time=0.115 ms
64 bytes from n0030.localdomain (192.168.0.204): icmp_seq=5 ttl=64 time=0.115 ms
64 bytes from n0030.localdomain (192.168.0.204): icmp_seq=6 ttl=64 time=0.107 ms
64 bytes from n0030.localdomain (192.168.0.204): icmp_seq=7 ttl=64 time=0.142 ms
^C
--- n0030.localdomain ping statistics ---
7 packets transmitted, 7 received, 0% packet loss, time 6000ms
rtt min/avg/max/mdev = 0.107/0.125/0.142/0.015 ms

YMMV
Post by Jon Tegner
What kind of latency can one expect using RoCE/10G?
Post by Douglas Eadline
I think the main reason it stopped was that IB is the
choice of most clusters and most 10G nics provide
low latency with default tcp/ip (less than 10us in most cases).
Christopher Samuel
2017-09-23 20:32:35 UTC
What does a 0-byte MPI ping-pong look like?

From memory (I'm at Berkeley at the moment), with RoCE and Mellanox
100GigE switches you get slightly better (lower) latency than our
circa-2013 FDR14 Infiniband cluster.
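
For anyone who wants to try it, a 0-byte ping-pong is only a few lines of
MPI; a rough sketch from memory is below (OSU's osu_latency or the Intel
MPI Benchmarks will do a more careful job):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 10000;
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* half the average round trip = one-way latency */
        printf("avg one-way latency: %.2f us\n",
               (t1 - t0) / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}

Run it with two ranks on two different nodes (e.g. "mpirun -n 2 -H
nodeA,nodeB ./pingpong") so the messages actually cross the network.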

All the best,
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

Brice Goglin
2017-09-21 19:45:25 UTC
Post by Douglas Eadline
Post by Jon Tegner
What about RoCE? Is this something that is commonly used (I would guess
no since I have not found much)? Are there other protocols that are
worth considering (like "gamma" which doesn't seem to be developed anymore)?
Gamma has not been around for years. There was open-mx
http://open-mx.gforge.inria.fr/
But, the project stopped in 2012.
I think the main reason it stopped was that IB is the
choice of most clusters and most 10G nics provide
low latency with default tcp/ip (less than 10us in most cases).
No, the reason it stopped is that I didn't have much research to do
with it anymore. I added support for newer Linux kernels until late
2015, but there hasn't been any interesting work on it since 2011.

Another reason is that copy offload became less interesting in hardware
(Intel didn't improve it as much as the rest of the memory subsystem)
and also became difficult to use in recent kernels.


Regarding latency, most NICs had fairly dumb interrupt coalescing a
couple of years ago. You wouldn't get less than 10 us on a ping-pong
(which requires no coalescing) without killing the stream bandwidth
(which requires coalescing, unless you can waste CPU cycles). I haven't
checked recently.

Brice

John Hearns via Beowulf
2017-09-24 20:30:57 UTC
Jon,
RoCE is commonly used. We run GPFS over RoCE, and plenty of other sites do
also.

To answer the question of what network RoCE needs: I guess you could run it
on a 1 Gbps network with office-grade network switches.
What it really needs is a lossless network. Dare I say the Mellanox word....

I think you would find RoCE is a lot more prevalent than you would think...
I guess we should bring in GPUDirect and NVMe over Fabrics here.
Google finds this website: http://www.roceinitiative.org/
Post by Jon Tegner
What about RoCE? Is this something that is commonly used (I would guess no
since I have not found much)? Are there other protocols that are worth
considering (like "gamma" which doesn't seem to be developed anymore)?
My impression is that with RoCE you have to use specialized hardware
(unlike gamma - where one could use standard hardware, and still get a
noticeable improvement in latency)?
Thoughts?
/jon
Thanks Peter for the high level overview! A few followup questions. What
if I am using a non-Infiniband cluster, i.e something with 10gigE. Or
even slower like at my home I have a raspbery pi cluster with 100 Mbps
ethernet. Is ofed/psm/verbs all irrelevant?
Pretty much, yes, unless you've got fancy switches that can do RoCE.
Prentice Bisbal
2018-02-19 15:56:17 UTC
Post by Peter Kjellström
On Tue, 19 Sep 2017 09:27:55 -0600
Post by Faraz Hussain
I have never understood what these acronyms are. I've been involved
with HPC on the applications side for many years and hear these
terms pop up now and then. I've read through the wikipedia pages but
still do not understand what they mean. Can someone give a very high
level overview of what they are and how they relate to HPC?
rdma: remote direct memory access, a way to move data from one node to
another. Supported by for example Infiniband.
ofed: a software distribution of network drivers, libraries, utilities
typically used by users/applications to run on Infiniband (and other
networks supported by ofed). In short: infiniband drivers
OFED = OpenFabrics Enterprise Distribution.

https://www.openfabrics.org/index.php/openfabrics-software.html
Elken, Tom
2018-02-20 17:03:11 UTC
-----Original Message-----
Post by Peter Kjellström
On Tue, 19 Sep 2017 09:27:55 -0600
OFED = OpenFabrics Enterprise Distribution.

https://www.openfabrics.org/index.php/openfabrics-software.html

The above is a good overview, which didn't mention PSM.

The following, on slide 3, shows how a lot of these pieces fit together:
https://www.openfabrics.org/images/eventpresos/2016presentations/304PSM2Features.pdf

-Tom