[Beowulf] How to know if infiniband network works?
Faraz Hussain
2017-08-02 16:44:17 UTC
I have inherited a 20-node cluster that supposedly has an InfiniBand
network. I am testing some MPI applications and am seeing no
performance improvement with multiple nodes. So I am wondering if the
InfiniBand network even works?

The output of ifconfig -a shows an ib0 and ib1 network. I ran ethtool
ib0 and it shows:

Speed: 40000Mb/s
Link detected: no

and for ib1 it shows:

Speed: 10000Mb/s
Link detected: no

I am assuming this means it is down? Any idea how to debug further and
restart it?

Thanks!

Joe Landman
2017-08-02 16:50:24 UTC
start with

ibv_devinfo

ibstat

ibstatus


and see what (if anything) they report.
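If those utilities aren't installed, the same information is exposed
in sysfs; a quick sketch (standard OFED sysfs layout):

  cat /sys/class/infiniband/*/ports/*/state   # e.g. "4: ACTIVE"
  cat /sys/class/infiniband/*/ports/*/rate    # e.g. "40 Gb/sec (4X FDR10)"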

Second, how did you compile/run your MPI code?
--
Joe Landman
e: ***@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

Faraz Hussain
2017-08-02 17:50:06 UTC
Thanks Joe. Here is the output from the commands you suggested. We
have Open MPI built with the Intel compilers. Is there some benchmark
code I can compile so that we are all comparing the same code?

[***@lustwzb4 test]$ ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.11.550
node_guid: f452:1403:0016:3b70
sys_image_guid: f452:1403:0016:3b73
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id: DEL0A40000028
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 3
port_lmc: 0x00
link_layer: InfiniBand

port: 2
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: InfiniBand

[***@lustwzb4 test]$ ibstat
CA 'mlx4_0'
CA type: MT4099
Number of ports: 2
Firmware version: 2.11.550
Hardware version: 0
Node GUID: 0xf452140300163b70
System image GUID: 0xf452140300163b73
Port 1:
State: Active
Physical state: LinkUp
Rate: 40 (FDR10)
Base lid: 3
LMC: 0
SM lid: 1
Capability mask: 0x02514868
Port GUID: 0xf452140300163b71
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0xf452140300163b72
Link layer: InfiniBand

[***@lustwzb4 test]$ ibstatus
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:f452:1403:0016:3b71
base lid: 0x3
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X FDR10)
link_layer: InfiniBand

Infiniband device 'mlx4_0' port 2 status:
default gid: fe80:0000:0000:0000:f452:1403:0016:3b72
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 10 Gb/sec (4X)
link_layer: InfiniBand
Joe Landman
2017-08-02 18:37:00 UTC
Post by Faraz Hussain
Thanks Joe. Here is the output from the commands you suggested. We
have open mpi built from Intel mpi compiler. Is there some benchmark
code I can compile so that we are all comparing the same code?
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.11.550
node_guid: f452:1403:0016:3b70
sys_image_guid: f452:1403:0016:3b73
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id: DEL0A40000028
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
Port 1 on the machine is up. This is the link-level activity that the
subnet manager (OpenSM or a switch-level version) enables.

For Open MPI, my recollection is that it expects the IB ports to have
IPoIB (IP) addresses as well (and will switch to RDMA after initialization).

What does

ifconfig -a

report?
Jeff Johnson
2017-08-02 23:29:05 UTC
Faraz,

You can test your point to point rdma bandwidth as well.

On host lustwz99 run `qperf`
On any of the hosts lustwzb1-16 run `qperf lustwz99 -t 30 rc_lat rc_bi_bw`

Establish that you can pass traffic at expected speeds before going to the
ipoib portion.

Also make sure that all of your nodes are running in the same mode,
connected or datagram, and that your MTU is the same on all nodes for
that IP interface.
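
A quick per-node sanity check (a sketch; the sysfs path is the
standard IPoIB one):

  cat /sys/class/net/ib0/mode   # prints "datagram" or "connected"
  ip link show ib0              # the reported mtu should match everywhere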

--Jeff
--
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

***@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001 f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite D - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
Faraz Hussain
2017-08-03 13:21:16 UTC
I ran the qperf command between two compute nodes ( b4 and b5 ) and got:

[***@lustwzb5 ~]$ qperf lustwzb4 -t 30 rc_lat rc_bi_bw
rc_lat:
    latency  =  7.73 us
rc_bi_bw:
    bw  =  9.06 GB/sec

If I understand correctly, I would need to enable IPoIB and then rerun
the test? It would then show ~40 Gb/sec, I assume.
Joe Landman
2017-08-03 13:26:13 UTC
Post by Faraz Hussain
rc_lat:
    latency  =  7.73 us
rc_bi_bw:
    bw  =  9.06 GB/sec
If I understand correctly, I would need to enable IPoIB and then rerun
the test? It would then show ~40 Gb/sec, I assume.
No. 9 GB/s is about 72 Gb/s, so InfiniBand is working. Looks like you
might have a dual-rail IB setup, or you were doing a bidirectional/full-duplex
test (rc_bi_bw measures both directions at once).
Gus Correa
2017-08-02 16:58:07 UTC
Hi Faraz

1) lsmod | grep ib should show if the InfiniBand kernel modules are loaded.

2) InfiniBand normally uses remote DMA (RDMA) through "verbs".
You should see an "ib" module with "verbs" in the name.
That is the preferred/faster mode for MPI.

3) However, you can also use InfiniBand for TCP/IP (IPoIB, slower).
As the output of your ifconfig shows, your ib0 interface is
also configured for TCP/IP.

4) You may have two interfaces (one card with two ports, or two cards)
in the nodes. One may not be connected to a switch (ib1). Check the
back of your nodes.

5) How to check whether MPI is using it depends a bit on which MPI
library you're using.
Which one? Open MPI, MVAPICH2, some vendor/proprietary one?
If it is Open MPI, the command "ompi_info" will tell.
With Open MPI there are also ways to enable/disable
InfiniBand at runtime.

6) Some InfiniBand diagnostics may also help (normally in /usr/sbin):

ibstat
ibhosts
ibnetdiscover

etc
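
For a fabric-wide view, iblinkinfo (from the same infiniband-diags
package, if I remember right) prints the state and speed of every
link the subnet manager can see:

  iblinkinfo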

OK, this is my pedestrian view of InfiniBand.
Now let's hear the experts on the list for deeper insights. :)

I hope this helps,
Gus Correa
Faraz Hussain
2017-08-02 17:37:42 UTC
Thanks for the tips. We have Open MPI installed. Here is some relevant
output from the commands you suggested. One confusing thing is that
ibstat shows only port 1 as active, but ibhosts shows "ports 2".

[***@lustwzb4 test]$ lsmod | grep ib
ib_ucm 12120 0
ib_ipoib 114971 0
ib_cm 42214 3 ib_ucm,rdma_cm,ib_ipoib
ib_uverbs 50244 2 rdma_ucm,ib_ucm
ib_umad 12562 0
mlx5_ib 103326 0
mlx5_core 85201 1 mlx5_ib
mlx4_ib 164865 0
ib_sa 24170 5 rdma_ucm,rdma_cm,ib_ipoib,ib_cm,mlx4_ib
ib_mad 43241 4 ib_cm,ib_umad,mlx4_ib,ib_sa
ib_core 95458 12
rdma_ucm,ib_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx5_ib,mlx4_ib,ib_sa,ib_mad
ib_addr 7732 3 rdma_cm,ib_uverbs,ib_core
ipv6 317829 145 ib_ipoib,mlx4_ib,ib_addr
mlx4_core 258183 2 mlx4_en,mlx4_ib
compat 23876 17
rdma_ucm,ib_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx5_ib,mlx5_core,mlx4_en,mlx4_ib,ib_sa,ib_mad,ib_core,ib_addr,mlx4_core
libcrc32c 1246 1 bnx2x

[***@lustwzb4 test]$ ompi_info | grep ib

MCA btl: openib (MCA v2.0, API v2.0, Component v1.8.4)

[***@lustwzb4 test]$ ibstat
CA 'mlx4_0'
CA type: MT4099
Number of ports: 2
Firmware version: 2.11.550
Hardware version: 0
Node GUID: 0xf452140300163b70
System image GUID: 0xf452140300163b73
Port 1:
State: Active
Physical state: LinkUp
Rate: 40 (FDR10)
Base lid: 3
LMC: 0
SM lid: 1
Capability mask: 0x02514868
Port GUID: 0xf452140300163b71
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0xf452140300163b72
Link layer: InfiniBand

[***@lustwzb4 test]$ ibhosts
Ca : 0xf45214030015bf60 ports 2 "lustwzb9 HCA-1"
Ca : 0xf45214030015c0e0 ports 2 "lustwzb16 HCA-1"
Ca : 0xf452140300163e20 ports 2 "lustwzb15 HCA-1"
Ca : 0xf45214030015c080 ports 2 "lustwzb14 HCA-1"
Ca : 0xf45214030015c290 ports 2 "lustwzb13 HCA-1"
Ca : 0xf45214030015bf70 ports 2 "lustwzb12 HCA-1"
Ca : 0xf452140300163bb0 ports 2 "lustwzb11 HCA-1"
Ca : 0xf452140300163c70 ports 2 "lustwzb10 HCA-1"
Ca : 0xf452140300163e30 ports 2 "lustwzb8 HCA-1"
Ca : 0xf452140300163b80 ports 2 "lustwzb7 HCA-1"
Ca : 0xf452140300163ba0 ports 2 "lustwzb6 HCA-1"
Ca : 0xf45214030015bfb0 ports 2 "lustwzb5 HCA-1"
Ca : 0xf45214030015bf90 ports 2 "lustwzb3 HCA-1"
Ca : 0xf452140300163df0 ports 2 "lustwzb2 HCA-1"
Ca : 0xf45214030015c0a0 ports 2 "lustwzb1 HCA-1"
Ca : 0x0002c90300b78240 ports 1 "lustwz99 HCA-1"
Ca : 0xf452140300163b70 ports 2 "lustwzb4 HCA-1"

[***@lustwzb4 test]$ ibnetdiscover
#
# Topology file: generated on Wed Aug 2 13:24:40 2017
#
# Initiated from node f452140300163b70 port f452140300163b71

vendid=0x2c9
devid=0xc738
sysimgguid=0x2c9030089cab0
switchguid=0x2c9030089cab0(2c9030089cab0)
Switch 32 "S-0002c9030089cab0" # "SwitchX - Mellanox
Technologies" base port 0 lid 2 lmc 0
[16] "H-0002c90300b78240"[1](2c90300b78241) # "lustwz99
HCA-1" lid 1 4xFDR10
[17] "H-f45214030015c0a0"[1](f45214030015c0a1) #
"lustwzb1 HCA-1" lid 5 4xFDR10
[18] "H-f452140300163df0"[1](f452140300163df1) #
"lustwzb2 HCA-1" lid 6 4xFDR10
[19] "H-f45214030015bf90"[1](f45214030015bf91) #
"lustwzb3 HCA-1" lid 4 4xFDR10
[20] "H-f452140300163b70"[1](f452140300163b71) #
"lustwzb4 HCA-1" lid 3 4xFDR10
[21] "H-f45214030015bfb0"[1](f45214030015bfb1) #
"lustwzb5 HCA-1" lid 7 4xFDR10
[22] "H-f452140300163ba0"[1](f452140300163ba1) #
"lustwzb6 HCA-1" lid 8 4xFDR10
[23] "H-f452140300163b80"[1](f452140300163b81) #
"lustwzb7 HCA-1" lid 9 4xFDR10
[24] "H-f452140300163e30"[1](f452140300163e31) #
"lustwzb8 HCA-1" lid 10 4xFDR10
[25] "H-f45214030015bf60"[1](f45214030015bf61) #
"lustwzb9 HCA-1" lid 11 4xFDR10
[26] "H-f452140300163c70"[1](f452140300163c71) #
"lustwzb10 HCA-1" lid 12 4xFDR10
[27] "H-f452140300163bb0"[1](f452140300163bb1) #
"lustwzb11 HCA-1" lid 13 4xFDR10
[28] "H-f45214030015bf70"[1](f45214030015bf71) #
"lustwzb12 HCA-1" lid 14 4xFDR10
[29] "H-f45214030015c290"[1](f45214030015c291) #
"lustwzb13 HCA-1" lid 15 4xFDR10
[30] "H-f45214030015c080"[1](f45214030015c081) #
"lustwzb14 HCA-1" lid 16 4xFDR10
[31] "H-f452140300163e20"[1](f452140300163e21) #
"lustwzb15 HCA-1" lid 17 4xFDR10
[32] "H-f45214030015c0e0"[1](f45214030015c0e1) #
"lustwzb16 HCA-1" lid 18 4xFDR10
vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015c0e3
caguid=0xf45214030015c0e0
Ca 2 "H-f45214030015c0e0" # "lustwzb16 HCA-1"
[1](f45214030015c0e1) "S-0002c9030089cab0"[32] # lid
18 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163e23
caguid=0xf452140300163e20
Ca 2 "H-f452140300163e20" # "lustwzb15 HCA-1"
[1](f452140300163e21) "S-0002c9030089cab0"[31] # lid
17 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015c083
caguid=0xf45214030015c080
Ca 2 "H-f45214030015c080" # "lustwzb14 HCA-1"
[1](f45214030015c081) "S-0002c9030089cab0"[30] # lid
16 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015bf73
caguid=0xf45214030015bf70
Ca 2 "H-f45214030015bf70" # "lustwzb12 HCA-1"
[1](f45214030015bf71) "S-0002c9030089cab0"[28] # lid
14 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015c293
caguid=0xf45214030015c290
Ca 2 "H-f45214030015c290" # "lustwzb13 HCA-1"
[1](f45214030015c291) "S-0002c9030089cab0"[29] # lid
15 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015bf63
caguid=0xf45214030015bf60
Ca 2 "H-f45214030015bf60" # "lustwzb9 HCA-1"
[1](f45214030015bf61) "S-0002c9030089cab0"[25] # lid
11 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163bb3
caguid=0xf452140300163bb0
Ca 2 "H-f452140300163bb0" # "lustwzb11 HCA-1"
[1](f452140300163bb1) "S-0002c9030089cab0"[27] # lid
13 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163c73
caguid=0xf452140300163c70
Ca 2 "H-f452140300163c70" # "lustwzb10 HCA-1"
[1](f452140300163c71) "S-0002c9030089cab0"[26] # lid
12 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163e33
caguid=0xf452140300163e30
Ca 2 "H-f452140300163e30" # "lustwzb8 HCA-1"
[1](f452140300163e31) "S-0002c9030089cab0"[24] # lid
10 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163b83
caguid=0xf452140300163b80
Ca 2 "H-f452140300163b80" # "lustwzb7 HCA-1"
[1](f452140300163b81) "S-0002c9030089cab0"[23] # lid
9 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015bfb3
caguid=0xf45214030015bfb0
Ca 2 "H-f45214030015bfb0" # "lustwzb5 HCA-1"
[1](f45214030015bfb1) "S-0002c9030089cab0"[21] # lid
7 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163ba3
caguid=0xf452140300163ba0
Ca 2 "H-f452140300163ba0" # "lustwzb6 HCA-1"
[1](f452140300163ba1) "S-0002c9030089cab0"[22] # lid
8 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163df3
caguid=0xf452140300163df0
Ca 2 "H-f452140300163df0" # "lustwzb2 HCA-1"
[1](f452140300163df1) "S-0002c9030089cab0"[18] # lid
6 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015bf93
caguid=0xf45214030015bf90
Ca 2 "H-f45214030015bf90" # "lustwzb3 HCA-1"
[1](f45214030015bf91) "S-0002c9030089cab0"[19] # lid
4 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015c0a3
caguid=0xf45214030015c0a0
Ca 2 "H-f45214030015c0a0" # "lustwzb1 HCA-1"
[1](f45214030015c0a1) "S-0002c9030089cab0"[17] # lid
5 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0x2c90300b78243
caguid=0x2c90300b78240
Ca 1 "H-0002c90300b78240" # "lustwz99 HCA-1"
[1](2c90300b78241) "S-0002c9030089cab0"[16] # lid
1 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163b73
caguid=0xf452140300163b70
Ca 2 "H-f452140300163b70" # "lustwzb4 HCA-1"
[1](f452140300163b71) "S-0002c9030089cab0"[20]
Gus Correa
2017-08-02 18:25:09 UTC
Hi Faraz

The output of lsmod looks good to me.
It shows that you have verbs, rdma, etc.
Presumably this happens on all nodes (the output you sent
is likely from one node, lustwzb4 or something like that).

ompi_info shows that Open MPI was built with openib (InfiniBand)
support. So, another good thing.
Therefore, by default Open MPI will try to use InfiniBand,
unless one of the nodes' IB cards has a problem,
or the IB kernel modules were not loaded, etc.
But you shouldn't worry about it until it happens.


I think ibhosts is just telling you that the NICs
have two ports ("ports 2", with a space in between).

Also, check the back of the nodes for the IB cable connections.
They're thick cables, should be connected to the IB switch.
You will *probably* find two IB ports in the nodes, with only
one connected. At least that is what your ifconfig output suggests.

ibstat runs only on the node you're in.
If you have a tool such as pdsh (parallel shell),
you can use it to run ibstat on all nodes.
Or just ssh to each node and run ibstat.

Anyway, I don't see any red flags or problems.
[Well, unless somebody else spots something that I haven't seen,
which is *very* possible.]
You seem to be good to go to run MPI (Open MPI) programs
over InfiniBand.


********

Now some items a bit off topic, not a specific answer to your
question, but hopefully they may help.

1) pdsh

Do you have a head/master node in the cluster?
Is it lustwz99 perhaps?
You could run pdsh from there.
It is very helpful for cluster-wide checks, etc.
(You can install it if not there, sometimes there
is also "dsh" already installed, although older.)

https://sourceforge.net/projects/pdsh/

[It may be available as package (rpm or similar)
for your Linux distribution also.]
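
For example, to check the IB port state on all compute nodes at once
(a sketch, using the host names from your ibhosts output):

  pdsh -w lustwzb[1-16] 'ibstat mlx4_0 1 | grep -E "State|Rate"'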

2) Open MPI details and customization

I'd suggest that you take a look at the Open MPI FAQ,
for more details, especially how to control things at runtime.
They have zillions of "MCA parameters" that allow a lot of
customization, if you care:

https://www.open-mpi.org/faq/

Their README file (you can get it in their tarball) is also
a good source of information.

3) Resource managers and integration with Open MPI

Also, if you have a "resource manager" (a.k.a. job queue system),
such as Torque/PBS, Slurm, SGE, you may want to look into integrating
it with Open MPI (if it is not already this way), and how to
set up the job scripts to take advantage of that integration.
The Open MPI FAQs have some material on this (and the Open MPI README
file also), but you may need to consult the "resource manager"
documentation as well. [If you're using Torque start with "man qsub".]


4) Open MPI installation: NFS vs. local

You may need to check if Open MPI is installed, say,
in an NFS shared directory, visible to all nodes,
or perhaps installed via package (RPM or similar) on
all nodes.
In the latter case, make sure you have the same exact
version (including the compiler that was used to build it) everywhere.
Installing on NFS makes life easier on small clusters (for updates, etc).
Make sure the NFS directory is exported/mounted to/by all nodes.

5) Environment variables and the "environment modules" package

You may need also to set some environment variables (such as PATH and
LD_LIBRARY_PATH) to ensure that Open MPI (and any other software) works.
The simplest way is brute force in the .bashrc/.tcshrc initialization
files.

However, I'd recommend taking a look at the "environment modules"
package, that provides a much cleaner solution, and makes it easy
for users to switch from one compiler to another, from one version
of Open MPI to another, etc.
If you provide a variety of versions of software, that is a must.

http://modules.sourceforge.net/


[Available as package in many Linux distros.]
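
As an illustration, a minimal Tcl modulefile for an Open MPI install
(the install prefix here is hypothetical):

  #%Module1.0
  prepend-path PATH            /opt/openmpi/bin
  prepend-path LD_LIBRARY_PATH /opt/openmpi/lib
  prepend-path MANPATH         /opt/openmpi/share/man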

**

I hope this helps,
Gus Correa
Christopher Samuel
2017-08-02 23:13:36 UTC
Post by Faraz Hussain
I have inherited a 20-node cluster that supposedly has an infiniband
network. I am testing some mpi applications and am seeing no performance
improvement with multiple nodes.
As you are using Open-MPI you should be able to tell it to only use IB
(and fail if it cannot) by doing this before running the application:

export OMPI_MCA_btl=openib,self,sm
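
and then run the benchmark as usual, e.g. (a sketch, using the OSU
test mentioned elsewhere in this thread):

  mpirun -np 2 -machinefile hostfile ./osu_bw

If the openib BTL is unusable, the inter-node run should then abort
with an "unreachable" error rather than silently falling back to TCP.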

Out of interest, are you running it via a batch system of some sort?

All the best,
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545

t***@renget.se
2017-08-03 06:59:16 UTC
I often use

mpirun --np 2 --machinefile mpd.hosts mpitests-osu_latency
mpirun --np 2 --machinefile mpd.hosts mpitests-osu_bw

to test bandwidth and latency between two specific nodes (listed in
mpd.hosts). On a CentOS/Red Hat system these can be installed from the
package mpitests-openmpi.

/jon
Faraz Hussain
2017-08-03 14:10:45 UTC
Thanks, I installed the MPI tests from Ohio State. I ran osu_bw and
got the results below. What is confusing is I get the same result
whether I use tcp or openib (by doing --mca btl openib|tcp,self with
my mpirun command). I also tried changing the environment variable:
export OMPI_MCA_btl=tcp,self,sm. Results are the same regardless of
tcp or openib.

And when I do ifconfig -a I still see zero traffic reported on the
ib0 and ib1 interfaces.

# OSU MPI Bandwidth Test v5.3.2
# Size Bandwidth (MB/s)
1 1.23
2 6.55
4 12.83
8 25.42
16 49.35
32 101.99
64 190.78
128 362.64
256 712.64
512 576.00
1024 2410.36
2048 3548.19
4096 3427.19
8192 4259.77
16384 4399.37
32768 4566.43
65536 4617.49
131072 4682.98
262144 4690.70
524288 4701.48
1048576 4697.40
2097152 4706.88
4194304 4710.76
Michael Di Domenico
2017-08-03 14:21:12 UTC
Thanks, I installed the MPI tests from Ohio State. I ran osu_bw and got the
results below. What is confusing is I get the same result if I use tcp or
openib ( by doing --mca btl openib|tcp,self with my mpirun command ). I also
tried changing the environment variable: export OMPI_MCA_btl=tcp,self,sm .
Results are the same regardless of tcp or openib..
And when I do ifconfig -a I still see zero traffic reported for the ib0 and
ib1 network.
If Open MPI uses RDMA for the traffic, ib0/ib1 will not show traffic;
you have to use perfquery.
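
For example (a sketch; perfquery ships with infiniband-diags, and
LID 3 / port 1 match the ibstat output earlier in this thread):

  perfquery 3 1      # read the counters for LID 3, port 1
  perfquery -r 3 1   # read them, then reset

Compare PortXmitData/PortRcvData before and after a run; they count
in units of 4 bytes.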
John Hearns via Beowulf
2017-08-03 14:33:28 UTC
Faraz,
I think that you have got things sorted out.
However I think that the number of options in OpenMPI is starting to
confuse you. But do not lose heart!
I have been in the same place myself many times. Specifically I am
thinking of one time when a customer asked me to benchmark the latency
across 10Gbps interfaces, on a cluster where there was already a 1Gbps
network and a Mellanox InfiniBand network. I had to be careful to
exclude the networks I did NOT want!

I suggest that you set the verbose flag in mpirun and keep a copy of
the output. Go through that output line by line, making sure you
understand what it is telling you.
I have done that many times!
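
Something like this (a sketch; btl_base_verbose is the usual knob):

  mpirun --mca btl openib,self,sm --mca btl_base_verbose 30 \
         -np 2 -machinefile hostfile ./osu_bw 2>&1 | tee verbose.log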


Secondly, you say "I also tried changing the environment variable:
export OMPI_MCA_btl=tcp,self,sm" - remember that you can switch OFF a
transport by prefixing it with ^, as in --mca btl ^tcp.
Please give that a try - i.e. explicitly exclude tcp so that only the
openib (plus sm and self) transports remain.
Faraz Hussain
2017-08-03 15:41:17 UTC
Thanks for everyone's help. Using the Ohio State tests, qperf and
perfquery I am convinced the IB network is working. The only thing
that still bothers me is I cannot get mpirun to use the tcp network.
I tried all combinations of --mca btl to no avail. It is not
important, more just curiosity.
John Hearns via Beowulf
2017-08-03 15:59:50 UTC
Faraz, do you mean the IPoIB tcp network, i.e. the ib0 interface?
Good question. I would advise joining the Open MPI list. They are very
friendly over there.
I have always seen polite and helpful replies even to dumb questions
there (such as the ones I ask).

I actually had to do something similar recently - we have nodes with
only IB, so I had to run Open MPI over InfiniBand,
but also say that the control connection had to use the ib0 interface.
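
What I used looked roughly like this (a sketch from memory;
oob_tcp_if_include is the relevant MCA parameter):

  mpirun --mca btl openib,self,sm --mca oob_tcp_if_include ib0 \
         -np 2 -machinefile hostfile ./osu_bw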
Gus Correa
2017-08-03 16:37:45 UTC
Hi Faraz

+1 to John's suggestion of joining the Open MPI list.
Your questions are now veering towards Open MPI specifics,
and you will get great feedback on this topic there.

If you want to use TCP/IP instead of RDMA
(say, IPoIB or Gigabit Ethernet cards),
you can use it if you tell Open MPI not to use openib
(--mca btl ^openib).

You can specify the interfaces to use also.
Please, see these Open MPI FAQ:

https://www.open-mpi.org/faq/?category=tcp#tcp-selection

These could be your IPoIB interfaces (ib0 or the corresponding subnet
addresses), or Ethernet.
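
For example (a sketch; btl_tcp_if_include is described in the FAQ above):

  # TCP over the IPoIB interface only:
  mpirun --mca btl tcp,self --mca btl_tcp_if_include ib0 \
         -np 2 -machinefile hostfile ./osu_bw

  # TCP over the Ethernet interface only (eth0 is an assumption):
  mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 \
         -np 2 -machinefile hostfile ./osu_bw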

Worth looking also at the FAQ about Infiniband:
https://www.open-mpi.org/faq/?category=openfabrics

Verbosity can be turned on with MCA parameters that you can find with:

ompi_info --all | grep verbose

btl_base_verbose is a good start.

Note that Open MPI also uses network interfaces for the
startup, management, and wrap-up communications.
This uses another framework, "out of band" (oob), separate from the
"btl" (byte transfer layer), with a corresponding
set of MCA parameters that look like this: "--mca oob ..."

Overall their FAQ has very good information, as does the
README file in their tarball.
https://www.open-mpi.org/faq/
https://github.com/open-mpi/ompi/blob/master/README

I hope this helps,
Gus Correa
Jeff Johnson
2017-08-03 16:16:14 UTC
Faraz,

I didn't notice any tests where you actually tested the IP layer. You
should run some iperf tests between nodes to make sure IPoIB functions.
Your InfiniBand/RDMA layer can be working fine while IPoIB is
dysfunctional. You need to ensure the IPoIB configuration, like any IP
environment, is the same on all nodes (network/subnet, netmask, MTU,
etc.) and that all of the nodes are configured for the same mode
(connected vs datagram). If you can't run iperf then there is something
broken in the IPoIB configuration.
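
For example (a sketch; classic iperf2 usage, with the server's ib0
address left as a placeholder):

  # on lustwzb4:
  iperf -s
  # on lustwzb5, pointing at lustwzb4's ib0 address:
  iperf -c <ib0 address of lustwzb4> -t 30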

--Jeff
Faraz Hussain
2017-08-03 16:50:09 UTC
Here are the results from the tcp and rdma tests. I take them to mean
that the IB network is performing at the expected speed.

[***@lustwzb5 ~]$ qperf lustwzb4 -t 30 tcp_lat tcp_bw
tcp_lat:
latency = 24.2 us
tcp_bw:
bw = 1.19 GB/sec
[***@lustwzb5 ~]$ qperf lustwzb4 -t 30 rc_lat rc_bw
rc_lat:
latency = 7.76 us
rc_bw:
bw = 4.56 GB/sec
Jon Tegner
2017-08-03 17:54:07 UTC
Isn't the latency over RDMA a bit high? When I've tested QDR and FDR I
tend to see around 1 us (using mpitests-osu_latency) between two nodes.

/jon
Post by Faraz Hussain
Here are the results from the tcp and rdma tests. I take them to mean
that the IB network is performing at the expected speed.
tcp_lat: latency = 24.2 us
tcp_bw:  bw = 1.19 GB/sec
rc_lat:  latency = 7.76 us
rc_bw:   bw = 4.56 GB/sec
Faraz Hussain
2017-08-03 18:58:33 UTC
Here are the latency numbers when running the Ohio State test:

mpirun -np 2 -machinefile hostfile ./osu_latency

# OSU MPI Latency Test v5.3.2
# Size Latency (us)
0 1.57
1 1.22
2 1.19
4 1.20
8 1.17
16 1.20
32 1.23
64 1.29
128 1.42
256 1.76
512 2.07
1024 2.62
2048 3.63
4096 4.65
8192 6.46
16384 10.34
32768 13.37
65536 19.03
131072 33.04
262144 61.70
524288 119.93
1048576 231.21
2097152 455.84
4194304 907.89