Thanks for the tips. We have Open MPI installed. Here is some relevant
output from the commands you suggested. One confusing thing is that ibstat
shows only port 1 as active, but ibhosts reports "ports 2".
[***@lustwzb4 test]$ lsmod | grep ib
ib_ucm 12120 0
ib_ipoib 114971 0
ib_cm 42214 3 ib_ucm,rdma_cm,ib_ipoib
ib_uverbs 50244 2 rdma_ucm,ib_ucm
ib_umad 12562 0
mlx5_ib 103326 0
mlx5_core 85201 1 mlx5_ib
mlx4_ib 164865 0
ib_sa 24170 5 rdma_ucm,rdma_cm,ib_ipoib,ib_cm,mlx4_ib
ib_mad 43241 4 ib_cm,ib_umad,mlx4_ib,ib_sa
ib_core 95458 12 rdma_ucm,ib_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx5_ib,mlx4_ib,ib_sa,ib_mad
ib_addr 7732 3 rdma_cm,ib_uverbs,ib_core
ipv6 317829 145 ib_ipoib,mlx4_ib,ib_addr
mlx4_core 258183 2 mlx4_en,mlx4_ib
compat 23876 17 rdma_ucm,ib_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx5_ib,mlx5_core,mlx4_en,mlx4_ib,ib_sa,ib_mad,ib_core,ib_addr,mlx4_core
libcrc32c 1246 1 bnx2x
[***@lustwzb4 test]$ ompi_info | grep ib
MCA btl: openib (MCA v2.0, API v2.0, Component v1.8.4)
[***@lustwzb4 test]$ ibstat
CA 'mlx4_0'
CA type: MT4099
Number of ports: 2
Firmware version: 2.11.550
Hardware version: 0
Node GUID: 0xf452140300163b70
System image GUID: 0xf452140300163b73
Port 1:
State: Active
Physical state: LinkUp
Rate: 40 (FDR10)
Base lid: 3
LMC: 0
SM lid: 1
Capability mask: 0x02514868
Port GUID: 0xf452140300163b71
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0xf452140300163b72
Link layer: InfiniBand
[***@lustwzb4 test]$ ibhosts
Ca : 0xf45214030015bf60 ports 2 "lustwzb9 HCA-1"
Ca : 0xf45214030015c0e0 ports 2 "lustwzb16 HCA-1"
Ca : 0xf452140300163e20 ports 2 "lustwzb15 HCA-1"
Ca : 0xf45214030015c080 ports 2 "lustwzb14 HCA-1"
Ca : 0xf45214030015c290 ports 2 "lustwzb13 HCA-1"
Ca : 0xf45214030015bf70 ports 2 "lustwzb12 HCA-1"
Ca : 0xf452140300163bb0 ports 2 "lustwzb11 HCA-1"
Ca : 0xf452140300163c70 ports 2 "lustwzb10 HCA-1"
Ca : 0xf452140300163e30 ports 2 "lustwzb8 HCA-1"
Ca : 0xf452140300163b80 ports 2 "lustwzb7 HCA-1"
Ca : 0xf452140300163ba0 ports 2 "lustwzb6 HCA-1"
Ca : 0xf45214030015bfb0 ports 2 "lustwzb5 HCA-1"
Ca : 0xf45214030015bf90 ports 2 "lustwzb3 HCA-1"
Ca : 0xf452140300163df0 ports 2 "lustwzb2 HCA-1"
Ca : 0xf45214030015c0a0 ports 2 "lustwzb1 HCA-1"
Ca : 0x0002c90300b78240 ports 1 "lustwz99 HCA-1"
Ca : 0xf452140300163b70 ports 2 "lustwzb4 HCA-1"
[***@lustwzb4 test]$ ibnetdiscover
#
# Topology file: generated on Wed Aug 2 13:24:40 2017
#
# Initiated from node f452140300163b70 port f452140300163b71
vendid=0x2c9
devid=0xc738
sysimgguid=0x2c9030089cab0
switchguid=0x2c9030089cab0(2c9030089cab0)
Switch 32 "S-0002c9030089cab0" # "SwitchX - Mellanox Technologies" base port 0 lid 2 lmc 0
[16] "H-0002c90300b78240"[1](2c90300b78241) # "lustwz99 HCA-1" lid 1 4xFDR10
[17] "H-f45214030015c0a0"[1](f45214030015c0a1) # "lustwzb1 HCA-1" lid 5 4xFDR10
[18] "H-f452140300163df0"[1](f452140300163df1) # "lustwzb2 HCA-1" lid 6 4xFDR10
[19] "H-f45214030015bf90"[1](f45214030015bf91) # "lustwzb3 HCA-1" lid 4 4xFDR10
[20] "H-f452140300163b70"[1](f452140300163b71) # "lustwzb4 HCA-1" lid 3 4xFDR10
[21] "H-f45214030015bfb0"[1](f45214030015bfb1) # "lustwzb5 HCA-1" lid 7 4xFDR10
[22] "H-f452140300163ba0"[1](f452140300163ba1) # "lustwzb6 HCA-1" lid 8 4xFDR10
[23] "H-f452140300163b80"[1](f452140300163b81) # "lustwzb7 HCA-1" lid 9 4xFDR10
[24] "H-f452140300163e30"[1](f452140300163e31) # "lustwzb8 HCA-1" lid 10 4xFDR10
[25] "H-f45214030015bf60"[1](f45214030015bf61) # "lustwzb9 HCA-1" lid 11 4xFDR10
[26] "H-f452140300163c70"[1](f452140300163c71) # "lustwzb10 HCA-1" lid 12 4xFDR10
[27] "H-f452140300163bb0"[1](f452140300163bb1) # "lustwzb11 HCA-1" lid 13 4xFDR10
[28] "H-f45214030015bf70"[1](f45214030015bf71) # "lustwzb12 HCA-1" lid 14 4xFDR10
[29] "H-f45214030015c290"[1](f45214030015c291) # "lustwzb13 HCA-1" lid 15 4xFDR10
[30] "H-f45214030015c080"[1](f45214030015c081) # "lustwzb14 HCA-1" lid 16 4xFDR10
[31] "H-f452140300163e20"[1](f452140300163e21) # "lustwzb15 HCA-1" lid 17 4xFDR10
[32] "H-f45214030015c0e0"[1](f45214030015c0e1) # "lustwzb16 HCA-1" lid 18 4xFDR10
vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015c0e3
caguid=0xf45214030015c0e0
Ca 2 "H-f45214030015c0e0" # "lustwzb16 HCA-1"
[1](f45214030015c0e1) "S-0002c9030089cab0"[32] # lid 18 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10
vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163e23
caguid=0xf452140300163e20
Ca 2 "H-f452140300163e20" # "lustwzb15 HCA-1"
[1](f452140300163e21) "S-0002c9030089cab0"[31] # lid 17 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10
vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015c083
caguid=0xf45214030015c080
Ca 2 "H-f45214030015c080" # "lustwzb14 HCA-1"
[1](f45214030015c081) "S-0002c9030089cab0"[30] # lid 16 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10
vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015bf73
caguid=0xf45214030015bf70
Ca 2 "H-f45214030015bf70" # "lustwzb12 HCA-1"
[1](f45214030015bf71) "S-0002c9030089cab0"[28] # lid 14 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10
vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015c293
caguid=0xf45214030015c290
Ca 2 "H-f45214030015c290" # "lustwzb13 HCA-1"
[1](f45214030015c291) "S-0002c9030089cab0"[29] # lid 15 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10
vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015bf63
caguid=0xf45214030015bf60
Ca 2 "H-f45214030015bf60" # "lustwzb9 HCA-1"
[1](f45214030015bf61) "S-0002c9030089cab0"[25] # lid 11 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10
vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163bb3
caguid=0xf452140300163bb0
Ca 2 "H-f452140300163bb0" # "lustwzb11 HCA-1"
[1](f452140300163bb1) "S-0002c9030089cab0"[27] # lid 13 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10
vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163c73
caguid=0xf452140300163c70
Ca 2 "H-f452140300163c70" # "lustwzb10 HCA-1"
[1](f452140300163c71) "S-0002c9030089cab0"[26] # lid 12 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10
vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163e33
caguid=0xf452140300163e30
Ca 2 "H-f452140300163e30" # "lustwzb8 HCA-1"
[1](f452140300163e31) "S-0002c9030089cab0"[24] # lid 10 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10
vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163b83
caguid=0xf452140300163b80
Ca 2 "H-f452140300163b80" # "lustwzb7 HCA-1"
[1](f452140300163b81) "S-0002c9030089cab0"[23] # lid 9 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10
vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015bfb3
caguid=0xf45214030015bfb0
Ca 2 "H-f45214030015bfb0" # "lustwzb5 HCA-1"
[1](f45214030015bfb1) "S-0002c9030089cab0"[21] # lid 7 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10
vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163ba3
caguid=0xf452140300163ba0
Ca 2 "H-f452140300163ba0" # "lustwzb6 HCA-1"
[1](f452140300163ba1) "S-0002c9030089cab0"[22] # lid 8 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10
vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163df3
caguid=0xf452140300163df0
Ca 2 "H-f452140300163df0" # "lustwzb2 HCA-1"
[1](f452140300163df1) "S-0002c9030089cab0"[18] # lid 6 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10
vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015bf93
caguid=0xf45214030015bf90
Ca 2 "H-f45214030015bf90" # "lustwzb3 HCA-1"
[1](f45214030015bf91) "S-0002c9030089cab0"[19] # lid 4 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10
vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015c0a3
caguid=0xf45214030015c0a0
Ca 2 "H-f45214030015c0a0" # "lustwzb1 HCA-1"
[1](f45214030015c0a1) "S-0002c9030089cab0"[17] # lid 5 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10
vendid=0x2c9
devid=0x1003
sysimgguid=0x2c90300b78243
caguid=0x2c90300b78240
Ca 1 "H-0002c90300b78240" # "lustwz99 HCA-1"
[1](2c90300b78241) "S-0002c9030089cab0"[16] # lid 1 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10
vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163b73
caguid=0xf452140300163b70
Ca 2 "H-f452140300163b70" # "lustwzb4 HCA-1"
[1](f452140300163b71) "S-0002c9030089cab0"[20]
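
On the ibstat versus ibhosts mismatch: the "ports 2" field in the ibhosts lines looks like a count of ports on each adapter (it matches "Number of ports: 2" in ibstat, and lustwz99 shows "ports 1"), not the number of the active port. Under that reading the two tools agree: each HCA has two ports and only port 1 is cabled to the switch, which is also what the ibnetdiscover links show. A minimal Python sketch, assuming the ibstat text layout shown above (the function names are made up for illustration), that pulls out the port count and the per-port state:

```python
import re

# Abbreviated copy of the ibstat output from this post.
SAMPLE = """\
CA 'mlx4_0'
CA type: MT4099
Number of ports: 2
Port 1:
State: Active
Physical state: LinkUp
Rate: 40 (FDR10)
Base lid: 3
Port GUID: 0xf452140300163b71
Port 2:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 0
Port GUID: 0xf452140300163b72
"""


def parse_ibstat(text):
    """Parse ibstat-style text into {ca_name: {"num_ports": int, "ports": {n: {...}}}}."""
    cas = {}
    ca = None    # adapter currently being read
    port = None  # port section currently being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'$", line)
        if m:
            ca = cas.setdefault(m.group(1), {"num_ports": None, "ports": {}})
            port = None
            continue
        m = re.match(r"Port (\d+):$", line)  # "Port GUID: ..." does not match
        if m and ca is not None:
            port = ca["ports"].setdefault(int(m.group(1)), {})
            continue
        if ca is None or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Number of ports":
            ca["num_ports"] = int(value)
        elif port is not None:
            port[key] = value
    return cas


def active_ports(cas):
    """Return (ca_name, port_number) pairs whose State is Active."""
    return [
        (name, number)
        for name, ca in cas.items()
        for number, fields in sorted(ca["ports"].items())
        if fields.get("State") == "Active"
    ]


if __name__ == "__main__":
    parsed = parse_ibstat(SAMPLE)
    for name, ca in parsed.items():
        print(name, "has", ca["num_ports"], "ports; active:", active_ports({name: ca}))
```

On a live node the same function could be fed the text captured from running ibstat instead of the embedded sample.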
Post by Gus Correa
Hi Faraz
1) lsmod | grep ib should show whether the InfiniBand kernel modules are loaded.
2) InfiniBand normally uses remote DMA (RDMA) through "verbs".
You should see an "ib" module with "verbs" in the name.
That is the preferred/faster mode for MPI.
3) However, you can also use InfiniBand for TCP/IP (slower).
As the output of your ifconfig shows, your ib0 interface is
also configured for TCP/IP.
4) You may have two interfaces (one card with two ports, or two cards) in
the nodes. One may not be connected to a switch (ib1). Check the
back of your nodes.
5) Whether MPI is using it depends a bit on which MPI library
you're using.
Which one? Open MPI, MVAPICH2, some vendor/proprietary one?
If it is Open MPI, the command "ompi_info" will tell.
With Open MPI there are also ways to enable/disable
InfiniBand at runtime.
6) Some InfiniBand diagnostics may also help (normally in /usr/sbin):
ibstat
ibhosts
ibnetdiscover
etc
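
For point 5, one way to check whether Open MPI actually uses InfiniBand is to pin the transport at run time with the "btl" MCA parameter and compare timings. A minimal sketch in Python that only builds the mpirun command lines; "./my_mpi_app" and the process count of 32 are placeholders, and the BTL names (openib, tcp, sm, self) are assumed to be those of the Open MPI 1.8 series that ompi_info reports above:

```python
import shlex


def mpirun_command(program, nprocs, btl, verbose=False):
    """Build an mpirun argv list that restricts Open MPI to the given BTL list."""
    cmd = ["mpirun", "-np", str(nprocs), "--mca", "btl", btl]
    if verbose:
        # btl_base_verbose asks Open MPI to log which BTL components it selects.
        cmd += ["--mca", "btl_base_verbose", "30"]
    cmd.append(program)
    return cmd


def as_shell(cmd):
    """Render an argv list as a copy-pasteable shell command line."""
    return " ".join(shlex.quote(arg) for arg in cmd)


# Hypothetical benchmark binary and process count; substitute your own.
PROGRAM = "./my_mpi_app"
NPROCS = 32

# InfiniBand verbs only (plus shared memory and loopback within a node).
ib_only = mpirun_command(PROGRAM, NPROCS, "openib,self,sm", verbose=True)
# TCP only, for comparison.
tcp_only = mpirun_command(PROGRAM, NPROCS, "tcp,self,sm")
# Everything except openib ("^" means exclude).
no_ib = mpirun_command(PROGRAM, NPROCS, "^openib")

if __name__ == "__main__":
    for cmd in (ib_only, tcp_only, no_ib):
        print(as_shell(cmd))
```

If the openib-only run fails at startup while the tcp-only run works, that would suggest the verbs path is not usable on those nodes; if both run with similar timings, the bottleneck is probably somewhere other than the interconnect.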
OK, this is my pedestrian view of InfiniBand.
Now let's hear the experts in the list for deeper insights. :)
I hope this helps,
Gus Correa
Post by Faraz Hussain
I have inherited a 20-node cluster that supposedly has an
InfiniBand network. I am testing some MPI applications and am
seeing no performance improvement with multiple nodes. So I am
wondering if the InfiniBand network even works?
The output of ifconfig -a shows an ib0 and ib1 network. I ran
Speed: 40000Mb/s
Link detected: no
Speed: 10000Mb/s
Link detected: no
I am assuming this means it is down? Any idea how to debug further
and restart it?
Thanks!
_______________________________________________
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, ***@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit