A Brief Look at Troubleshooting IB Cabling Faults with iblinkinfo

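In a complex network topology, tracking down a fault such as hosts being unable to reach each other over IP often takes a great deal of time and effort. On an InfiniBand fabric, the iblinkinfo tool offers a quick way to check whether the fabric is wired as expected, so that when a fault occurs its root cause can be located and fixed promptly.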

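I recently ran into a rather puzzling problem at work. In one test environment, after the IB adapters were cabled, the compute layer could not ping the storage layer, even though the cabled IB ports all reported a normal link state and every corresponding network interface had a correct IP configuration. The environment has three layers: the compute layer and the storage layer are both connected to InfiniBand switches, which provide the connectivity between them. With this layout, finding the cause means checking the compute layer, the switch layer, and the storage layer in turn, to first narrow down which layer the problem is in.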

The compute-layer network configuration is as follows:

#ibdev2netdev           # ==> node1 IB port status
mlx4_0 port 1 ==> ib0 (Up)
mlx4_0 port 2 ==> ib1 (Up)
mlx4_1 port 1 ==> ib2 (Up)
mlx4_1 port 2 ==> ib3 (Up)
#ifconfig -a | grep -E "mtu|inet"       # ==> node1 IP configuration
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4092       
        inet xx.xx.22.12  netmask 255.255.255.0  broadcast xx.xx.22.255
        inet6 fe80::f652:1403:93:1791  prefixlen 64  scopeid 0x20<link>
ib1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4092
        inet xx.xx.23.12  netmask 255.255.255.0  broadcast xx.xx.23.255
        inet6 fe80::f652:1403:93:1792  prefixlen 64  scopeid 0x20<link>
ib2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4092
        inet xx.xx.24.12  netmask 255.255.255.0  broadcast xx.xx.24.255
        inet6 fe80::f652:1403:22:ba41  prefixlen 64  scopeid 0x20<link>
ib3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4092
        inet xx.xx.25.12  netmask 255.255.255.0  broadcast xx.xx.25.255
        inet6 fe80::f652:1403:22:ba42  prefixlen 64  scopeid 0x20<link>
#ibdev2netdev           # ==> node2 IB port status
mlx4_0 port 1 ==> ib0 (Up)
mlx4_0 port 2 ==> ib1 (Up)
mlx4_1 port 1 ==> ib2 (Up)
mlx4_1 port 2 ==> ib3 (Up)
#ifconfig -a | grep -E "mtu|inet"       # ==> node2 IP configuration
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4092
        inet xx.xx.22.94  netmask 255.255.255.0  broadcast xx.xx.22.255
        inet6 fe80::f652:1403:72:c171  prefixlen 64  scopeid 0x20<link>
ib1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4092
        inet xx.xx.23.94  netmask 255.255.255.0  broadcast xx.xx.23.255
        inet6 fe80::f652:1403:72:c172  prefixlen 64  scopeid 0x20<link>
ib2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4092
        inet xx.xx.24.94  netmask 255.255.255.0  broadcast xx.xx.24.255
        inet6 fe80::26be:5ff:ffc6:78d1  prefixlen 64  scopeid 0x20<link>
ib3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4092
        inet xx.xx.25.94  netmask 255.255.255.0  broadcast xx.xx.25.255
        inet6 fe80::26be:5ff:ffc6:78d2  prefixlen 64  scopeid 0x20<link>

The storage-layer network configuration is as follows:

#ibdev2netdev           # ==> node3 IB port status
mlx4_0 port 1 ==> ib0 (Up)
mlx4_0 port 2 ==> ib1 (Up)
mlx4_1 port 1 ==> ib2 (Down)
mlx4_1 port 2 ==> ib3 (Down)
#ifconfig -a | grep -E "mtu|inet"       # ==> node3 IP configuration
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4092
        inet xx.xx.22.22  netmask 255.255.255.0  broadcast xx.xx.22.255
        inet6 fe80::202:c903:31:8321  prefixlen 64  scopeid 0x20<link>
ib1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4092
        inet xx.xx.24.22  netmask 255.255.255.0  broadcast xx.xx.24.255
        inet6 fe80::202:c903:31:8322  prefixlen 64  scopeid 0x20<link>
ib2: flags=4098<BROADCAST,MULTICAST>  mtu 4092
ib3: flags=4098<BROADCAST,MULTICAST>  mtu 4092
#ibdev2netdev           # ==> node4 IB port status
mlx4_0 port 1 ==> ib0 (Up)
mlx4_0 port 2 ==> ib1 (Up)
mlx4_1 port 1 ==> ib2 (Down)
mlx4_1 port 2 ==> ib3 (Down)
#ifconfig -a | grep -E "mtu|inet"       # ==> node4 IP configuration
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4092
        inet xx.xx.22.23  netmask 255.255.255.0  broadcast xx.xx.22.255
        inet6 fe80::202:c903:30:a9a1  prefixlen 64  scopeid 0x20<link>
ib1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4092
        inet xx.xx.24.23  netmask 255.255.255.0  broadcast xx.xx.24.255
        inet6 fe80::202:c903:30:a9a2  prefixlen 64  scopeid 0x20<link>
ib2: flags=4098<BROADCAST,MULTICAST>  mtu 4092
ib3: flags=4098<BROADCAST,MULTICAST>  mtu 4092
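
Before moving on, it can help to see which interfaces are meant to share a subnet. The following is a minimal Python sketch, not part of the original troubleshooting session. It assumes the masked "xx.xx" prefix is 192.168 (the prefix used in the ping commands in this article), and the names INTERFACES and group_by_subnet are made up for illustration.

```python
import ipaddress
from collections import defaultdict

# Addresses transcribed from the ifconfig output above.
# Assumption: the masked "xx.xx" prefix is 192.168, as in the ping commands.
INTERFACES = {
    "node1": {"ib0": "192.168.22.12/24", "ib1": "192.168.23.12/24",
              "ib2": "192.168.24.12/24", "ib3": "192.168.25.12/24"},
    "node2": {"ib0": "192.168.22.94/24", "ib1": "192.168.23.94/24",
              "ib2": "192.168.24.94/24", "ib3": "192.168.25.94/24"},
    "node3": {"ib0": "192.168.22.22/24", "ib1": "192.168.24.22/24"},
    "node4": {"ib0": "192.168.22.23/24", "ib1": "192.168.24.23/24"},
}


def group_by_subnet(interfaces):
    """Map each IPv4 subnet to the sorted (node, interface) pairs inside it."""
    groups = defaultdict(list)
    for node, ifaces in interfaces.items():
        for iface, cidr in ifaces.items():
            network = ipaddress.ip_interface(cidr).network
            groups[str(network)].append((node, iface))
    return {net: sorted(members) for net, members in groups.items()}


SUBNETS = group_by_subnet(INTERFACES)
for net in sorted(SUBNETS):
    print(net, SUBNETS[net])
```

One detail stands out in the result: on the compute nodes the 192.168.24.x subnet sits on ib2 (mlx4_1 port 1), while on the storage nodes it sits on ib1 (mlx4_0 port 2), so the same subnet is carried by different adapters on the two layers.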

Symptoms

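With the configuration above, the two compute-layer servers can reach each other over IP, and the two storage-layer servers can reach each other as well, but the compute layer cannot reach the storage layer: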

#ping 192.168.24.22                 # storage-layer test: node4 ==> node3
PING xx.xx.24.22 (xx.xx.24.22) 56(84) bytes of data.
64 bytes from xx.xx.24.22: icmp_seq=1 ttl=64 time=0.132 ms
^C
--- xx.xx.24.22 ping statistics ---
1 packets transmitted,  received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.119/0.125/0.132/0.012 ms
#ping 192.168.22.94                 # compute-layer test: node1 ==> node2
PING xx.xx.22.94 (xx.xx.22.94) 56(84) bytes of data.
64 bytes from xx.xx.22.94: icmp_seq=1 ttl=64 time=0.076 ms
^C
--- xx.xx.22.94 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.076/0.076/0.076/0.000 ms
#ping 192.168.22.22                 # compute-to-storage test: node2 ==> node3
PING xx.xx.22.22 (xx.xx.22.22) 56(84) bytes of data.
^C
--- xx.xx.22.22 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2001ms
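
When many node pairs have to be tested, reading each ping summary by eye gets tedious. As a rough sketch (the function names are hypothetical, and the pattern only targets the Linux iputils summary line seen above), the summary can be parsed like this:

```python
import re

# Matches the iputils summary line, e.g.
# "3 packets transmitted, 0 received, 100% packet loss, time 2001ms".
# The optional "+N errors," part is printed by iputils when ICMP errors arrive.
SUMMARY_RE = re.compile(
    r"(?P<sent>\d+) packets transmitted, (?P<received>\d+) received,"
    r"(?: \+\d+ errors,)? (?P<loss>\d+(?:\.\d+)?)% packet loss"
)


def parse_ping_summary(output):
    """Return (sent, received, loss_percent), or None if no summary is found."""
    match = SUMMARY_RE.search(output)
    if match is None:
        return None
    return int(match["sent"]), int(match["received"]), float(match["loss"])


def reachable(output):
    """Treat a host as reachable if at least one reply was received."""
    summary = parse_ping_summary(output)
    return summary is not None and summary[1] > 0


compute_to_storage = "3 packets transmitted, 0 received, 100% packet loss, time 2001ms"
compute_to_compute = "1 packets transmitted, 1 received, 0% packet loss, time 0ms"
print(reachable(compute_to_storage), reachable(compute_to_compute))
```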

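So compute-to-compute traffic works, storage-to-storage traffic works, and only compute-to-storage traffic fails, which points to the switch layer. There are two possibilities. One is that a switch itself has failed; that would cut off every port attached to it, so the compute nodes could not reach each other either, which is clearly not what we see here. The other is that the packets sent by ping reach a switch but cannot be delivered from there to the destination port, which can happen in some setups when cables are plugged into the wrong place.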

Diagnosis

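With the fault narrowed down to the switch layer, and given that this is an InfiniBand fabric whose IB ports were already seen to be up, we know every cable is plugged into a switch. What that does not rule out is a cable plugged into the wrong switch. To check how the IB adapters in the cluster are actually cabled, we use the iblinkinfo tool. It lists every port of the switch together with what is attached to it: the link width and speed, the link state, the LID and port number at each end, and the name of the peer node.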
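
To compare the cabling of many ports, the link lines can be turned into records. The sketch below is only an illustration: it assumes the exact line layout shown in the listings of this article (other versions of the tool may print differently), and LINK_RE, parse_link, and active_links are names invented here.

```python
import re

# One link line of iblinkinfo output: an optional port GUID (printed on CA
# lines), the local LID and port, the link description in ==( ... )==>,
# then the peer LID, peer port, and the quoted peer node description.
LINK_RE = re.compile(
    r'^\s*(?:0x\w+\s+)?'
    r'(?P<lid>\d+)\s+(?P<port>\d+)\[[^\]]*\]\s*'
    r'==\((?P<link>[^)]*)\)==>\s*'
    r'(?:(?P<peer_lid>\d+)\s+(?P<peer_port>\d+))?\[[^\]]*\]\s*'
    r'"(?P<peer_name>[^"]*)"'
)


def parse_link(line):
    """Parse one link line into a dict, or return None for any other line."""
    m = LINK_RE.match(line)
    if m is None:
        return None
    # The link field reads "4X 14.0625 Gbps Active/ LinkUp" or "Down/ Polling".
    logical, _, physical = m["link"].rpartition("/")
    tokens = logical.split()
    return {
        "lid": int(m["lid"]),
        "port": int(m["port"]),
        "state": tokens[-1] if tokens else "",
        "phys_state": physical.strip(),
        "peer_lid": int(m["peer_lid"]) if m["peer_lid"] else None,
        "peer_port": int(m["peer_port"]) if m["peer_port"] else None,
        "peer_name": m["peer_name"],
    }


def active_links(text):
    """Return the parsed links whose logical state is Active."""
    links = (parse_link(line) for line in text.splitlines())
    return [link for link in links if link and link["state"] == "Active"]


# A few lines copied from the switch listings in this article.
SAMPLE = '''\
Switch: 0xe41d2d0300xxxxxx MF0;jfyl-ib-switches-1:SX6xxx/U1:
          12   17[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   18[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      36    2[  ] "node2 HCA-1" ( )
          12   23[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      41    2[  ] "node3 HCA-1" ( )
'''

for link in active_links(SAMPLE):
    print(link["port"], "->", link["peer_name"], "port", link["peer_port"])
```

Filtering out the Down/Polling ports this way leaves only the handful of links that matter, which makes it much easier to see which adapter port lands on which switch.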

Because the switches sit in the middle of the network, the cabling still has to be checked from every node in the cluster.

Below is the IB port cabling topology for all compute-layer nodes.

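On the compute-layer server node1, both ports of the HCA-1 adapter (mlx4_0) are connected to switch jfyl-ib-switches-1, and both ports of the HCA-2 adapter (mlx4_1) are connected to switch jfyl-ib-switches-2.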

#iblinkinfo -C mlx4_0 -P 1              # ==> iblinkinfo -C mlx4_0 -P 2
Switch: 0xe41d2d0300xxxxxx MF0;jfyl-ib-switches-1:SX6xxx/U1:
          12    1[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    2[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    3[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    4[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    5[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    6[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    7[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    8[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    9[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   10[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   11[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   12[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   13[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   14[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   15[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   16[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   17[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   18[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      36    2[  ] "node2 HCA-1" ( )
          12   19[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   20[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      38    1[  ] "node2 HCA-1" ( )
          12   21[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      37    2[  ] "node4 HCA-1" ( )
          12   22[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      39    1[  ] "node1 HCA-1" ( )
          12   23[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      41    2[  ] "node3 HCA-1" ( )
          12   24[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      40    2[  ] "node1 HCA-1" ( )
          12   25[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   26[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   27[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   28[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   29[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   30[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   31[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   32[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   33[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   34[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   35[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   36[  ] ==(                Down/ Polling)==>             [  ] "" ( )
CA: node1 HCA-1:
      0xf452140300xxxxxx     39    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      12   22[  ] "MF0;jfyl-ib-switches-1:SX6xxx/U1" ( )
      0xf452140300xxxxxx     40    2[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      12   24[  ] "MF0;jfyl-ib-switches-1:SX6xxx/U1" ( )

#iblinkinfo -C mlx4_1 -P 1              # ==> iblinkinfo -C mlx4_1 -P 2
Switch: 0x7cfe900300xxxxxx MF0;jfyl-ib-switches-2:SX6xxx/U1:
          11    1[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    2[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    3[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    4[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    5[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      23    1[  ] "node4 HCA-1" ( )
          11    6[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    7[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    8[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    9[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   10[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   11[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   12[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   13[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   14[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   15[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      28    2[  ] "node2 HCA-2" ( )
          11   16[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   17[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      26    1[  ] "node2 HCA-2" ( )
          11   18[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   19[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      36    1[  ] "node1 HCA-2" ( )
          11   20[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   21[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      29    2[  ] "node1 HCA-2" ( )
          11   22[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   23[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      19    1[  ] "node3 HCA-1" ( )
          11   24[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   25[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   26[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   27[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   28[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   29[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   30[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   31[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   32[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   33[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   34[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   35[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   36[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   37[  ] ==(                Down/ Polling)==>             [  ] "" ( )
CA: node1 HCA-2:
      0xf452140300xxxxxx     36    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      11   19[  ] "MF0;jfyl-ib-switches-2:SX6xxx/U1" ( )
      0xf452140300xxxxxx     29    2[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      11   21[  ] "MF0;jfyl-ib-switches-2:SX6xxx/U1" ( )

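On the compute-layer server node2, both ports of the HCA-1 adapter (mlx4_0) are connected to switch jfyl-ib-switches-1, and both ports of the HCA-2 adapter (mlx4_1) are connected to switch jfyl-ib-switches-2.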

#iblinkinfo -C mlx4_0 -P 1              # ==> iblinkinfo -C mlx4_0 -P 2
Switch: 0xe41d2d0300xxxxxx MF0;jfyl-ib-switches-1:SX6xxx/U1:
          12    1[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    2[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    3[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    4[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    5[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    6[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    7[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    8[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    9[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   10[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   11[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   12[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   13[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   14[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   15[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   16[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   17[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   18[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      36    2[  ] "node2 HCA-1" ( )
          12   19[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   20[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      38    1[  ] "node2 HCA-1" ( )
          12   21[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      37    2[  ] "node4 HCA-1" ( )
          12   22[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      39    1[  ] "node1 HCA-1" ( )
          12   23[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      41    2[  ] "node3 HCA-1" ( )
          12   24[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      40    2[  ] "node1 HCA-1" ( )
          12   25[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   26[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   27[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   28[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   29[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   30[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   31[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   32[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   33[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   34[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   35[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   36[  ] ==(                Down/ Polling)==>             [  ] "" ( )
CA: node2 HCA-1:
      0xf452140300xxxxxx     38    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      12   20[  ] "MF0;jfyl-ib-switches-1:SX6xxx/U1" ( )
      0xf452140300xxxxxx     36    2[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      12   18[  ] "MF0;jfyl-ib-switches-1:SX6xxx/U1" ( )

#iblinkinfo -C mlx4_1 -P 1              # ==> iblinkinfo -C mlx4_1 -P 2
Switch: 0x7cfe900300xxxxxx MF0;jfyl-ib-switches-2:SX6xxx/U1:
          11    1[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    2[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    3[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    4[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    5[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      23    1[  ] "node4 HCA-1" ( )
          11    6[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    7[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    8[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    9[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   10[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   11[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   12[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   13[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   14[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   15[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      28    2[  ] "node2 HCA-2" ( )
          11   16[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   17[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      26    1[  ] "node2 HCA-2" ( )
          11   18[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   19[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      36    1[  ] "node1 HCA-2" ( )
          11   20[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   21[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      29    2[  ] "node1 HCA-2" ( )
          11   22[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   23[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      19    1[  ] "node3 HCA-1" ( )
          11   24[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   25[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   26[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   27[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   28[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   29[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   30[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   31[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   32[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   33[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   34[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   35[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   36[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   37[  ] ==(                Down/ Polling)==>             [  ] "" ( )
CA: node2 HCA-2:
      0x24be05ffffxxxxxx     26    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      11   17[  ] "MF0;jfyl-ib-switches-2:SX6xxx/U1" ( )
      0x24be05ffffxxxxxx     28    2[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      11   15[  ] "MF0;jfyl-ib-switches-2:SX6xxx/U1" ( )

On the storage-layer server node3, port 1 of the HCA-1 adapter (mlx4_0) is connected to switch jfyl-ib-switches-2, and port 2 of the same adapter is connected to switch jfyl-ib-switches-1.

#iblinkinfo -C mlx4_0 -P 1
Switch: 0x7cfe900300xxxxxx MF0;jfyl-ib-switches-2:SX6xxx/U1:
          11    1[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    2[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    3[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    4[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    5[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      23    1[  ] "node4 HCA-1" ( )
          11    6[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    7[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    8[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    9[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   10[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   11[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   12[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   13[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   14[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   15[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      28    2[  ] "node2 HCA-2" ( )
          11   16[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   17[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      26    1[  ] "node2 HCA-2" ( )
          11   18[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   19[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      36    1[  ] "node1 HCA-2" ( )
          11   20[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   21[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      29    2[  ] "node1 HCA-2" ( )
          11   22[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   23[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      19    1[  ] "node3 HCA-1" ( )
          11   24[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   25[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   26[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   27[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   28[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   29[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   30[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   31[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   32[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   33[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   34[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   35[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   36[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   37[  ] ==(                Down/ Polling)==>             [  ] "" ( )
CA: node3 HCA-1:
      0x0002c90300xxxxxx     19    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      11   23[  ] "MF0;jfyl-ib-switches-2:SX6xxx/U1" ( )

#iblinkinfo -C mlx4_0 -P 2
Switch: 0xe41d2d0300xxxxxx MF0;jfyl-ib-switches-1:SX6xxx/U1:
          12    1[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    2[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    3[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    4[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    5[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    6[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    7[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    8[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    9[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   10[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   11[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   12[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   13[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   14[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   15[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   16[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   17[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   18[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      36    2[  ] "node2 HCA-1" ( )
          12   19[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   20[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      38    1[  ] "node2 HCA-1" ( )
          12   21[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      37    2[  ] "node4 HCA-1" ( )
          12   22[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      39    1[  ] "node1 HCA-1" ( )
          12   23[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      41    2[  ] "node3 HCA-1" ( )
          12   24[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      40    2[  ] "node1 HCA-1" ( )
          12   25[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   26[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   27[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   28[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   29[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   30[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   31[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   32[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   33[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   34[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   35[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   36[  ] ==(                Down/ Polling)==>             [  ] "" ( )
CA: node3 HCA-1:
      0x0002c90300xxxxxx     41    2[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      12   23[  ] "MF0;jfyl-ib-switches-1:SX6xxx/U1" ( )

On the storage-layer server node4, port 1 of the HCA-1 adapter (mlx4_0) is connected to switch jfyl-ib-switches-2, and port 2 of the same adapter is connected to switch jfyl-ib-switches-1.

#iblinkinfo -C mlx4_0 -P 1
Switch: 0x7cfe900300xxxxxx MF0;jfyl-ib-switches-2:SX6xxx/U1:
          11    1[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    2[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    3[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    4[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    5[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      23    1[  ] "node4 HCA-1" ( )
          11    6[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    7[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    8[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11    9[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   10[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   11[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   12[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   13[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   14[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   15[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      28    2[  ] "node2 HCA-2" ( )
          11   16[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   17[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      26    1[  ] "node2 HCA-2" ( )
          11   18[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   19[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      36    1[  ] "node1 HCA-2" ( )
          11   20[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   21[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      29    2[  ] "node1 HCA-2" ( )
          11   22[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   23[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      19    1[  ] "node3 HCA-1" ( )
          11   24[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   25[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   26[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   27[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   28[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   29[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   30[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   31[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   32[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   33[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   34[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   35[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   36[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          11   37[  ] ==(                Down/ Polling)==>             [  ] "" ( )
CA: node4 HCA-1:
      0x0002c90300xxxxxx     23    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      11    5[  ] "MF0;jfyl-ib-switches-2:SX6xxx/U1" ( )

#iblinkinfo -C mlx4_0 -P 2
Switch: 0xe41d2d0300xxxxxx MF0;jfyl-ib-switches-1:SX6xxx/U1:
          12    1[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    2[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    3[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    4[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    5[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    6[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    7[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    8[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12    9[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   10[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   11[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   12[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   13[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   14[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   15[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   16[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   17[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   18[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      36    2[  ] "node2 HCA-1" ( )
          12   19[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   20[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      38    1[  ] "node2 HCA-1" ( )
          12   21[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      37    2[  ] "node4 HCA-1" ( )
          12   22[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      39    1[  ] "node1 HCA-1" ( )
          12   23[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      41    2[  ] "node3 HCA-1" ( )
          12   24[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      40    2[  ] "node1 HCA-1" ( )
          12   25[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   26[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   27[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   28[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   29[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   30[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   31[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   32[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   33[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   34[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   35[  ] ==(                Down/ Polling)==>             [  ] "" ( )
          12   36[  ] ==(                Down/ Polling)==>             [  ] "" ( )
CA: node4 HCA-1:
      0x0002c90300xxxxxx     37    2[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>      12   21[  ] "MF0;jfyl-ib-switches-1:SX6xxx/U1" ( )

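Putting the IB cabling of the four servers side by side with the IP address of each port makes the problem easy to see. The compute-layer HCA-1 adapters are cabled to jfyl-ib-switches-1, and their ports carry the 192.168.22.x and 192.168.23.x subnets. Port 2 of the storage-layer HCA-1 adapters is cabled to jfyl-ib-switches-1 as well, but it carries the 192.168.24.x subnet. The only storage ports on that switch are therefore in a different subnet from every compute port on it, while the storage 192.168.22.x addresses sit on the other switch. A ping from a compute node to a storage 192.168.22.x address thus never finds its target, and the two layers cannot communicate.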

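The same mismatch shows up on the other switch. The compute-layer HCA-2 adapters are cabled to jfyl-ib-switches-2, and their ports carry 192.168.24.x and 192.168.25.x, whereas port 1 of the storage-layer HCA-1 adapters, the storage port on jfyl-ib-switches-2, carries 192.168.22.x. The compute 192.168.24.x ports and the storage 192.168.24.x ports therefore also end up on different switches, so that subnet cannot communicate across the layers either.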
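
The cross-check described above can be sketched in a few lines of Python. This is an illustration rather than the procedure actually used: the table is transcribed by hand from the listings in this article, the 192.168 prefix is assumed from the ping commands, the function name is hypothetical, and the check assumes the two switches are not linked to each other (no switch-to-switch link appears in the iblinkinfo output).

```python
from collections import defaultdict

SW1 = "jfyl-ib-switches-1"
SW2 = "jfyl-ib-switches-2"

# (node, HCA port, interface, subnet, switch), transcribed from the
# ibdev2netdev, ifconfig, and iblinkinfo output in this article.
# Assumption: the masked "xx.xx" prefix is 192.168, as in the ping commands.
PORTS = [
    ("node1", "mlx4_0/1", "ib0", "192.168.22.0/24", SW1),
    ("node1", "mlx4_0/2", "ib1", "192.168.23.0/24", SW1),
    ("node1", "mlx4_1/1", "ib2", "192.168.24.0/24", SW2),
    ("node1", "mlx4_1/2", "ib3", "192.168.25.0/24", SW2),
    ("node2", "mlx4_0/1", "ib0", "192.168.22.0/24", SW1),
    ("node2", "mlx4_0/2", "ib1", "192.168.23.0/24", SW1),
    ("node2", "mlx4_1/1", "ib2", "192.168.24.0/24", SW2),
    ("node2", "mlx4_1/2", "ib3", "192.168.25.0/24", SW2),
    ("node3", "mlx4_0/1", "ib0", "192.168.22.0/24", SW2),
    ("node3", "mlx4_0/2", "ib1", "192.168.24.0/24", SW1),
    ("node4", "mlx4_0/1", "ib0", "192.168.22.0/24", SW2),
    ("node4", "mlx4_0/2", "ib1", "192.168.24.0/24", SW1),
]


def subnets_split_across_switches(ports):
    """Return {subnet: {switch: [nodes]}} for subnets seen on several switches."""
    placement = defaultdict(lambda: defaultdict(list))
    for node, _hca_port, _iface, subnet, switch in ports:
        placement[subnet][switch].append(node)
    return {
        subnet: {switch: sorted(nodes) for switch, nodes in by_switch.items()}
        for subnet, by_switch in placement.items()
        if len(by_switch) > 1
    }


SPLIT = subnets_split_across_switches(PORTS)
for subnet, by_switch in sorted(SPLIT.items()):
    print(subnet, by_switch)
```

Only 192.168.22.0/24 and 192.168.24.0/24 are reported, each with the compute nodes on one switch and the storage nodes on the other, which matches the two subnets whose cross-layer pings fail.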

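With that, the cause of the network fault was found. After the IB cables between the storage layer and the switches were reconnected correctly, communication from the compute layer to the storage layer returned to normal.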
