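Overview
RDMA (Remote Direct Memory Access) is a network communication technology that lets computers transfer data directly from the memory of one machine to the memory of another, without involving the operating system or the CPU on the data path. In large-scale distributed training, RDMA reduces the latency that server-side data processing adds to network transfers, which gives high-throughput, low-latency communication and improves training efficiency.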
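Prerequisites
A cluster has been created, and it contains at least two GPU instances with RDMA networking.
The GPU instance image includes the OFED and NVIDIA drivers. The GPU images provided by Baidu AI Cloud are recommended: they already include the OFED driver, so no manual installation is needed.
The cluster has the Cloud Native AI components CCE RDMA Device Plugin, CCE GPU Manager, CCE AI Job Scheduler, and CCE Deep Learning Frameworks Operator installed.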
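Verification
Log in to a GPU node in the cluster that has RDMA networking, and run the following commands to verify the host environment.
$ ofed_info -s   # RoCE (OFED) driver version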
MLNX_OFED_LINUX-*.*-*.*.*.*:
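For scripted node checks, the one-line output of `ofed_info -s` can be turned into a comparable version value. The following is a minimal sketch, not part of the product tooling: the function name `parse_ofed_version` is hypothetical, and it assumes the output keeps the `MLNX_OFED_LINUX-<major>.<minor>-<build>:` shape shown above.

```python
import re

# Assumed shape of the `ofed_info -s` output: "MLNX_OFED_LINUX-<major>.<minor>-<build>:"
_OFED_RE = re.compile(r"^MLNX_OFED_LINUX-(\d+)\.(\d+)-([\d.]+):?\s*$")


def parse_ofed_version(line):
    """Return (major, minor, build) parsed from `ofed_info -s` output, or None."""
    match = _OFED_RE.match(line.strip())
    if match is None:
        # Unexpected text, for example a shell error when OFED is not installed.
        return None
    major, minor, build = match.groups()
    return int(major), int(minor), build
```

A caller would pass the captured standard output of `ofed_info -s` to this function and treat `None` as "OFED driver not detected".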
Verify the NVIDIA GPU driver
$ nvidia-smi   # NVIDIA GPU driver
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:53:00.0 Off | 0 |
| N/A 29C P0 64W / 400W | 0MiB / 81251MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:59:00.0 Off | 0 |
| N/A 32C P0 61W / 400W | 0MiB / 81251MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:6E:00.0 Off | 0 |
| N/A 33C P0 67W / 400W | 0MiB / 81251MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:73:00.0 Off | 0 |
| N/A 29C P0 60W / 400W | 0MiB / 81251MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:8D:00.0 Off | 0 |
| N/A 29C P0 60W / 400W | 0MiB / 81251MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... On | 00000000:92:00.0 Off | 0 |
| N/A 32C P0 65W / 400W | 0MiB / 81251MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... On | 00000000:C9:00.0 Off | 0 |
| N/A 33C P0 64W / 400W | 0MiB / 81251MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... On | 00000000:CF:00.0 Off | 0 |
| N/A 28C P0 62W / 400W | 0MiB / 81251MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
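The table above is meant for reading by eye. For an automated check, `nvidia-smi` can also print machine-readable CSV with `--query-gpu=index,name,driver_version --format=csv,noheader`. The sketch below parses that CSV text and collects the distinct driver versions. The helper names are hypothetical, and it assumes each row has exactly the three fields requested.

```python
import csv
import io


def parse_gpu_query(text):
    """Parse CSV rows of `index, name, driver_version` into a list of dicts."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        {"index": int(row[0]), "name": row[1].strip(), "driver": row[2].strip()}
        for row in rows
        if len(row) == 3  # skip blank or malformed lines
    ]


def driver_versions(gpus):
    """Return the set of distinct driver versions reported by the GPUs."""
    return {gpu["driver"] for gpu in gpus}
```

On a node like the one shown above, one would expect eight parsed entries and a single driver version; more than one version in the set suggests an inconsistent installation.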
Query the RDMA NICs
$ show_gids
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx5_0 1 0 fe80:0000:0000:0000:f820:20ff:fe28:c769 v1 eth0
mlx5_0 1 1 fe80:0000:0000:0000:f820:20ff:fe28:c769 v2 eth0
mlx5_0 1 2 0000:0000:0000:0000:0000:ffff:0a00:3c03 10.0.60.3 v1 eth0
mlx5_0 1 3 0000:0000:0000:0000:0000:ffff:0a00:3c03 10.0.60.3 v2 eth0
mlx5_1 1 0 fe80:0000:0000:0000:eaeb:d3ff:fecc:c920 v1 eth1
mlx5_1 1 1 fe80:0000:0000:0000:eaeb:d3ff:fecc:c920 v2 eth1
mlx5_1 1 2 0000:0000:0000:0000:0000:ffff:190b:8002 25.11.128.2 v1 eth1
mlx5_1 1 3 0000:0000:0000:0000:0000:ffff:190b:8002 25.11.128.2 v2 eth1
mlx5_2 1 0 fe80:0000:0000:0000:eaeb:d3ff:fecc:c921 v1 eth2
mlx5_2 1 1 fe80:0000:0000:0000:eaeb:d3ff:fecc:c921 v2 eth2
mlx5_2 1 2 0000:0000:0000:0000:0000:ffff:190b:8022 25.11.128.34 v1 eth2
mlx5_2 1 3 0000:0000:0000:0000:0000:ffff:190b:8022 25.11.128.34 v2 eth2
mlx5_3 1 0 fe80:0000:0000:0000:eaeb:d3ff:fe6c:51d2 v1 eth3
mlx5_3 1 1 fe80:0000:0000:0000:eaeb:d3ff:fe6c:51d2 v2 eth3
mlx5_3 1 2 0000:0000:0000:0000:0000:ffff:190b:8042 25.11.128.66 v1 eth3
mlx5_3 1 3 0000:0000:0000:0000:0000:ffff:190b:8042 25.11.128.66 v2 eth3
mlx5_4 1 0 fe80:0000:0000:0000:eaeb:d3ff:fe6c:51d3 v1 eth4
mlx5_4 1 1 fe80:0000:0000:0000:eaeb:d3ff:fe6c:51d3 v2 eth4
mlx5_4 1 2 0000:0000:0000:0000:0000:ffff:190b:8062 25.11.128.98 v1 eth4
mlx5_4 1 3 0000:0000:0000:0000:0000:ffff:190b:8062 25.11.128.98 v2 eth4
mlx5_5 1 0 fe80:0000:0000:0000:eaeb:d3ff:fe33:1366 v1 eth5
mlx5_5 1 1 fe80:0000:0000:0000:eaeb:d3ff:fe33:1366 v2 eth5
mlx5_5 1 2 0000:0000:0000:0000:0000:ffff:190b:8082 25.11.128.130 v1 eth5
mlx5_5 1 3 0000:0000:0000:0000:0000:ffff:190b:8082 25.11.128.130 v2 eth5
mlx5_6 1 0 fe80:0000:0000:0000:eaeb:d3ff:fe33:1367 v1 eth6
mlx5_6 1 1 fe80:0000:0000:0000:eaeb:d3ff:fe33:1367 v2 eth6
mlx5_6 1 2 0000:0000:0000:0000:0000:ffff:190b:80a2 25.11.128.162 v1 eth6
mlx5_6 1 3 0000:0000:0000:0000:0000:ffff:190b:80a2 25.11.128.162 v2 eth6
mlx5_7 1 0 fe80:0000:0000:0000:eaeb:d3ff:fe6c:68ae v1 eth7
mlx5_7 1 1 fe80:0000:0000:0000:eaeb:d3ff:fe6c:68ae v2 eth7
mlx5_7 1 2 0000:0000:0000:0000:0000:ffff:190b:80c2 25.11.128.194 v1 eth7
mlx5_7 1 3 0000:0000:0000:0000:0000:ffff:190b:80c2 25.11.128.194 v2 eth7
mlx5_8 1 0 fe80:0000:0000:0000:eaeb:d3ff:fe6c:68af v1 eth8
mlx5_8 1 1 fe80:0000:0000:0000:eaeb:d3ff:fe6c:68af v2 eth8
mlx5_8 1 2 0000:0000:0000:0000:0000:ffff:190b:80e2 25.11.128.226 v1 eth8
mlx5_8 1 3 0000:0000:0000:0000:0000:ffff:190b:80e2 25.11.128.226 v2 eth8
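In this output, each device exposes four GID entries: a link-local GID and an IPv4-mapped GID, each in RoCE v1 and v2 form. The sketch below shows one way to find, per device, the index of the RoCE v2 entry that carries an IPv4 address, and to confirm that a GID of the form `::ffff:a.b.c.d` decodes to the address in the IPv4 column. All function names are hypothetical, and the parser assumes whitespace-separated columns as printed above.

```python
import ipaddress
from collections import namedtuple

Gid = namedtuple("Gid", "dev port index gid ipv4 ver netdev")

# A few rows copied from the output above.
SAMPLE = """\
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx5_0 1 0 fe80:0000:0000:0000:f820:20ff:fe28:c769 v1 eth0
mlx5_0 1 3 0000:0000:0000:0000:0000:ffff:0a00:3c03 10.0.60.3 v2 eth0
mlx5_1 1 2 0000:0000:0000:0000:0000:ffff:190b:8002 25.11.128.2 v1 eth1
mlx5_1 1 3 0000:0000:0000:0000:0000:ffff:190b:8002 25.11.128.2 v2 eth1
"""


def parse_show_gids(text):
    """Parse `show_gids` rows; rows without an IPv4 column get ipv4=None."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have 6 or 7 fields with numeric PORT and INDEX;
        # this skips the header, the dashed separator, and blank lines.
        if len(parts) not in (6, 7) or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        if len(parts) == 6:
            dev, port, index, gid, ver, netdev = parts
            ipv4 = None
        else:
            dev, port, index, gid, ipv4, ver, netdev = parts
        entries.append(Gid(dev, int(port), int(index), gid, ipv4, ver, netdev))
    return entries


def rocev2_ipv4_index(entries):
    """Map device name to the GID index of its RoCE v2 entry with an IPv4 address."""
    return {e.dev: e.index for e in entries if e.ver == "v2" and e.ipv4}


def gid_to_ipv4(gid):
    """Decode an IPv4-mapped GID to dotted-quad form; return None for other GIDs."""
    mapped = ipaddress.IPv6Address(gid).ipv4_mapped
    return str(mapped) if mapped else None
```

For the sample rows, `rocev2_ipv4_index(parse_show_gids(SAMPLE))` selects index 3 for both devices, which matches the `v2` rows with an IPv4 address in the full listing.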
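Using NCCL
NCCL is NVIDIA's collective communication library. It provides both collective and point-to-point communication and has RDMA support built in. NCCL also selects an optimal communication path on its own, based on the NIC types and topology in the environment. Mainstream distributed training frameworks already support NCCL.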
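Because NCCL chooses its path automatically, explicit settings are usually optional. When diagnosing a job, it can still help to pin the choice with NCCL's documented environment variables (`NCCL_DEBUG`, `NCCL_IB_DISABLE`, `NCCL_IB_HCA`, `NCCL_IB_GID_INDEX`, `NCCL_SOCKET_IFNAME`). The sketch below builds such a set of overrides. The helper name `nccl_env` is hypothetical, and the example values are assumptions read off the `show_gids` output above: `mlx5_1` to `mlx5_8` as the RDMA NICs, GID index 3 as the RoCE v2 IPv4 entry, and `eth0` as the interface for bootstrap traffic. Verify them against your own nodes.

```python
def nccl_env(hcas, gid_index, socket_ifname, debug="INFO"):
    """Build NCCL environment overrides as a dict of strings."""
    return {
        "NCCL_DEBUG": debug,                  # log the transport NCCL actually picks
        "NCCL_IB_DISABLE": "0",               # keep the IB/RoCE transport enabled
        "NCCL_IB_HCA": ",".join(hcas),        # restrict NCCL to these RDMA devices
        "NCCL_IB_GID_INDEX": str(gid_index),  # GID table entry to use for RoCE
        "NCCL_SOCKET_IFNAME": socket_ifname,  # interface for bootstrap/socket traffic
    }


# Assumed values for a node like the one shown earlier in this guide.
example_env = nccl_env([f"mlx5_{i}" for i in range(1, 9)], 3, "eth0")
```

These values must be in the process environment before the training framework initializes NCCL, for example through `os.environ.update(example_env)` at the top of the training script or through the `env` section of the job's container spec.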