Problem Record
mpi4py is installed inside the image. After starting a container from that image, run the following command:
python3 -c "from mpi4py import MPI"
It fails with the following error:
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
opal_init:startup:internal-failure
But I couldn't open the help file:
/build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi/share/openmpi/help-opal-runtime.txt: No such file or directory. Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
orte_init:startup:internal-failure
But I couldn't open the help file:
/build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi/share/openmpi/help-orte-runtime: No such file or directory. Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
mpi_init:startup:internal-failure
But I couldn't open the help file:
/build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi/share/openmpi/help-mpi-runtime.txt: No such file or directory. Sorry!
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
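Errors like this are usually caused by a change in environment variables. If the variables are bound to the image, starting the container as a different user can leave that user without them, which is why the import reports that the related files do not exist.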
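One way to confirm this diagnosis is to inspect the current user's environment before importing mpi4py. The sketch below is only an illustration: `check_mpi_env` is a hypothetical helper name, and it assumes (based on the paths in the error output above) that Open MPI looks for its help files under `$OPAL_PREFIX/share/openmpi`.

```python
import os


def check_mpi_env(environ=None):
    """Return a list of human-readable problems found in the MPI environment."""
    env = os.environ if environ is None else environ
    problems = []

    prefix = env.get("OPAL_PREFIX")
    if not prefix:
        problems.append("OPAL_PREFIX is not set")
    else:
        # Assumption: help files live under <prefix>/share/openmpi, as the
        # paths in the error output suggest.
        help_file = os.path.join(prefix, "share", "openmpi", "help-opal-runtime.txt")
        if not os.path.isfile(help_file):
            problems.append("help file not found: " + help_file)

    if not env.get("LD_LIBRARY_PATH"):
        problems.append("LD_LIBRARY_PATH is not set or empty")

    return problems


if __name__ == "__main__":
    # Print each problem, or a fallback message when the list is empty.
    for problem in check_mpi_env() or ["no obvious problem found"]:
        print(problem)
```

Running it as the same user that triggers the error shows whether the variables are missing for that user.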
Solution
Add the corresponding environment variables:
export LD_LIBRARY_PATH=/usr/local/mpi/lib:$LD_LIBRARY_PATH
export OPAL_PREFIX=/opt/hpcx/ompi/
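The dynamic loader reads `LD_LIBRARY_PATH` when a process starts, so setting it from inside an already running Python interpreter is generally too late. A minimal sketch for trying the fix without editing the shell is to run the import in a child process with the two variables applied. The helper names `build_mpi_env` and `try_import_mpi4py` are hypothetical; the paths are the ones used in the commands above.

```python
import os
import subprocess
import sys


def build_mpi_env(base=None, lib_dir="/usr/local/mpi/lib", opal_prefix="/opt/hpcx/ompi/"):
    """Return a copy of the environment with the two MPI variables applied."""
    env = dict(os.environ if base is None else base)
    old = env.get("LD_LIBRARY_PATH", "")
    # Prepend the MPI library directory, avoiding a trailing ':' when the
    # variable was previously empty.
    env["LD_LIBRARY_PATH"] = lib_dir + (":" + old if old else "")
    env["OPAL_PREFIX"] = opal_prefix
    return env


def try_import_mpi4py(env):
    """Run the import from this note in a child process; True on exit code 0."""
    result = subprocess.run(
        [sys.executable, "-c", "from mpi4py import MPI"],
        env=env,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Calling `try_import_mpi4py(build_mpi_env())` inside the container returns `True` when the import succeeds with these values, which suggests the same values are worth making permanent.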
For this kind of problem, the best fix is to add the relevant environment variables to the ~/.bashrc file inside the image.