1. Preparation
Install CUDA and set the environment variables:
export GPU_INCLUDE_PATH=/usr/local/cuda-11.2/include
export GPU_LIBRARY_PATH=/usr/local/cuda-11.2/lib64
export PATH="/usr/local/cuda-11.2/bin:$PATH"
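Before building, it can help to confirm that these variables point at real directories and that nvcc is reachable. The sketch below is one hedged way to do that; the function name `check_cuda_env` is hypothetical and not part of CUDA or AutoDock-GPU.

```shell
#!/bin/sh
# check_cuda_env (hypothetical helper): report whether the two GPU_* variables
# are set to existing directories and whether nvcc can be found on PATH.
check_cuda_env() {
    rc=0
    for var in GPU_INCLUDE_PATH GPU_LIBRARY_PATH; do
        # Indirectly read the variable whose name is stored in $var.
        eval "dir=\${$var:-}"
        if [ -z "$dir" ]; then
            echo "$var is not set"
            rc=1
        elif [ ! -d "$dir" ]; then
            echo "$var points to a missing directory: $dir"
            rc=1
        else
            echo "$var OK: $dir"
        fi
    done
    if command -v nvcc >/dev/null 2>&1; then
        echo "nvcc found: $(command -v nvcc)"
    else
        echo "nvcc not found on PATH"
        rc=1
    fi
    return $rc
}
```

Calling `check_cuda_env` after the exports prints one line per check and returns non-zero if anything is missing.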
Compilation issues on different GPU devices
Choose the compute capability that matches your GPU; see:
https://en.wikipedia.org/wiki/CUDA#GPUs_supported
For example, the RTX 3090 has compute capability 8.6, so TARGETS should be 86.
You can either edit the line below or pass TARGETS="86" as an argument to make:
TARGETS = 52 60 61 70 86
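The mapping from compute capability to a TARGETS value is just the version number with the dot removed. A minimal sketch, assuming a hypothetical helper named `cc_to_target`:

```shell
# cc_to_target (hypothetical helper): turn a compute capability such as "8.6"
# into the corresponding TARGETS value such as "86" by dropping the dot.
cc_to_target() {
    printf '%s\n' "$1" | tr -d '.'
}

# Assumption: on sufficiently recent NVIDIA drivers the capability can be queried with
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader
# On older drivers, look the value up in the table linked above instead.
```

For instance, `make DEVICE=CUDA TARGETS="$(cc_to_target 8.6)"` passes TARGETS="86" to make.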
- Related issues
https://github.com/ccsb-scripps/AutoDock-GPU/issues/172
https://github.com/ccsb-scripps/AutoDock-GPU/issues/180
https://github.com/ccsb-scripps/AutoDock-GPU/issues/249
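Check whether nvcc supports the target compute capability
In the output below, nvcc lists compute_86 among the allowed values, so TARGETS="86" can be used. If the value is not listed, make fails with an error.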
$ nvcc --help
...
--gpu-code <code>,... (-code)
Specify the name of the NVIDIA GPU to assemble and optimize PTX for.
nvcc embeds a compiled code image in the resulting executable for each specified
<code> architecture, which is a true binary load image for each 'real' architecture
(such as sm_50), and PTX code for the 'virtual' architecture (such as compute_50).
During runtime, such embedded PTX code is dynamically compiled by the CUDA
runtime system if no binary load image is found for the 'current' GPU.
Architectures specified for options '--gpu-architecture' and '--gpu-code'
may be 'virtual' as well as 'real', but the <code> architectures must be
compatible with the <arch> architecture. When the '--gpu-code' option is
used, the value for the '--gpu-architecture' option must be a 'virtual' PTX
architecture.
For instance, '--gpu-architecture=compute_60' is not compatible with '--gpu-code=sm_52',
because the earlier compilation stages will assume the availability of 'compute_60'
features that are not present on 'sm_52'.
Note: the values compute_30, compute_32, compute_35, compute_37, compute_50,
sm_30, sm_32, sm_35, sm_37 and sm_50 are deprecated and may be removed in
a future release.
Allowed values for this option: 'compute_35','compute_37','compute_50',
'compute_52','compute_53','compute_60','compute_61','compute_62','compute_70',
'compute_72','compute_75','compute_80','compute_86','lto_35','lto_37','lto_50',
'lto_52','lto_53','lto_60','lto_61','lto_62','lto_70','lto_72','lto_75',
'lto_80','lto_86','sm_35','sm_37','sm_50','sm_52','sm_53','sm_60','sm_61',
'sm_62','sm_70','sm_72','sm_75','sm_80','sm_86'.
...
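Rather than reading the help text by eye, the same check can be scripted. The sketch below searches the help output for the quoted `'compute_NN'` entry, assuming the list is formatted as shown above; `nvcc_supports` is a hypothetical name.

```shell
# nvcc_supports (hypothetical helper): read nvcc help text on stdin and succeed
# only if it lists 'compute_<target>' among the allowed values.
nvcc_supports() {
    grep -q "'compute_$1'"
}
```

Usage: `nvcc --help | nvcc_supports 86 && echo "TARGETS=86 is supported"`.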
2. Compilation
The first step is to set the environment variables GPU_INCLUDE_PATH and GPU_LIBRARY_PATH, as described here: https://github.com/ccsb-scripps/AutoDock-GPU/wiki/Guideline-for-users
- Template
make DEVICE=<TYPE> NUMWI=<NWI>
- Example
make DEVICE=CUDA NUMWI=256 OVERLAP=ON TARGETS="86"
Parameters | Description | Values |
---|---|---|
<TYPE> | Accelerator chosen | CPU, GPU, CUDA, OCLGPU |
<NWI> | Work-group/thread block size: number of work-items (wi) | 1, 2, 4, 8, 16, 32, 64, 128, 256 |
When DEVICE=GPU is chosen, the Makefile automatically tests whether it can compile CUDA successfully. To override, use DEVICE=CUDA or DEVICE=OCLGPU. The CPU target is only supported using OpenCL. Furthermore, an OpenMP-enabled overlapped pipeline (for setup and processing) can be compiled with OVERLAP=ON.
Hints: The best work-group size depends on the GPU and workload. Try NUMWI=128 or NUMWI=64 for modern cards with the example workloads. On macOS, use NUMWI=1 for CPUs.
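Since the best work-group size depends on the card, one option is to build several variants and compare them on your own workload. The sketch below loops over a few NUMWI values using only the make variables shown above; `build_variants` and the `DRY_RUN` switch are hypothetical, and whether a clean step is needed between builds is not covered here.

```shell
# build_variants (hypothetical helper): build one CUDA binary per work-group size.
# First argument: TARGETS value; remaining arguments: NUMWI values.
# Set DRY_RUN=1 to print the make commands instead of running them.
build_variants() {
    target="$1"
    shift
    for nwi in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "make DEVICE=CUDA NUMWI=$nwi TARGETS=\"$target\""
        else
            # Stop at the first failing build.
            make DEVICE=CUDA NUMWI="$nwi" TARGETS="$target" || return 1
        fi
    done
}
```

For example, `build_variants 86 64 128 256` would run three builds in a row from the source directory.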
After successful compilation, the host binary autodock_<type>_<N>wi is placed under bin.
Binary-name portion | Description | Values |
---|---|---|
<type> | Accelerator chosen | cpu, gpu |
<N> | Work-group/thread block size | 1, 2, 4, 8, 16, 32, 64, 128, 256 |
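The naming scheme in the table can be expressed as a small helper, which is handy when a script needs to locate the binary for a given build. This is only an illustration of the pattern above; `binary_name` is a hypothetical function.

```shell
# binary_name (hypothetical helper): compose the binary name from the
# accelerator type (cpu or gpu) and the work-group size.
binary_name() {
    printf 'autodock_%s_%swi\n' "$1" "$2"
}
```

With it, `./bin/$(binary_name gpu 256)` refers to the 256-work-item GPU build.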
3. Testing
$ ./autodock_gpu_256wi --lfile /home/shuzhang/ai/code/moldock/autodock/output/tmpnnuuab_g.pdbqt --ffile /data/autodock/grid/0cb544cb1474ff6d917fe409598886cb/protein.maps.fld --devnum 2 --ngen 1 --nrun 2 --stopstd 1.999
AutoDock-GPU version: v1.5.3-73-gf5cf6ffdd0c5b3f113d5cc424fabee51df04da7e
Running 1 docking calculation
Cuda device: NVIDIA GeForce RTX 3090 (#2 / 6)
Available memory on device: 21182 MB (total: 24268 MB)
CUDA Setup time 0.248527s
(Thread 52 is setting up Job #1)
Running Job #1
Using heuristics: (capped) number of evaluations set to 6122449
Warning: The set number of evals is 48.98% of the uncapped heuristics estimate of 12500000 evals.
This means this docking may not be able to converge. Increasing --heurmax may improve
convergence but will also increase runtime.
AutoStop will not stop before 10.50% (643004) of the set number of evaluations.
Local-search chosen method is: ADADELTA (ad)
Rest of Setup time 0.006511s
Executing docking runs, stopping automatically after either reaching 2.00 kcal/mol standard deviation of
the best molecules of the last 4 * 5 generations, 1 generations, or 6122449 evaluations:
Generations | Evaluations | Threshold | Average energy of best 10% | Samples | Best Inter + Intra
------------+--------------+------------------+------------------------------+---------+-------------------
0 | 150 | 1145.95 kcal/mol | 312.45 +/- 222.27 kcal/mol | 4 | 80.37 kcal/mol
1 | 9167 | 1145.95 kcal/mol | 91.60 +/- 157.57 kcal/mol | 56 | -7.85 kcal/mol
------------+--------------+------------------+------------------------------+---------+-------------------
Finished evaluation after reaching
9167 evaluations. Best inter + intra -7.85 kcal/mol.
Docking time 0.002292s
Shutdown time 0.002002s
Job #1 took 0.011 sec after waiting 0.347 sec for setup
(Thread 52 is processing Job #1)
Run time of entire job set (1 file): 0.360 sec
Processing time: 0.002 sec
All jobs ran without errors.