Introduction to Tengine
OAID/Tengine | GitHub
Tengine is a lightweight, high-performance, modular inference engine developed by OPEN AI LAB for embedded devices.
On embedded devices, Tengine provides a computing framework with heterogeneous CPU/GPU/DLA-NPU/DSP support. Its main features include a scheduler for heterogeneous computing; efficient compute-library implementations for the ARM platform; performance optimizations for specific hardware platforms; dynamically planned memory usage for the computation graph; access to remote AI compute capability over the network; multiple levels of parallelism; fully detachable system modules; an event-driven computation model; and a newly designed computation-graph representation that draws on the strengths of existing AI frameworks.
Installing Tengine
Setting up the build environment
sudo apt install git cmake
sudo apt install libprotobuf-dev protobuf-compiler libboost-all-dev libgoogle-glog-dev
Downloading the source and configuring build options
git clone https://github.com/OAID/tengine.git
cd ~/tengine
cp makefile.config.example makefile.config
vim makefile.config
Modify it as follows:
# Set the target arch
CONFIG_ARCH_ARM64=y
# Enable Compiling Optimization
CONFIG_OPT_CFLAGS=-O2
# Use BLAS as the operator implementation
CONFIG_ARCH_BLAS=y
# Enable GPU support by Arm Computing Library
# CONFIG_ACL_GPU=y
# Set the path of ACL
# ACL_ROOT=/home/firefly/ComputeLibrary
# Enable other serializers
CONFIG_CAFFE_SERIALIZER=y
CONFIG_MXNET_SERIALIZER=y
CONFIG_ONNX_SERIALIZER=y
CONFIG_TF_SERIALIZER=y
CONFIG_TENGINE_SERIALIZER=y
Linking OpenCV
Tengine needs to link against OpenCV. Generate the linker configuration as follows:
sudo vim /etc/ld.so.conf
Add the OpenCV install path to the opened file:
/usr/local/lib
Then run sudo ldconfig to complete the configuration.
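To verify that the dynamic linker can now resolve the OpenCV libraries, one option is to query the linker cache from Python via ctypes.util.find_library. This is a sketch; the library names passed in are examples:

```python
# Check whether the dynamic linker can resolve a shared library,
# e.g. after adding /usr/local/lib to /etc/ld.so.conf and running ldconfig.
from ctypes.util import find_library

def lib_in_cache(name):
    """Return the soname the linker resolves `name` to, or None if not found."""
    return find_library(name)

if __name__ == "__main__":
    # Example library names; adjust to the modules actually installed.
    for name in ("opencv_core", "opencv_imgproc"):
        path = lib_in_cache(name)
        print(name, "->", path if path else "NOT FOUND (re-check /etc/ld.so.conf and ldconfig)")
```

If a library prints NOT FOUND, re-check the path added to /etc/ld.so.conf and re-run sudo ldconfig.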
If the OpenCV libraries under /usr/local/lib are all versioned ".so" files, Tengine will not be able to link against them, and unversioned links must be created. The following script generates the ".so" symlinks:
# coding=utf-8
# Create unversioned ".so" symlinks for versioned OpenCV libraries,
# e.g. libopencv_text.so -> libopencv_text.so.3.4.5
import os

file_list = []

def listdir(folder, file_list):
    for name in os.listdir(folder):
        filepath = os.path.join(folder, name)
        if os.path.isfile(filepath):
            file_list.append(name)

def ChangeFileName(folder, file_list):
    for old_file_name in file_list:
        new_file_name = old_file_name.replace(".so.3.4.5", ".so")
        if new_file_name == old_file_name:
            continue
        if os.path.exists(os.path.join(folder, new_file_name)):
            print("file exists: " + new_file_name)
        else:
            cmd = "sudo ln -s " + old_file_name + " " + new_file_name
            print(cmd)
            os.system(cmd)

folder = os.getcwd()
print(folder)
listdir(folder, file_list)
ChangeFileName(folder, file_list)
Save the script as "filename.py", copy it to /usr/local/lib, and run it with sudo python3 filename.py to create the symlinks. As shown below, library files ending in .so are now linked to the .so.3.4.5 library files:
-rw-r--r-- 1 root root 365064 Jan 31 06:28 libopencv_surface_matching.so.3.4.5
lrwxrwxrwx 1 root root 23 Feb 19 00:09 libopencv_text.so -> libopencv_text.so.3.4.5
lrwxrwxrwx 1 root root 23 Feb 15 06:00 libopencv_text.so.3.4 -> libopencv_text.so.3.4.5
-rw-r--r-- 1 root root 428928 Jan 31 07:06 libopencv_text.so.3.4.5
lrwxrwxrwx 1 root root 27 Feb 19 00:08 libopencv_tracking.so -> libopencv_tracking.so.3.4.5
lrwxrwxrwx 1 root root 27 Feb 15 06:00 libopencv_tracking.so.3.4 -> libopencv_tracking.so.3.4.5
-rw-r--r-- 1 root root 2336240 Jan 31 07:38 libopencv_tracking.so.3.4.5
lrwxrwxrwx 1 root root 26 Feb 19 00:33 libopencv_videoio.so -> libopencv_videoio.so.3.4.5
lrwxrwxrwx 1 root root 26 Feb 15 06:00 libopencv_videoio.so.3.4 -> libopencv_videoio.so.3.4.5
-rw-r--r-- 1 root root 369296 Jan 31 05:55 libopencv_videoio.so.3.4.5
lrwxrwxrwx 1 root root 24 Feb 19 00:10 libopencv_video.so -> libopencv_video.so.3.4.5
lrwxrwxrwx 1 root root 24 Feb 15 06:00 libopencv_video.so.3.4 -> libopencv_video.so.3.4.5
-rw-r--r-- 1 root root 423112 Jan 31 06:29 libopencv_video.so.3.4.5
lrwxrwxrwx 1 root root 28 Feb 19 03:53 libopencv_videostab.so -> libopencv_videostab.so.3.4.5
lrwxrwxrwx 1 root root 28 Feb 15 06:00 libopencv_videostab.so.3.4 -> libopencv_videostab.so.3.4.5
-rw-r--r-- 1 root root 365104 Jan 31 07:40 libopencv_videostab.so.3.4.5
lrwxrwxrwx 1 root root 30 Feb 18 15:08 libopencv_xfeatures2d.so -> libopencv_xfeatures2d.so.3.4.5
lrwxrwxrwx 1 root root 30 Feb 15 06:00 libopencv_xfeatures2d.so.3.4 -> libopencv_xfeatures2d.so.3.4.5
-rw-r--r-- 1 root root 2836736 Jan 31 07:43 libopencv_xfeatures2d.so.3.4.5
lrwxrwxrwx 1 root root 27 Feb 19 00:11 libopencv_ximgproc.so -> libopencv_ximgproc.so.3.4.5
lrwxrwxrwx 1 root root 27 Feb 15 06:00 libopencv_ximgproc.so.3.4 -> libopencv_ximgproc.so.3.4.5
-rw-r--r-- 1 root root 1299184 Jan 31 07:48 libopencv_ximgproc.so.3.4.5
lrwxrwxrwx 1 root root 29 Feb 19 00:11 libopencv_xobjdetect.so -> libopencv_xobjdetect.so.3.4.5
lrwxrwxrwx 1 root root 29 Feb 15 06:00 libopencv_xobjdetect.so.3.4 -> libopencv_xobjdetect.so.3.4.5
-rw-r--r-- 1 root root 99232 Jan 31 07:51 libopencv_xobjdetect.so.3.4.5
lrwxrwxrwx 1 root root 25 Feb 19 03:53 libopencv_xphoto.so -> libopencv_xphoto.so.3.4.5
lrwxrwxrwx 1 root root 25 Feb 15 06:00 libopencv_xphoto.so.3.4 -> libopencv_xphoto.so.3.4.5
-rw-r--r-- 1 root root 242880 Jan 31 06:32 libopencv_xphoto.so.3.4.5
Building and testing
Build in the tengine directory, then run the test programs once the build finishes:
sudo make
sudo make install
./build/tests/bin/bench_sqz
run-time library version: 1.0.0-github
REPEAT COUNT= 100
Repeat [100] time 55990.35 us per RUN. used 5599035 us
0.2763 - "n02123045 tabby, tabby cat"
0.2673 - "n02123159 tiger cat"
0.1766 - "n02119789 kit fox, Vulpes macrotis"
0.0827 - "n02124075 Egyptian cat"
0.0777 - "n02085620 Chihuahua"
--------------------------------------
ALL TEST DONE
./build/tests/bin/bench_mobilenet
run-time library version: 1.0.0-github
REPEAT COUNT= 100
Repeat [100] time 56649.14 us per RUN. used 5664914 us
8.5976 - "n02123159 tiger cat"
7.9550 - "n02119022 red fox, Vulpes vulpes"
7.8679 - "n02119789 kit fox, Vulpes macrotis"
7.4274 - "n02113023 Pembroke, Pembroke Welsh corgi"
6.3646 - "n02123045 tabby, tabby cat"
ALL TEST DONE
Running the bundled MobileNet SSD
Building the example code
The example folder in the tengine directory contains a mobilenet_ssd subdirectory. Open its CMakeLists.txt and, before the line set( INSTALL_DIR ${TENGINE_DIR}/install/), add a statement setting the value of TENGINE_DIR:
set( TENGINE_DIR ~/work/Tengine )
Downloading the model files
Tengine provides pre-trained models for download: Tengine_models | Baidu Cloud (extraction code: 57vb).
Find the mobilenet_ssd folder there, download MobileNetSSD_deploy.prototxt and MobileNetSSD_deploy.caffemodel, and put them in the ./models directory. The following commands fetch models that were placed in an FTP folder onto the target board:
wget ftp://192.168.199.1/sda1/MobileNetSSD_deploy.prototxt --ftp-user=root --ftp-password="password"
wget ftp://192.168.199.1/sda1/MobileNetSSD_deploy.caffemodel --ftp-user=root --ftp-password="password"
Building
Build the MobileNet SSD example:
cd ~/work/Tengine/examples/mobilenet_ssd
cmake .
make
./MSSD -i test.jpg
/home/dolphin/work/tengine/examples/mobilenet_ssd/MSSD
proto file not specified,using /home/dolphin/work/tengine/models/MobileNetSSD_deploy.prototxt by default
model file not specified,using /home/dolphin/work/tengine/models/MobileNetSSD_deploy.caffemodel by default
--------------------------------------
repeat 1 times, avg time per run is 118.913 ms
detect result num: 6
dog :99%
BOX:( 322.588 , 232.231 ),( 455.996 , 330.833 )
person :99%
BOX:( 213.043 , 153.082 ),( 310.846 , 322.655 )
person :96%
BOX:( 536.058 , 76.8777 ),( 709.835 , 391.781 )
dog :90%
BOX:( 177.256 , 296.386 ),( 258.995 , 461.81 )
person :89%
BOX:( 499.474 , 72.645 ),( 619.208 , 369.286 )
person :74%
BOX:( 149.663 , 130.89 ),( 217.314 , 245.324 )
======================================
[DETECTED IMAGE SAVED]: save.jpg
======================================
After the run, an annotated image "save.jpg" with the detection boxes drawn on it is saved.
Running MobileNet SSD on video
Modify the code of mssd.cpp as follows:
#include <unistd.h>
#include <iostream>
#include <iomanip>
#include <string>
#include <vector>
#include <cstdlib>
#include <cstring>
#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"
#include "tengine_c_api.h"
#include <sys/time.h>
#include <stdio.h>
#include "common.hpp"
#include <pthread.h>
#include <sched.h>

#define DEF_PROTO "../../models/MobileNetSSD_deploy.prototxt"
#define DEF_MODEL "../../models/MobileNetSSD_deploy.caffemodel"

struct Box
{
    float x0;
    float y0;
    float x1;
    float y1;
    int class_idx;
    float score;
};

void get_input_data_ssd(cv::Mat img, float* input_data, int img_h, int img_w)
{
    if (img.empty())
    {
        std::cerr << "Failed to read image from camera.\n";
        return;
    }
    cv::resize(img, img, cv::Size(img_h, img_w));
    img.convertTo(img, CV_32FC3);
    float* img_data = (float*)img.data;
    int hw = img_h * img_w;
    float mean[3] = {127.5, 127.5, 127.5};
    // convert interleaved HWC pixels to normalized CHW input
    for (int h = 0; h < img_h; h++)
    {
        for (int w = 0; w < img_w; w++)
        {
            for (int c = 0; c < 3; c++)
            {
                input_data[c * hw + h * img_w + w] = 0.007843 * (*img_data - mean[c]);
                img_data++;
            }
        }
    }
}

void post_process_ssd(cv::Mat img, float threshold, float* outdata, int num)
{
#if 0
    const char* class_names[] = {"background",
                                 "airplane", "bicycle", "bird", "boat",
                                 "bus", "car", "chair", "dog", "motorcycle",
                                 "panther", "tiger"};
#else
    const char* class_names[] = {"background", "aeroplane", "bicycle", "bird", "boat", "bottle",
                                 "bus", "car", "cat", "chair", "cow", "diningtable",
                                 "dog", "horse", "motorbike", "person", "pottedplant", "sheep",
                                 "sofa", "train", "tvmonitor"};
#endif
    int raw_h = img.size().height;
    int raw_w = img.size().width;
    std::vector<Box> boxes;
    int line_width = raw_w * 0.002;
    printf("detect result num: %d \n", num);
    for (int i = 0; i < num; i++)
    {
        if (outdata[1] >= threshold)
        {
            Box box;
            box.class_idx = outdata[0];
            box.score = outdata[1];
            box.x0 = outdata[2] * raw_w;
            box.y0 = outdata[3] * raw_h;
            box.x1 = outdata[4] * raw_w;
            box.y1 = outdata[5] * raw_h;
            boxes.push_back(box);
            printf("%s\t:%.0f%%\n", class_names[box.class_idx], box.score * 100);
            printf("BOX:( %g , %g ),( %g , %g )\n", box.x0, box.y0, box.x1, box.y1);
        }
        outdata += 6;   // each detection is 6 floats: class, score, x0, y0, x1, y1
    }
#if 0
    // drawing is disabled; enable to paint boxes and labels onto the frame
    for (int i = 0; i < (int)boxes.size(); i++)
    {
        Box box = boxes[i];
        cv::rectangle(img, cv::Rect(box.x0, box.y0, (box.x1 - box.x0), (box.y1 - box.y0)), cv::Scalar(255, 255, 0), line_width);
        std::ostringstream score_str;
        score_str << box.score;
        std::string label = std::string(class_names[box.class_idx]) + ": " + score_str.str();
        int baseLine = 0;
        cv::Size label_size = cv::getTextSize(label, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);
        cv::rectangle(img, cv::Rect(cv::Point(box.x0, box.y0 - label_size.height),
                      cv::Size(label_size.width, label_size.height + baseLine)),
                      cv::Scalar(255, 255, 0), CV_FILLED);
        cv::putText(img, label, cv::Point(box.x0, box.y0),
                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));
    }
#endif
}

// state shared between the video thread and the detection thread
float outdata[15 * 6];
cv::Mat frame;
int detect_num;
bool quit_flag = false;
graph_t graph;
pthread_mutex_t m_frame, m_outdata, m_quit;

void* th_vedio(void*)
{
    //cv::VideoCapture capture(0);          // USB camera
    cv::VideoCapture capture("test.mp4");   // video file
    capture.set(CV_CAP_PROP_FRAME_WIDTH, 960);
    capture.set(CV_CAP_PROP_FRAME_HEIGHT, 540);
#if 0
    cv::namedWindow("MSSD", CV_WINDOW_NORMAL);
    cvResizeWindow("MSSD", 1280, 720);
#endif
    while (1)
    {
        float show_threshold = 0.25;
        pthread_mutex_lock(&m_frame);
        capture >> frame;
        pthread_mutex_lock(&m_outdata);
        post_process_ssd(frame, show_threshold, outdata, detect_num);
        pthread_mutex_unlock(&m_outdata);
#if 0
        cv::imshow("MSSD", frame);
#endif
        pthread_mutex_unlock(&m_frame);
        if (cv::waitKey(10) == 'q')
        {
            pthread_mutex_lock(&m_quit);
            quit_flag = true;
            pthread_mutex_unlock(&m_quit);
            break;
        }
        usleep(500000);
    }
    return NULL;
}

void* th_detect(void*)
{
    // input
    int img_h = 300;
    int img_w = 300;
    int img_size = img_h * img_w * 3;
    float* input_data = (float*)malloc(sizeof(float) * img_size);
    int node_idx = 0;
    int tensor_idx = 0;
    tensor_t input_tensor = get_graph_input_tensor(graph, node_idx, tensor_idx);
    if (!check_tensor_valid(input_tensor))
    {
        printf("Get input node failed : node_idx: %d, tensor_idx: %d\n", node_idx, tensor_idx);
        return NULL;
    }
    int dims[] = {1, 3, img_h, img_w};
    set_tensor_shape(input_tensor, dims, 4);
    prerun_graph(graph);
    int repeat_count = 1;
    const char* repeat = std::getenv("REPEAT_COUNT");
    if (repeat)
        repeat_count = std::strtoul(repeat, NULL, 10);
    int out_dim[4];
    tensor_t out_tensor;
    while (1)
    {
        pthread_mutex_lock(&m_quit);
        if (quit_flag)
        {
            pthread_mutex_unlock(&m_quit);   // unlock before leaving the loop
            break;
        }
        pthread_mutex_unlock(&m_quit);
        struct timeval t0, t1;
        float total_time = 0.f;
        for (int i = 0; i < repeat_count; i++)
        {
            pthread_mutex_lock(&m_frame);
            get_input_data_ssd(frame, input_data, img_h, img_w);
            pthread_mutex_unlock(&m_frame);
            gettimeofday(&t0, NULL);
            set_tensor_buffer(input_tensor, input_data, img_size * 4);
            run_graph(graph, 1);
            gettimeofday(&t1, NULL);
            float mytime = (float)((t1.tv_sec * 1000000 + t1.tv_usec) - (t0.tv_sec * 1000000 + t0.tv_usec)) / 1000;
            total_time += mytime;
        }
        std::cout << "--------------------------------------\n";
        std::cout << "repeat " << repeat_count << " times, avg time per run is " << total_time / repeat_count << " ms\n";
        out_tensor = get_graph_output_tensor(graph, 0, 0);
        get_tensor_shape(out_tensor, out_dim, 4);
        pthread_mutex_lock(&m_outdata);
        detect_num = out_dim[1] <= 15 ? out_dim[1] : 15;   // outdata holds at most 15 detections
        memcpy(outdata, get_tensor_buffer(out_tensor), sizeof(float) * detect_num * 6);
        pthread_mutex_unlock(&m_outdata);
    }
    free(input_data);
    return NULL;
}

int main(int argc, char* argv[])
{
    const std::string root_path = get_root_path();
    std::string proto_file;
    std::string model_file;
    int res;
    while ((res = getopt(argc, argv, "p:m:h")) != -1)
    {
        switch (res)
        {
            case 'p':
                proto_file = optarg;
                break;
            case 'm':
                model_file = optarg;
                break;
            case 'h':
                std::cout << "[Usage]: " << argv[0] << " [-h]\n"
                          << "    [-p proto_file] [-m model_file]\n";
                return 0;
            default:
                break;
        }
    }
    const char* model_name = "mssd_300";
    if (proto_file.empty())
    {
        proto_file = DEF_PROTO;
        std::cout << "proto file not specified,using " << proto_file << " by default\n";
    }
    if (model_file.empty())
    {
        model_file = DEF_MODEL;
        std::cout << "model file not specified,using " << model_file << " by default\n";
    }
    // init tengine
    init_tengine_library();
    if (request_tengine_version("0.1") < 0)
        return 1;
    if (load_model(model_name, "caffe", proto_file.c_str(), model_file.c_str()) < 0)
        return 1;
    std::cout << "load model done!\n";
    // create graph
    graph = create_runtime_graph("graph", model_name, NULL);
    if (!check_graph_valid(graph))
    {
        std::cout << "create graph0 failed\n";
        return 1;
    }
    pthread_mutex_init(&m_frame, NULL);
    pthread_mutex_init(&m_outdata, NULL);
    pthread_mutex_init(&m_quit, NULL);
    pthread_t id1, id2;
    pthread_create(&id1, NULL, th_vedio, NULL);
    pthread_create(&id2, NULL, th_detect, NULL);
    pthread_join(id1, NULL);
    pthread_join(id2, NULL);
    pthread_mutex_destroy(&m_frame);
    pthread_mutex_destroy(&m_outdata);
    pthread_mutex_destroy(&m_quit);
    postrun_graph(graph);
    destroy_runtime_graph(graph);
    remove_model(model_name);
    return 0;
}
With these changes in place, build and run again:
./MSSD
/home/dolphin/work/tengine/examples/test_mssd/MSSD
proto file not specified,using ../../models/MobileNetSSD_deploy.prototxt by default
model file not specified,using ../../models/MobileNetSSD_deploy.caffemodel by default
load model done!
detect result num: 0
--------------------------------------
repeat 1 times, avg time per run is 134.089 ms
--------------------------------------
repeat 1 times, avg time per run is 120.139 ms
--------------------------------------
repeat 1 times, avg time per run is 124.079 ms
detect result num: 7
aeroplane :85%
BOX:( 454.456 , 65.2393 ),( 647.656 , 150.583 )
motorbike :82%
BOX:( 769.058 , 193.725 ),( 1017.29 , 404.13 )
diningtable :80%
BOX:( -10.7789 , 273.882 ),( 1049.22 , 603.307 )
chair :70%
BOX:( 235.559 , 190.266 ),( 401.388 , 467.526 )
bird :69%
BOX:( 564.433 , 367.69 ),( 808.597 , 480.245 )
person :36%
BOX:( 796.151 , 190.622 ),( 985.793 , 380.198 )
pottedplant :31%
BOX:( 2.61609 , 2.01261 ),( 278.975 , 341.759 )
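The input normalization used in get_input_data_ssd() above (subtract the per-channel mean 127.5, scale by 0.007843, and convert HWC pixel order to CHW layout) can be cross-checked with a small pure-Python reference. This is a sketch for illustration only; the resize step is omitted:

```python
# Reference implementation of the MSSD input preprocessing:
# out[c][y][x] = 0.007843 * (pixel[y][x][c] - 127.5), flattened CHW.
def preprocess(pixels_hwc, h, w, mean=127.5, scale=0.007843):
    """pixels_hwc: flat list of h*w*3 floats in HWC order -> flat CHW list."""
    hw = h * w
    out = [0.0] * (3 * hw)
    i = 0
    for y in range(h):
        for x in range(w):
            for c in range(3):
                out[c * hw + y * w + x] = scale * (pixels_hwc[i] - mean)
                i += 1
    return out
```

Since scale is approximately 1/127.5, the result lands in roughly [-1, 1]: a pixel value of 127.5 maps to 0, and 255 maps to about +1.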
GPU/CPU heterogeneous scheduling in Tengine
Tengine splits the computation graph by operator, and the resulting subgraphs are assigned to the appropriate devices by the scheduler. Since GPU programming is comparatively complex, the commonly used neural-network operators (e.g. CONV, POOL, FC) are supported on the GPU first, while operators specific to particular networks (e.g. PRIORBOX in the SSD detection network) are assigned to the CPU.
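As an illustration only (this is not Tengine's actual scheduler), the per-operator split can be sketched as grouping consecutive operators by whether the GPU supports them; the GPU op set below is an assumption based on the examples in the text:

```python
# Toy subgraph partitioner: walk the operators in order and group
# consecutive ops onto the GPU when supported, otherwise onto the CPU.
GPU_OPS = {"CONV", "POOL", "FC"}  # assumed GPU-supported op set

def partition(ops):
    """ops: list of op names -> list of (device, [ops]) subgraphs."""
    subgraphs = []
    for op in ops:
        dev = "GPU" if op in GPU_OPS else "CPU"
        if subgraphs and subgraphs[-1][0] == dev:
            subgraphs[-1][1].append(op)   # extend the current subgraph
        else:
            subgraphs.append((dev, [op]))  # start a new subgraph on the other device
    return subgraphs
```

For an SSD-like sequence, an unsupported op such as PRIORBOX forces a CPU subgraph in the middle, which is exactly why cross-device data transfers matter for performance.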
Tengine implements this heterogeneous scheduling on the RK3399, making full use of the chip's compute capability to speed up inference.
The RK3399's GPU is a Mali-T860; its CPU combines a dual-core Cortex-A72 with a quad-core Cortex-A53.
To get the most out of the GPU, set it to its highest frequency:
sudo su
echo "performance" > /sys/devices/platform/ff9a0000.gpu/devfreq/ff9a0000.gpu/governor
cat /sys/devices/platform/ff9a0000.gpu/devfreq/ff9a0000.gpu/cur_freq
800000000
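These sysfs reads can also be scripted. A minimal sketch, assuming the RK3399 device paths shown above:

```python
# Check that the GPU governor and current frequency were applied,
# by reading the devfreq sysfs nodes (RK3399 path assumed).
GPU_DEVFREQ = "/sys/devices/platform/ff9a0000.gpu/devfreq/ff9a0000.gpu"

def read_sysfs(node, base=GPU_DEVFREQ):
    """Read a single sysfs attribute and strip the trailing newline."""
    with open(f"{base}/{node}") as f:
        return f.read().strip()

# Example (run on the board):
# print(read_sysfs("governor"), read_sysfs("cur_freq"))
```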
Building the dependencies
Tengine performs GPU acceleration by calling the Arm Compute Library (ACL); the version used here is 18.05. Fetch the code from GitHub and build it, noting the path it lives in, since it is referenced in the next step:
git clone https://github.com/ARM-software/ComputeLibrary.git
cd ComputeLibrary
git checkout v18.05
scons Werror=1 -j4 debug=0 asserts=1 neon=0 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a
Download the Tengine project:
git clone https://github.com/OAID/Tengine.git
cd Tengine
cp makefile.config.example makefile.config
vim makefile.config
In the configuration file, enable the ACL option and set the ACL path from the previous step:
CONFIG_ACL_GPU=y
ACL_ROOT=/home/dolphin/ComputeLibrary
Build and install:
make -j4
make install
Download the MobileNet SSD model from Tengine_models | Baidu Cloud (extraction code: 57vb) into the tengine/models/ directory.
The example folder in the tengine directory contains a mobilenet_ssd subdirectory. Open its CMakeLists.txt and, before the line set( INSTALL_DIR ${TENGINE_DIR}/install/), add a statement setting the value of TENGINE_DIR:
set( TENGINE_DIR ~/work/Tengine )
After cmake finishes the configuration automatically, run make to build:
cmake .
make
Some environment variables need to be set before running:
export GPU_CONCAT=0      # disable running concat on the GPU, avoiding frequent data transfers between CPU and GPU
export ACL_FP16=1        # let the GPU run inference in the float16 data format
export REPEAT_COUNT=100  # repeat the run 100 times and take the average time as the performance figure
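The benchmark sweep in the session below (different TENGINE_CPU_LIST values, FP16 on and off) can also be driven from a small Python wrapper instead of exporting variables by hand. This is a sketch; the ./MSSD path and the variable names are taken from the session itself:

```python
# Build the command and environment for one MSSD benchmark configuration.
import os

def mssd_cmd(cpu_list=None, acl_fp16=True, use_gpu=False, repeat=100):
    """Return (argv, env) for one run; pass env to subprocess.run(cmd, env=env)."""
    env = dict(os.environ, GPU_CONCAT="0", REPEAT_COUNT=str(repeat))
    if acl_fp16:
        env["ACL_FP16"] = "1"
    else:
        env.pop("ACL_FP16", None)   # mirrors `unset ACL_FP16`
    if cpu_list is not None:
        env["TENGINE_CPU_LIST"] = ",".join(str(c) for c in cpu_list)
    cmd = ["./MSSD"] + (["-d", "acl_opencl"] if use_gpu else [])
    return cmd, env

# Example (not executed here): run with the GPU plus one A72 core
# cmd, env = mssd_cmd(cpu_list=[4], use_gpu=True)
# subprocess.run(cmd, env=env)
```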
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ ./MSSD
ACL Graph Initialized
Driver: ACLGraph probed 1 devices
repeat 100 times, avg time per run is 196.927 ms
detect result num: 3
dog :100%
BOX:( 138.509 , 209.394 ),( 324.57 , 541.314 )
car :100%
BOX:( 467.315 , 72.8045 ),( 687.269 , 171.128 )
bicycle :100%
BOX:( 107.395 , 140.657 ),( 574.212 , 415.188 )
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ export TENGINE_CPU_LIST=4
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ ./MSSD
ENV SET: [4]
ACL Graph Initialized
Driver: ACLGraph probed 1 devices
repeat 100 times, avg time per run is 313.66 ms
detect result num: 3
dog :100%
BOX:( 138.509 , 209.394 ),( 324.57 , 541.314 )
car :100%
BOX:( 467.315 , 72.8045 ),( 687.269 , 171.128 )
bicycle :100%
BOX:( 107.395 , 140.657 ),( 574.212 , 415.188 )
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ export TENGINE_CPU_LIST=4,5
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ ./MSSD
ENV SET: [4,5]
ACL Graph Initialized
Driver: ACLGraph probed 1 devices
repeat 100 times, avg time per run is 241.372 ms
detect result num: 3
dog :100%
BOX:( 138.509 , 209.394 ),( 324.57 , 541.314 )
car :100%
BOX:( 467.315 , 72.8045 ),( 687.269 , 171.128 )
bicycle :100%
BOX:( 107.395 , 140.657 ),( 574.212 , 415.188 )
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ export TENGINE_CPU_LIST=0,1,2,3
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ ./MSSD
ENV SET: [0,1,2,3]
ACL Graph Initialized
Driver: ACLGraph probed 1 devices
repeat 100 times, avg time per run is 221.02 ms
detect result num: 3
dog :100%
BOX:( 138.509 , 209.394 ),( 324.57 , 541.314 )
car :100%
BOX:( 467.315 , 72.8045 ),( 687.269 , 171.128 )
bicycle :100%
BOX:( 107.395 , 140.657 ),( 574.212 , 415.188 )
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ unset TENGINE_CPU_LIST
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ export GPU_CONCAT=0
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ export ACL_FP16=1
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ taskset 0x4 ./MSSD -d acl_opencl
/home/pi/work/Tengine/examples/mobilenet_ssd/MSSD
ACL Graph Initialized
Driver: ACLGraph probed 1 devices
repeat 100 times, avg time per run is 202.103 ms
detect result num: 3
dog :100%
BOX:( 138.419 , 209.091 ),( 324.504 , 541.568 )
car :100%
BOX:( 467.356 , 72.9224 ),( 687.269 , 171.123 )
bicycle :100%
BOX:( 107.053 , 140.221 ),( 574.472 , 415.248 )
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ unset ACL_FP16
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ taskset 0x4 ./MSSD -d acl_opencl
/home/pi/work/Tengine/examples/mobilenet_ssd/MSSD
ACL Graph Initialized
Driver: ACLGraph probed 1 devices
repeat 100 times, avg time per run is 272.369 ms
detect result num: 3
dog :100%
BOX:( 138.509 , 209.394 ),( 324.57 , 541.314 )
car :100%
BOX:( 467.315 , 72.8045 ),( 687.269 , 171.128 )
bicycle :100%
BOX:( 107.395 , 140.657 ),( 574.212 , 415.188 )
Pass -d acl_opencl on the command line to enable the GPU.
As the output above shows, the detection results with the GPU running at half precision (float16) are correct.
The following table compares these runs against Tengine's pure-CPU MobileNet SSD inference:
Runtime configuration | Run time (ms) | vs. 4×A53 baseline |
---|---|---|
CPU: 2×A72 + 4×A53 | 190.927 | +14% |
CPU: 1×A72 | 313.66 | -42% |
CPU: 2×A72 | 241.372 | -9% |
CPU: 4×A53 | 221.02 | 0% |
GPU (FP16) + CPU: 1×A72 | 202.103 | +9% |
GPU (FP32) + CPU: 1×A72 | 272.369 | -23% |
As the table shows, GPU/CPU heterogeneous scheduling delivers roughly the performance of the two big A72 cores, or of the four little A53 cores, while using all six cores together is fastest.
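The percentages in the last column can be reproduced from the measured times, taking the 4×A53 run as the 0% baseline (positive means faster than the baseline):

```python
# Reproduce the "vs. baseline" column from the measured run times.
times_ms = {
    "2A72+4A53": 190.927,
    "1A72": 313.66,
    "2A72": 241.372,
    "4A53": 221.02,          # baseline
    "GPU FP16 + 1A72": 202.103,
    "GPU FP32 + 1A72": 272.369,
}

def relative_gain(t, baseline=221.02):
    """Percentage reduction in run time versus the baseline, rounded."""
    return round(100.0 * (baseline - t) / baseline)

for name, t in times_ms.items():
    print(f"{name}: {relative_gain(t):+d}%")
```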