Introduction to Tengine
OAID/Tengine | GitHub
Tengine is a lightweight, high-performance, modular inference engine developed by OPEN AI LAB for embedded devices.
On embedded devices, Tengine provides a computing framework with heterogeneous CPU/GPU/DLA-NPU/DSP support. Its main features include a scheduler for heterogeneous computing; efficient compute-library implementations for the ARM platform; performance optimizations for specific hardware platforms; dynamically planned memory usage for the computation graph; access to remote AI compute capability over the network; multiple levels of parallelism; fully detachable system modules; an event-driven computation model; and a newly designed computation-graph representation that draws on the strengths of existing AI frameworks.
Installing Tengine
Setting up the build environment
sudo apt install git cmake
sudo apt install libprotobuf-dev protobuf-compiler libboost-all-dev libgoogle-glog-dev
Downloading the source and configuring build options
git clone https://github.com/OAID/tengine.git
cd ~/tengine
cp makefile.config.example makefile.config
vim makefile.config
Modify it as follows:
# Set the target arch
CONFIG_ARCH_ARM64=y
# Enable Compiling Optimization
CONFIG_OPT_CFLAGS=-O2
# Use BLAS as the operator implementation
CONFIG_ARCH_BLAS=y
# Enable GPU support by Arm Computing Library
# CONFIG_ACL_GPU=y
# Set the path of ACL
# ACL_ROOT=/home/firefly/ComputeLibrary
# Enable other serializers
CONFIG_CAFFE_SERIALIZER=y
CONFIG_MXNET_SERIALIZER=y
CONFIG_ONNX_SERIALIZER=y
CONFIG_TF_SERIALIZER=y
CONFIG_TENGINE_SERIALIZER=y
Linking OpenCV
Tengine needs to link against OpenCV. Generate the linker configuration as follows:
sudo vim /etc/ld.so.conf
Add the OpenCV install path to the opened file:
/usr/local/lib
Then run sudo ldconfig to complete the configuration.
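To verify that the dynamic linker can now resolve the OpenCV libraries, one option is to query the linker cache from Python via ctypes.util.find_library. This is a sketch; the library names passed in are examples:

```python
# Check whether the dynamic linker can resolve a shared library,
# e.g. after adding /usr/local/lib to /etc/ld.so.conf and running ldconfig.
from ctypes.util import find_library

def lib_in_cache(name):
    """Return the soname the linker resolves `name` to, or None if not found."""
    return find_library(name)

if __name__ == "__main__":
    # Example library names; adjust to the modules actually installed.
    for name in ("opencv_core", "opencv_imgproc"):
        path = lib_in_cache(name)
        print(name, "->", path if path else "NOT FOUND (re-check /etc/ld.so.conf and ldconfig)")
```

If a library prints NOT FOUND, re-check the path added to /etc/ld.so.conf and re-run sudo ldconfig.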
If the OpenCV libraries under /usr/local/lib are all versioned ".so" files, Tengine will not be able to link against them, and unversioned links must be created. The following script generates the ".so" symlinks:
# coding=utf-8
# Create unversioned ".so" symlinks for versioned OpenCV libraries,
# e.g. libopencv_text.so -> libopencv_text.so.3.4.5
import os

file_list = []

def listdir(folder, file_list):
    for name in os.listdir(folder):
        filepath = os.path.join(folder, name)
        if os.path.isfile(filepath):
            file_list.append(name)

def ChangeFileName(folder, file_list):
    for old_file_name in file_list:
        new_file_name = old_file_name.replace(".so.3.4.5", ".so")
        if new_file_name == old_file_name:
            continue
        if os.path.exists(os.path.join(folder, new_file_name)):
            print("file exists: " + new_file_name)
        else:
            cmd = "sudo ln -s " + old_file_name + " " + new_file_name
            print(cmd)
            os.system(cmd)

folder = os.getcwd()
print(folder)
listdir(folder, file_list)
ChangeFileName(folder, file_list)
Save the script as "filename.py", copy it to /usr/local/lib, and run it with sudo python3 filename.py to create the symlinks. As shown below, library files ending in .so are now linked to the .so.3.4.5 library files:
-rw-r--r-- 1 root root 365064 Jan 31 06:28 libopencv_surface_matching.so.3.4.5
lrwxrwxrwx 1 root root 23 Feb 19 00:09 libopencv_text.so -> libopencv_text.so.3.4.5
lrwxrwxrwx 1 root root 23 Feb 15 06:00 libopencv_text.so.3.4 -> libopencv_text.so.3.4.5
-rw-r--r-- 1 root root 428928 Jan 31 07:06 libopencv_text.so.3.4.5
lrwxrwxrwx 1 root root 27 Feb 19 00:08 libopencv_tracking.so -> libopencv_tracking.so.3.4.5
lrwxrwxrwx 1 root root 27 Feb 15 06:00 libopencv_tracking.so.3.4 -> libopencv_tracking.so.3.4.5
-rw-r--r-- 1 root root 2336240 Jan 31 07:38 libopencv_tracking.so.3.4.5
lrwxrwxrwx 1 root root 26 Feb 19 00:33 libopencv_videoio.so -> libopencv_videoio.so.3.4.5
lrwxrwxrwx 1 root root 26 Feb 15 06:00 libopencv_videoio.so.3.4 -> libopencv_videoio.so.3.4.5
-rw-r--r-- 1 root root 369296 Jan 31 05:55 libopencv_videoio.so.3.4.5
lrwxrwxrwx 1 root root 24 Feb 19 00:10 libopencv_video.so -> libopencv_video.so.3.4.5
lrwxrwxrwx 1 root root 24 Feb 15 06:00 libopencv_video.so.3.4 -> libopencv_video.so.3.4.5
-rw-r--r-- 1 root root 423112 Jan 31 06:29 libopencv_video.so.3.4.5
lrwxrwxrwx 1 root root 28 Feb 19 03:53 libopencv_videostab.so -> libopencv_videostab.so.3.4.5
lrwxrwxrwx 1 root root 28 Feb 15 06:00 libopencv_videostab.so.3.4 -> libopencv_videostab.so.3.4.5
-rw-r--r-- 1 root root 365104 Jan 31 07:40 libopencv_videostab.so.3.4.5
lrwxrwxrwx 1 root root 30 Feb 18 15:08 libopencv_xfeatures2d.so -> libopencv_xfeatures2d.so.3.4.5
lrwxrwxrwx 1 root root 30 Feb 15 06:00 libopencv_xfeatures2d.so.3.4 -> libopencv_xfeatures2d.so.3.4.5
-rw-r--r-- 1 root root 2836736 Jan 31 07:43 libopencv_xfeatures2d.so.3.4.5
lrwxrwxrwx 1 root root 27 Feb 19 00:11 libopencv_ximgproc.so -> libopencv_ximgproc.so.3.4.5
lrwxrwxrwx 1 root root 27 Feb 15 06:00 libopencv_ximgproc.so.3.4 -> libopencv_ximgproc.so.3.4.5
-rw-r--r-- 1 root root 1299184 Jan 31 07:48 libopencv_ximgproc.so.3.4.5
lrwxrwxrwx 1 root root 29 Feb 19 00:11 libopencv_xobjdetect.so -> libopencv_xobjdetect.so.3.4.5
lrwxrwxrwx 1 root root 29 Feb 15 06:00 libopencv_xobjdetect.so.3.4 -> libopencv_xobjdetect.so.3.4.5
-rw-r--r-- 1 root root 99232 Jan 31 07:51 libopencv_xobjdetect.so.3.4.5
lrwxrwxrwx 1 root root 25 Feb 19 03:53 libopencv_xphoto.so -> libopencv_xphoto.so.3.4.5
lrwxrwxrwx 1 root root 25 Feb 15 06:00 libopencv_xphoto.so.3.4 -> libopencv_xphoto.so.3.4.5
-rw-r--r-- 1 root root 242880 Jan 31 06:32 libopencv_xphoto.so.3.4.5
Building and testing
Build in the tengine directory, then run the test programs once the build finishes:
sudo make
sudo make install
./build/tests/bin/bench_sqz
run-time library version: 1.0.0-github
REPEAT COUNT= 100
Repeat [100] time 55990.35 us per RUN. used 5599035 us
0.2763 - "n02123045 tabby, tabby cat"
0.2673 - "n02123159 tiger cat"
0.1766 - "n02119789 kit fox, Vulpes macrotis"
0.0827 - "n02124075 Egyptian cat"
0.0777 - "n02085620 Chihuahua"
--------------------------------------
ALL TEST DONE
./build/tests/bin/bench_mobilenet
run-time library version: 1.0.0-github
REPEAT COUNT= 100
Repeat [100] time 56649.14 us per RUN. used 5664914 us
8.5976 - "n02123159 tiger cat"
7.9550 - "n02119022 red fox, Vulpes vulpes"
7.8679 - "n02119789 kit fox, Vulpes macrotis"
7.4274 - "n02113023 Pembroke, Pembroke Welsh corgi"
6.3646 - "n02123045 tabby, tabby cat"
ALL TEST DONE
Running the bundled MobileNet SSD
Building the example code
The example folder in the tengine directory contains a mobilenet_ssd subdirectory. Open its CMakeLists.txt and, before the line set( INSTALL_DIR ${TENGINE_DIR}/install/), add a statement setting the value of TENGINE_DIR:
set( TENGINE_DIR ~/work/Tengine )
Downloading the model files
Tengine provides pre-trained models for download: Tengine_models | Baidu Cloud (extraction code: 57vb).
Find the mobilenet_ssd folder there, download MobileNetSSD_deploy.prototxt and MobileNetSSD_deploy.caffemodel, and put them in the ./models directory. The following commands fetch models that were placed in an FTP folder onto the target board:
wget ftp://192.168.199.1/sda1/MobileNetSSD_deploy.prototxt --ftp-user=root --ftp-password="password"
wget ftp://192.168.199.1/sda1/MobileNetSSD_deploy.caffemodel --ftp-user=root --ftp-password="password"
Building
Build the MobileNet SSD example:
cd ~/work/Tengine/examples/mobilenet_ssd
cmake .
make
./MSSD -i test.jpg
/home/dolphin/work/tengine/examples/mobilenet_ssd/MSSD
proto file not specified,using /home/dolphin/work/tengine/models/MobileNetSSD_deploy.prototxt by default
model file not specified,using /home/dolphin/work/tengine/models/MobileNetSSD_deploy.caffemodel by default
--------------------------------------
repeat 1 times, avg time per run is 118.913 ms
detect result num: 6
dog :99%
BOX:( 322.588 , 232.231 ),( 455.996 , 330.833 )
person :99%
BOX:( 213.043 , 153.082 ),( 310.846 , 322.655 )
person :96%
BOX:( 536.058 , 76.8777 ),( 709.835 , 391.781 )
dog :90%
BOX:( 177.256 , 296.386 ),( 258.995 , 461.81 )
person :89%
BOX:( 499.474 , 72.645 ),( 619.208 , 369.286 )
person :74%
BOX:( 149.663 , 130.89 ),( 217.314 , 245.324 )
======================================
[DETECTED IMAGE SAVED]: save.jpg
======================================
After the run, an annotated image "save.jpg" with the detection boxes drawn on it is saved.
Running MobileNet SSD on video
Modify the code of mssd.cpp as follows:
#include <unistd.h>
#include <iostream>
#include <iomanip>
#include <string>
#include <vector>
#include <cstdlib>
#include <cstring>
#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"
#include "tengine_c_api.h"
#include <sys/time.h>
#include <stdio.h>
#include "common.hpp"
#include <pthread.h>
#include <sched.h>

#define DEF_PROTO "../../models/MobileNetSSD_deploy.prototxt"
#define DEF_MODEL "../../models/MobileNetSSD_deploy.caffemodel"

struct Box
{
    float x0;
    float y0;
    float x1;
    float y1;
    int class_idx;
    float score;
};

void get_input_data_ssd(cv::Mat img, float* input_data, int img_h, int img_w)
{
    if (img.empty())
    {
        std::cerr << "Failed to read image from camera.\n";
        return;
    }
    cv::resize(img, img, cv::Size(img_h, img_w));
    img.convertTo(img, CV_32FC3);
    float* img_data = (float*)img.data;
    int hw = img_h * img_w;
    float mean[3] = {127.5, 127.5, 127.5};
    // convert interleaved HWC pixels to normalized CHW input
    for (int h = 0; h < img_h; h++)
    {
        for (int w = 0; w < img_w; w++)
        {
            for (int c = 0; c < 3; c++)
            {
                input_data[c * hw + h * img_w + w] = 0.007843 * (*img_data - mean[c]);
                img_data++;
            }
        }
    }
}

void post_process_ssd(cv::Mat img, float threshold, float* outdata, int num)
{
#if 0
    const char* class_names[] = {"background",
                                 "airplane", "bicycle", "bird", "boat",
                                 "bus", "car", "chair", "dog", "motorcycle",
                                 "panther", "tiger"};
#else
    const char* class_names[] = {"background", "aeroplane", "bicycle", "bird", "boat", "bottle",
                                 "bus", "car", "cat", "chair", "cow", "diningtable",
                                 "dog", "horse", "motorbike", "person", "pottedplant", "sheep",
                                 "sofa", "train", "tvmonitor"};
#endif
    int raw_h = img.size().height;
    int raw_w = img.size().width;
    std::vector<Box> boxes;
    int line_width = raw_w * 0.002;
    printf("detect result num: %d \n", num);
    for (int i = 0; i < num; i++)
    {
        if (outdata[1] >= threshold)
        {
            Box box;
            box.class_idx = outdata[0];
            box.score = outdata[1];
            box.x0 = outdata[2] * raw_w;
            box.y0 = outdata[3] * raw_h;
            box.x1 = outdata[4] * raw_w;
            box.y1 = outdata[5] * raw_h;
            boxes.push_back(box);
            printf("%s\t:%.0f%%\n", class_names[box.class_idx], box.score * 100);
            printf("BOX:( %g , %g ),( %g , %g )\n", box.x0, box.y0, box.x1, box.y1);
        }
        outdata += 6;   // each detection is 6 floats: class, score, x0, y0, x1, y1
    }
#if 0
    // drawing is disabled; enable to paint boxes and labels onto the frame
    for (int i = 0; i < (int)boxes.size(); i++)
    {
        Box box = boxes[i];
        cv::rectangle(img, cv::Rect(box.x0, box.y0, (box.x1 - box.x0), (box.y1 - box.y0)), cv::Scalar(255, 255, 0), line_width);
        std::ostringstream score_str;
        score_str << box.score;
        std::string label = std::string(class_names[box.class_idx]) + ": " + score_str.str();
        int baseLine = 0;
        cv::Size label_size = cv::getTextSize(label, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);
        cv::rectangle(img, cv::Rect(cv::Point(box.x0, box.y0 - label_size.height),
                      cv::Size(label_size.width, label_size.height + baseLine)),
                      cv::Scalar(255, 255, 0), CV_FILLED);
        cv::putText(img, label, cv::Point(box.x0, box.y0),
                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));
    }
#endif
}

// state shared between the video thread and the detection thread
float outdata[15 * 6];
cv::Mat frame;
int detect_num;
bool quit_flag = false;
graph_t graph;
pthread_mutex_t m_frame, m_outdata, m_quit;

void* th_vedio(void*)
{
    //cv::VideoCapture capture(0);          // USB camera
    cv::VideoCapture capture("test.mp4");   // video file
    capture.set(CV_CAP_PROP_FRAME_WIDTH, 960);
    capture.set(CV_CAP_PROP_FRAME_HEIGHT, 540);
#if 0
    cv::namedWindow("MSSD", CV_WINDOW_NORMAL);
    cvResizeWindow("MSSD", 1280, 720);
#endif
    while (1)
    {
        float show_threshold = 0.25;
        pthread_mutex_lock(&m_frame);
        capture >> frame;
        pthread_mutex_lock(&m_outdata);
        post_process_ssd(frame, show_threshold, outdata, detect_num);
        pthread_mutex_unlock(&m_outdata);
#if 0
        cv::imshow("MSSD", frame);
#endif
        pthread_mutex_unlock(&m_frame);
        if (cv::waitKey(10) == 'q')
        {
            pthread_mutex_lock(&m_quit);
            quit_flag = true;
            pthread_mutex_unlock(&m_quit);
            break;
        }
        usleep(500000);
    }
    return NULL;
}

void* th_detect(void*)
{
    // input
    int img_h = 300;
    int img_w = 300;
    int img_size = img_h * img_w * 3;
    float* input_data = (float*)malloc(sizeof(float) * img_size);
    int node_idx = 0;
    int tensor_idx = 0;
    tensor_t input_tensor = get_graph_input_tensor(graph, node_idx, tensor_idx);
    if (!check_tensor_valid(input_tensor))
    {
        printf("Get input node failed : node_idx: %d, tensor_idx: %d\n", node_idx, tensor_idx);
        return NULL;
    }
    int dims[] = {1, 3, img_h, img_w};
    set_tensor_shape(input_tensor, dims, 4);
    prerun_graph(graph);
    int repeat_count = 1;
    const char* repeat = std::getenv("REPEAT_COUNT");
    if (repeat)
        repeat_count = std::strtoul(repeat, NULL, 10);
    int out_dim[4];
    tensor_t out_tensor;
    while (1)
    {
        pthread_mutex_lock(&m_quit);
        if (quit_flag)
        {
            pthread_mutex_unlock(&m_quit);   // unlock before leaving the loop
            break;
        }
        pthread_mutex_unlock(&m_quit);
        struct timeval t0, t1;
        float total_time = 0.f;
        for (int i = 0; i < repeat_count; i++)
        {
            pthread_mutex_lock(&m_frame);
            get_input_data_ssd(frame, input_data, img_h, img_w);
            pthread_mutex_unlock(&m_frame);
            gettimeofday(&t0, NULL);
            set_tensor_buffer(input_tensor, input_data, img_size * 4);
            run_graph(graph, 1);
            gettimeofday(&t1, NULL);
            float mytime = (float)((t1.tv_sec * 1000000 + t1.tv_usec) - (t0.tv_sec * 1000000 + t0.tv_usec)) / 1000;
            total_time += mytime;
        }
        std::cout << "--------------------------------------\n";
        std::cout << "repeat " << repeat_count << " times, avg time per run is " << total_time / repeat_count << " ms\n";
        out_tensor = get_graph_output_tensor(graph, 0, 0);
        get_tensor_shape(out_tensor, out_dim, 4);
        pthread_mutex_lock(&m_outdata);
        detect_num = out_dim[1] <= 15 ? out_dim[1] : 15;   // outdata holds at most 15 detections
        memcpy(outdata, get_tensor_buffer(out_tensor), sizeof(float) * detect_num * 6);
        pthread_mutex_unlock(&m_outdata);
    }
    free(input_data);
    return NULL;
}

int main(int argc, char* argv[])
{
    const std::string root_path = get_root_path();
    std::string proto_file;
    std::string model_file;
    int res;
    while ((res = getopt(argc, argv, "p:m:h")) != -1)
    {
        switch (res)
        {
            case 'p':
                proto_file = optarg;
                break;
            case 'm':
                model_file = optarg;
                break;
            case 'h':
                std::cout << "[Usage]: " << argv[0] << " [-h]\n"
                          << "    [-p proto_file] [-m model_file]\n";
                return 0;
            default:
                break;
        }
    }
    const char* model_name = "mssd_300";
    if (proto_file.empty())
    {
        proto_file = DEF_PROTO;
        std::cout << "proto file not specified,using " << proto_file << " by default\n";
    }
    if (model_file.empty())
    {
        model_file = DEF_MODEL;
        std::cout << "model file not specified,using " << model_file << " by default\n";
    }
    // init tengine
    init_tengine_library();
    if (request_tengine_version("0.1") < 0)
        return 1;
    if (load_model(model_name, "caffe", proto_file.c_str(), model_file.c_str()) < 0)
        return 1;
    std::cout << "load model done!\n";
    // create graph
    graph = create_runtime_graph("graph", model_name, NULL);
    if (!check_graph_valid(graph))
    {
        std::cout << "create graph0 failed\n";
        return 1;
    }
    pthread_mutex_init(&m_frame, NULL);
    pthread_mutex_init(&m_outdata, NULL);
    pthread_mutex_init(&m_quit, NULL);
    pthread_t id1, id2;
    pthread_create(&id1, NULL, th_vedio, NULL);
    pthread_create(&id2, NULL, th_detect, NULL);
    pthread_join(id1, NULL);
    pthread_join(id2, NULL);
    pthread_mutex_destroy(&m_frame);
    pthread_mutex_destroy(&m_outdata);
    pthread_mutex_destroy(&m_quit);
    postrun_graph(graph);
    destroy_runtime_graph(graph);
    remove_model(model_name);
    return 0;
}
With these changes in place, build and run again:
./MSSD
/home/dolphin/work/tengine/examples/test_mssd/MSSD
proto file not specified,using ../../models/MobileNetSSD_deploy.prototxt by default
model file not specified,using ../../models/MobileNetSSD_deploy.caffemodel by default
load model done!
detect result num: 0
--------------------------------------
repeat 1 times, avg time per run is 134.089 ms
--------------------------------------
repeat 1 times, avg time per run is 120.139 ms
--------------------------------------
repeat 1 times, avg time per run is 124.079 ms
detect result num: 7
aeroplane :85%
BOX:( 454.456 , 65.2393 ),( 647.656 , 150.583 )
motorbike :82%
BOX:( 769.058 , 193.725 ),( 1017.29 , 404.13 )
diningtable :80%
BOX:( -10.7789 , 273.882 ),( 1049.22 , 603.307 )
chair :70%
BOX:( 235.559 , 190.266 ),( 401.388 , 467.526 )
bird :69%
BOX:( 564.433 , 367.69 ),( 808.597 , 480.245 )
person :36%
BOX:( 796.151 , 190.622 ),( 985.793 , 380.198 )
pottedplant :31%
BOX:( 2.61609 , 2.01261 ),( 278.975 , 341.759 )
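The input normalization used in get_input_data_ssd() above (subtract the per-channel mean 127.5, scale by 0.007843, and convert HWC pixel order to CHW layout) can be cross-checked with a small pure-Python reference. This is a sketch for illustration only; the resize step is omitted:

```python
# Reference implementation of the MSSD input preprocessing:
# out[c][y][x] = 0.007843 * (pixel[y][x][c] - 127.5), flattened CHW.
def preprocess(pixels_hwc, h, w, mean=127.5, scale=0.007843):
    """pixels_hwc: flat list of h*w*3 floats in HWC order -> flat CHW list."""
    hw = h * w
    out = [0.0] * (3 * hw)
    i = 0
    for y in range(h):
        for x in range(w):
            for c in range(3):
                out[c * hw + y * w + x] = scale * (pixels_hwc[i] - mean)
                i += 1
    return out
```

Since scale is approximately 1/127.5, the result lands in roughly [-1, 1]: a pixel value of 127.5 maps to 0, and 255 maps to about +1.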
GPU/CPU heterogeneous scheduling in Tengine
Tengine splits the computation graph by operator, and the resulting subgraphs are assigned to the appropriate devices by the scheduler. Since GPU programming is comparatively complex, the commonly used neural-network operators (e.g. CONV, POOL, FC) are supported on the GPU first, while operators specific to particular networks (e.g. PRIORBOX in the SSD detection network) are assigned to the CPU.
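As an illustration only (this is not Tengine's actual scheduler), the per-operator split can be sketched as grouping consecutive operators by whether the GPU supports them; the GPU op set below is an assumption based on the examples in the text:

```python
# Toy subgraph partitioner: walk the operators in order and group
# consecutive ops onto the GPU when supported, otherwise onto the CPU.
GPU_OPS = {"CONV", "POOL", "FC"}  # assumed GPU-supported op set

def partition(ops):
    """ops: list of op names -> list of (device, [ops]) subgraphs."""
    subgraphs = []
    for op in ops:
        dev = "GPU" if op in GPU_OPS else "CPU"
        if subgraphs and subgraphs[-1][0] == dev:
            subgraphs[-1][1].append(op)   # extend the current subgraph
        else:
            subgraphs.append((dev, [op]))  # start a new subgraph on the other device
    return subgraphs
```

For an SSD-like sequence, an unsupported op such as PRIORBOX forces a CPU subgraph in the middle, which is exactly why cross-device data transfers matter for performance.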
Tengine implements this heterogeneous scheduling on the RK3399, making full use of the chip's compute capability to speed up inference.
The RK3399's GPU is a Mali-T860; its CPU combines a dual-core Cortex-A72 with a quad-core Cortex-A53.
To get the most out of the GPU, set it to its highest frequency:
sudo su
echo "performance" > /sys/devices/platform/ff9a0000.gpu/devfreq/ff9a0000.gpu/governor
cat /sys/devices/platform/ff9a0000.gpu/devfreq/ff9a0000.gpu/cur_freq
800000000
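These sysfs reads can also be scripted. A minimal sketch, assuming the RK3399 device paths shown above:

```python
# Check that the GPU governor and current frequency were applied,
# by reading the devfreq sysfs nodes (RK3399 path assumed).
GPU_DEVFREQ = "/sys/devices/platform/ff9a0000.gpu/devfreq/ff9a0000.gpu"

def read_sysfs(node, base=GPU_DEVFREQ):
    """Read a single sysfs attribute and strip the trailing newline."""
    with open(f"{base}/{node}") as f:
        return f.read().strip()

# Example (run on the board):
# print(read_sysfs("governor"), read_sysfs("cur_freq"))
```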
Building the dependencies
Tengine performs GPU acceleration by calling the Arm Compute Library (ACL); the version used here is 18.05. Fetch the code from GitHub and build it, noting the path it lives in, since it is referenced in the next step:
git clone https://github.com/ARM-software/ComputeLibrary.git
cd ComputeLibrary
git checkout v18.05
scons Werror=1 -j4 debug=0 asserts=1 neon=0 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a
Download the Tengine project:
git clone https://github.com/OAID/Tengine.git
cd Tengine
cp makefile.config.example makefile.config
vim makefile.config
In the configuration file, enable the ACL option and set the ACL path from the previous step:
CONFIG_ACL_GPU=y
ACL_ROOT=/home/dolphin/ComputeLibrary
Build and install:
make -j4
make install
Download the MobileNet SSD model from Tengine_models | Baidu Cloud (extraction code: 57vb) into the tengine/models/ directory.
The example folder in the tengine directory contains a mobilenet_ssd subdirectory. Open its CMakeLists.txt and, before the line set( INSTALL_DIR ${TENGINE_DIR}/install/), add a statement setting the value of TENGINE_DIR:
set( TENGINE_DIR ~/work/Tengine )
After cmake finishes the configuration automatically, run make to build:
cmake .
make
Some environment variables need to be set before running:
export GPU_CONCAT=0      # disable running concat on the GPU, avoiding frequent data transfers between CPU and GPU
export ACL_FP16=1        # let the GPU run inference in the float16 data format
export REPEAT_COUNT=100  # repeat the run 100 times and take the average time as the performance figure
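The benchmark sweep in the session below (different TENGINE_CPU_LIST values, FP16 on and off) can also be driven from a small Python wrapper instead of exporting variables by hand. This is a sketch; the ./MSSD path and the variable names are taken from the session itself:

```python
# Build the command and environment for one MSSD benchmark configuration.
import os

def mssd_cmd(cpu_list=None, acl_fp16=True, use_gpu=False, repeat=100):
    """Return (argv, env) for one run; pass env to subprocess.run(cmd, env=env)."""
    env = dict(os.environ, GPU_CONCAT="0", REPEAT_COUNT=str(repeat))
    if acl_fp16:
        env["ACL_FP16"] = "1"
    else:
        env.pop("ACL_FP16", None)   # mirrors `unset ACL_FP16`
    if cpu_list is not None:
        env["TENGINE_CPU_LIST"] = ",".join(str(c) for c in cpu_list)
    cmd = ["./MSSD"] + (["-d", "acl_opencl"] if use_gpu else [])
    return cmd, env

# Example (not executed here): run with the GPU plus one A72 core
# cmd, env = mssd_cmd(cpu_list=[4], use_gpu=True)
# subprocess.run(cmd, env=env)
```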
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ ./MSSD
ACL Graph Initialized
Driver: ACLGraph probed 1 devices
repeat 100 times, avg time per run is 196.927 ms
detect result num: 3
dog :100%
BOX:( 138.509 , 209.394 ),( 324.57 , 541.314 )
car :100%
BOX:( 467.315 , 72.8045 ),( 687.269 , 171.128 )
bicycle :100%
BOX:( 107.395 , 140.657 ),( 574.212 , 415.188 )
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ export TENGINE_CPU_LIST=4
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ ./MSSD
ENV SET: [4]
ACL Graph Initialized
Driver: ACLGraph probed 1 devices
repeat 100 times, avg time per run is 313.66 ms
detect result num: 3
dog :100%
BOX:( 138.509 , 209.394 ),( 324.57 , 541.314 )
car :100%
BOX:( 467.315 , 72.8045 ),( 687.269 , 171.128 )
bicycle :100%
BOX:( 107.395 , 140.657 ),( 574.212 , 415.188 )
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ export TENGINE_CPU_LIST=4,5
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ ./MSSD
ENV SET: [4,5]
ACL Graph Initialized
Driver: ACLGraph probed 1 devices
repeat 100 times, avg time per run is 241.372 ms
detect result num: 3
dog :100%
BOX:( 138.509 , 209.394 ),( 324.57 , 541.314 )
car :100%
BOX:( 467.315 , 72.8045 ),( 687.269 , 171.128 )
bicycle :100%
BOX:( 107.395 , 140.657 ),( 574.212 , 415.188 )
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ export TENGINE_CPU_LIST=0,1,2,3
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ ./MSSD
ENV SET: [0,1,2,3]
ACL Graph Initialized
Driver: ACLGraph probed 1 devices
repeat 100 times, avg time per run is 221.02 ms
detect result num: 3
dog :100%
BOX:( 138.509 , 209.394 ),( 324.57 , 541.314 )
car :100%
BOX:( 467.315 , 72.8045 ),( 687.269 , 171.128 )
bicycle :100%
BOX:( 107.395 , 140.657 ),( 574.212 , 415.188 )
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ unset TENGINE_CPU_LIST
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ export GPU_CONCAT=0
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ export ACL_FP16=1
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ taskset 0x4 ./MSSD -d acl_opencl
/home/pi/work/Tengine/examples/mobilenet_ssd/MSSD
ACL Graph Initialized
Driver: ACLGraph probed 1 devices
repeat 100 times, avg time per run is 202.103 ms
detect result num: 3
dog :100%
BOX:( 138.419 , 209.091 ),( 324.504 , 541.568 )
car :100%
BOX:( 467.356 , 72.9224 ),( 687.269 , 171.123 )
bicycle :100%
BOX:( 107.053 , 140.221 ),( 574.472 , 415.248 )
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ unset ACL_FP16
pi@NanoPi-NEO4:~/work/Tengine/examples/mobilenet_ssd$ taskset 0x4 ./MSSD -d acl_opencl
/home/pi/work/Tengine/examples/mobilenet_ssd/MSSD
ACL Graph Initialized
Driver: ACLGraph probed 1 devices
repeat 100 times, avg time per run is 272.369 ms
detect result num: 3
dog :100%
BOX:( 138.509 , 209.394 ),( 324.57 , 541.314 )
car :100%
BOX:( 467.315 , 72.8045 ),( 687.269 , 171.128 )
bicycle :100%
BOX:( 107.395 , 140.657 ),( 574.212 , 415.188 )
Pass -d acl_opencl on the command line to enable the GPU.
As the output above shows, the detection results with the GPU running at half precision (float16) are correct.
The following table compares these runs against Tengine's pure-CPU MobileNet SSD inference:
Runtime configuration | Run time (ms) | vs. 4×A53 baseline |
---|---|---|
CPU: 2×A72 + 4×A53 | 190.927 | +14% |
CPU: 1×A72 | 313.66 | -42% |
CPU: 2×A72 | 241.372 | -9% |
CPU: 4×A53 | 221.02 | 0% |
GPU (FP16) + CPU: 1×A72 | 202.103 | +9% |
GPU (FP32) + CPU: 1×A72 | 272.369 | -23% |
As the table shows, GPU/CPU heterogeneous scheduling delivers roughly the performance of the two big A72 cores, or of the four little A53 cores, while using all six cores together is fastest.
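The percentages in the last column can be reproduced from the measured times, taking the 4×A53 run as the 0% baseline (positive means faster than the baseline):

```python
# Reproduce the "vs. baseline" column from the measured run times.
times_ms = {
    "2A72+4A53": 190.927,
    "1A72": 313.66,
    "2A72": 241.372,
    "4A53": 221.02,          # baseline
    "GPU FP16 + 1A72": 202.103,
    "GPU FP32 + 1A72": 272.369,
}

def relative_gain(t, baseline=221.02):
    """Percentage reduction in run time versus the baseline, rounded."""
    return round(100.0 * (baseline - t) / baseline)

for name, t in times_ms.items():
    print(f"{name}: {relative_gain(t):+d}%")
```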