How to use Ascend's ATB acceleration library?

1 Introduction

The Ascend Transformer Boost acceleration library (hereafter "the ATB library") is an efficient, reliable acceleration library built on Huawei Ascend AI processors, designed specifically for the training and inference of Transformer-class models. For details, see: What is ATB?

So how does a novice programmer implement an ATB operator?

2 Implementing an ATB operator

The following content is based on:

Operator Usage Guide - Acceleration Library Usage Guide - Ascend Transformer Boost - Domain Acceleration Libraries - CANN Commercial Edition 8.0.RC2.2 Developer Documentation - Ascend Community

Implementing an ATB operator takes roughly the following 10 steps, as shown in the figure below.


[Figure: the 10 steps of implementing an ATB operator]

step 1: Include the ACL and acceleration library interface headers

#include <acl/acl.h>
#include <atb/atb_infer.h>
#include <atb/types.h>
#include <atb/utils.h>
#include "atb/infer_op_params.h"

A few things to note here:

  • The ATB .so packages must be installed first; only then are the headers available and the program links without errors.
  • Different operators may require different headers.
  • Add any other headers you need yourself.

Reference:

Installation and Deployment - Ascend Transformer Boost - Domain Acceleration Libraries - CANN Commercial Edition 8.0.RC2.2 Developer Documentation - Ascend Community

step 2: Configure the deviceId

uint32_t deviceId = 0;
aclError status = aclrtSetDevice(deviceId);

Set deviceId according to your needs; for example, on a single machine with multiple cards, the available Ascend deviceIds are 0-7 (8 cards in total).
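If the id needs validating at runtime, the ACL runtime can report how many devices are visible. A minimal sketch using aclrtGetDeviceCount:

uint32_t deviceCount = 0;
aclError countStatus = aclrtGetDeviceCount(&deviceCount); // e.g. 8 on an 8-card server
if (countStatus != ACL_SUCCESS || deviceId >= deviceCount) {
    std::cout << "invalid deviceId " << deviceId << ", visible devices: " << deviceCount << std::endl;
}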

step 3: Create an operator object instance
As noted in "What is ATB?", ATB provides three kinds of operator implementations, described in turn below.

1. Basic Operation (native operators)

Step 1: Construct the Operation parameters

Instantiate the parameter struct for the operator to be created; the struct definitions can be found in atb/infer_op_params.h and atb/train_op_params.h.

Take the Mul operator as an example. Mul belongs to the Elewise category, and its parameters can be constructed as follows:

atb::infer::ElewiseParam mulParam;
mulParam.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_MUL;

Step 2: Create the operator object instance

atb::Operation *op = nullptr;
atb::Status st = atb::CreateOperation(mulParam, &op);

2. Plugin mechanism (plugin operators)

A plugin operator requires implementing the kernel in Ascend C or by some other means.

It is recommended to jump straight to Section 3.2 of this article.

Reference:

Plugin Mechanism - ATB Operators

Step 1: Develop the operator

This example creates an Add operator with Ascend C; you can implement custom operators by other means according to your needs.

See: kernel_add.cpp

plugin_op_demo/kernel/kernel_add.cpp · Si1verBul1et623548/atb-op-demo - Gitee (gitee.com)

Step 2: Create the operator object instance

CustomOperation *op = new CustomOperation("CustomOperation");
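Here CustomOperation is a user-defined class implementing the atb::Operation interface. The following is a hedged skeleton only, assuming the interface exposes the methods this article calls elsewhere (GetName, GetInputNum, GetOutputNum, InferShape, Setup, Execute); check atb/operation.h of your ATB release for the exact virtual method set and signatures:

// Hypothetical skeleton, not taken from the demo; verify against atb/operation.h before use.
#include <string>
#include <atb/atb_infer.h>

class CustomOperation : public atb::Operation {
public:
    explicit CustomOperation(const std::string &name) : name_(name) {}
    std::string GetName() const override { return name_; }
    uint32_t GetInputNum() const override { return 2; }  // e.g. an Add kernel consumes two tensors
    uint32_t GetOutputNum() const override { return 1; }
    atb::Status InferShape(const atb::SVector<atb::TensorDesc> &inTensorDescs,
                           atb::SVector<atb::TensorDesc> &outTensorDescs) const override
    {
        outTensorDescs.at(0) = inTensorDescs.at(0); // output mirrors input 0
        return atb::NO_ERROR;
    }
    atb::Status Setup(const atb::VariantPack &variantPack, uint64_t &workspaceSize,
                      atb::Context *context) override
    {
        workspaceSize = 0; // this kernel needs no extra device workspace
        return atb::NO_ERROR;
    }
    atb::Status Execute(const atb::VariantPack &variantPack, uint8_t *workspace,
                        uint64_t workspaceSize, atb::Context *context) override
    {
        // launch the Ascend C kernel (e.g. kernel_add.cpp) on the context's stream here
        return atb::NO_ERROR;
    }

private:
    std::string name_;
};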

3. Graph Frame (graph operators)

Graph operators can be created and used in two ways: by configuring TensorIds or by configuring TensorNames.

Given the following graph operator structure:

[Figure: graph operator structure - node 0 consumes two graph inputs; node 1 consumes node 0's output plus a third graph input and produces the graph output]

the TensorId and TensorName assignments can be worked out as follows:

Tensor              TensorId    TensorName
graph input 0       0           a
graph input 1       1           b
graph input 2       2           c
graph output        3           output
graph intermediate  4           a_add_b_output

Table 1: TensorId and TensorName mapping
Graph-building method 1: configure TensorIds

Step 1: Construct the Operation parameters

Unlike single-operator parameters, graph operator parameters carry graph-level information: the graph nodes and the numbers of input, output, and intermediate tensors.

First, from the designed graph structure, count the graph input tensors (say x of them), the graph output tensors (say y), and the graph intermediate tensors (say z). Graph input tensor ids take values in [0, x - 1], graph output tensor ids in [x, x + y - 1], and graph intermediate tensor ids in [x + y, x + y + z - 1]. See the Tensor and TensorId columns of Table 1 for the example mapping.

Then, configure each node: the already-created single-operator object instance it wraps, plus its input and output tensors. A node's inputs and outputs may be graph input, graph output, or intermediate tensors; pick each id from the range that matches the tensor's role in the graph.

The op0 and op1 in this example are created the same way as the single operators above.

atb::GraphParam graphParam;
graphParam.inTensorNum = 3;                 // number of input tensors of the graph
graphParam.outTensorNum = 1;                // number of output tensors of the graph
graphParam.internalTensorNum = 1;           // number of intermediate tensors of the graph
graphParam.nodes.resize(2);                 // number of nodes in the graph, i.e. how many single operators it contains
graphParam.nodes[0].operation = op0;        // single-operator object instance of node 0
graphParam.nodes[0].inTensorIds = {0, 1};   // ids of the input tensors consumed by node 0
graphParam.nodes[0].outTensorIds = {4};     // id of the output tensor produced by node 0
graphParam.nodes[1].operation = op1;        // single-operator object instance of node 1
graphParam.nodes[1].inTensorIds = {4, 2};   // ids of the input tensors consumed by node 1
graphParam.nodes[1].outTensorIds = {3};     // id of the output tensor produced by node 1

Step 2: Create the operator object instance

atb::Operation *op = nullptr;
atb::Status st = atb::CreateOperation(graphParam, &op);
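CreateOperation returns an atb::Status; checking it (and the returned pointer) before use avoids dereferencing a null operator. A minimal sketch:

if (st != atb::NO_ERROR || op == nullptr) {
    std::cout << "create graph operation failed!";
    exit(1);
}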

Graph-building method 2: configure TensorNames

Building a graph with TensorIds requires defining every id up front, which makes the process cumbersome. This method instead identifies each Tensor by a string name, which is more practical. See the Tensor and TensorName columns of Table 1 for the example mapping.

Step 1: Create the graph operator builder

atb::GraphOpBuilder* graphOpBuilder;
CreateGraphOpBuilder(&graphOpBuilder);

Step 2: Initialize the graph operator builder

// lambda that infers the output TensorDescs (DataType, Format, Shape, etc.) from the graph operator's input TensorDescs
atb::InferShapeFunc inferShapeFunc = [=](const atb::SVector<atb::TensorDesc> &inTensorDescs, atb::SVector<atb::TensorDesc> &outTensorDescs) {
    outTensorDescs.at(0) = inTensorDescs.at(0);
    return atb::NO_ERROR;
};
graphOpBuilder->Init("DemoGraphOperation", inferShapeFunc, {"a", "b", "c"}, {"output"});

Step 3: Build the graph with the graph operator builder

During graph construction, tensors can be reshaped by supplying lambda functions; the total element count must be identical before and after the reshape.
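For example, a minimal sketch that exposes a flattened 1-D view of a 2-D tensor before a node consumes it (the tensor names here are illustrative; the element count stays unchanged, as required):

atb::ReshapeFunc flattenFunc = [](const atb::Dims &oldShape, atb::Dims &newShape) {
    newShape.dimNum = 1;                                     // 2-D -> 1-D view
    newShape.dims[0] = oldShape.dims[0] * oldShape.dims[1];  // element count preserved
};
// register a reshaped view "a_flat" of tensor "a" for use in later AddOperation calls
graphOpBuilder->Reshape("a", flattenFunc, "a_flat");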

Single operators such as op0 are created the same way as described above.

graphOpBuilder->AddOperation(op0, {"a", "b"}, {"a_add_b_output"});
graphOpBuilder->AddOperation(op1, {"a_add_b_output", "c"}, {"output"});

Step 4: Build the graph operator

atb::Operation *op = graphOpBuilder->Build(); // check that op is not a null pointer before using it
DestroyGraphOpBuilder(graphOpBuilder); // destroy the graph operator builder

step 4: Create the input and output tensors and store them in a VariantPack
The VariantPack holds the lists of input and output tensors. Each input tensor passed in the VariantPack must have a data size greater than 0 and no larger than 256GB.

// Set the properties of each input tensor
void CreateInTensorDescs(atb::SVector<atb::TensorDesc> &intensorDescs) 
{
    for (size_t i = 0; i < intensorDescs.size(); i++) {
        intensorDescs.at(i).dtype = ACL_FLOAT16;
        intensorDescs.at(i).format = ACL_FORMAT_ND;
        intensorDescs.at(i).shape.dimNum = 2;
        intensorDescs.at(i).shape.dims[0] = 2;
        intensorDescs.at(i).shape.dims[1] = 2;
    }
}

// Configure each input tensor and allocate memory for it. The tensors here are set up by hand; in a real project you could convert from a torch Tensor or another simple data structure.
void CreateInTensors(atb::SVector<atb::Tensor> &inTensors, atb::SVector<atb::TensorDesc> &intensorDescs)
{
    std::vector<char> zeroData(8, 0); // a host buffer filled with zeros
    for (size_t i = 0; i < inTensors.size(); i++) {
        inTensors.at(i).desc = intensorDescs.at(i);
        inTensors.at(i).dataSize = atb::Utils::GetTensorSize(inTensors.at(i));
        int ret = aclrtMalloc(&inTensors.at(i).deviceData, inTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST); // allocate NPU memory
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
        ret = aclrtMemcpy(inTensors.at(i).deviceData, inTensors.at(i).dataSize, zeroData.data(), zeroData.size(), ACL_MEMCPY_HOST_TO_DEVICE); // copy host memory to the NPU side
    }
}

// Configure each output tensor and allocate memory for it, in the same way as the input tensors
void CreateOutTensors(atb::SVector<atb::Tensor> &outTensors, atb::SVector<atb::TensorDesc> &outtensorDescs)
{
    for (size_t i = 0; i < outTensors.size(); i++) {
        outTensors.at(i).desc = outtensorDescs.at(i);
        outTensors.at(i).dataSize = atb::Utils::GetTensorSize(outTensors.at(i));
        int ret = aclrtMalloc(&outTensors.at(i).deviceData, outTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }
}
// Construct all input and output tensors as above and store them in the VariantPack
atb::VariantPack pack;
atb::SVector<atb::TensorDesc> intensorDescs;
atb::SVector<atb::TensorDesc> outtensorDescs;

uint32_t inTensorNum = op->GetInputNum();
uint32_t outTensorNum = op->GetOutputNum();
pack.inTensors.resize(inTensorNum);
intensorDescs.resize(inTensorNum);

CreateInTensorDescs(intensorDescs);
CreateInTensors(pack.inTensors, intensorDescs);
    
outtensorDescs.resize(outTensorNum);
pack.outTensors.resize(outTensorNum);
op->InferShape(intensorDescs, outtensorDescs);
CreateOutTensors(pack.outTensors, outtensorDescs);
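Given the VariantPack constraint mentioned above (each input tensor's data size must be greater than 0 and at most 256GB), a hedged sanity check before calling Setup might look like this (MAX_TENSOR_SIZE is an illustrative name, not an ATB constant):

constexpr uint64_t MAX_TENSOR_SIZE = 256ULL * 1024 * 1024 * 1024; // the 256GB upper bound from the docs
for (size_t i = 0; i < pack.inTensors.size(); i++) {
    uint64_t size = pack.inTensors.at(i).dataSize;
    if (size == 0 || size > MAX_TENSOR_SIZE) {
        std::cout << "inTensor " << i << " has invalid dataSize: " << size << std::endl;
    }
}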

step 5: Create the context and configure the stream
The Context is mainly responsible for managing the streams used on the NPU.

atb::Context *context = nullptr;
st = atb::CreateContext(&context);

aclrtStream stream = nullptr;
status = aclrtCreateStream(&stream);
context->SetExecuteStream(stream);

step 6: Call the Setup interface to compute the workspace size

uint64_t workspaceSize = 0;
st = op->Setup(pack, workspaceSize, context);

step 7: Allocate NPU memory according to the workspace size

void *workspace = nullptr;
if (workspaceSize != 0) {
    status = aclrtMalloc(&workspace, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
    if (status != 0) {
        std::cout << "alloc error!";
        exit(0);
    }
}

When the workspace size is 0, this step must be skipped; attempting the allocation anyway will raise an error.

step 8: Call the Execute interface to run the operator

st = op->Execute(pack, (uint8_t *)workspace, workspaceSize, context);

step 9: Destroy the created objects and free the memory

// synchronize the stream, i.e. wait for the device-side computation to finish
auto ret = aclrtSynchronizeStream(stream);
if (ret != 0) {
    std::cout << "sync error!";
    exit(0);
}

status = aclrtDestroyStream(stream); // destroy the stream
st = atb::DestroyOperation(op);      // destroy the op object
st = atb::DestroyContext(context);   // destroy the context
// free the input tensors
for (size_t i = 0; i < pack.inTensors.size(); i++) {
    aclrtFree(pack.inTensors.at(i).deviceData);
}
// free the output tensors (the loop above frees every output tensor once; freeing any of them again would double-free)
for (size_t i = 0; i < pack.outTensors.size(); i++) {
    aclrtFree(pack.outTensors.at(i).deviceData);
}
status = aclrtFree(workspace);       // free the workspace
aclrtResetDevice(deviceId);          // reset the device

step 10: Run the demo
Compile the source file:

# compile the demo project with g++; demo.cpp is the demo source file
g++ -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "${ATB_HOME_PATH}/lib" -L "${ASCEND_HOME_PATH}/lib64" demo.cpp -l atb -l ascendcl -o demo

Here:

ATB_HOME_PATH: the installation path of the ATB library files.
ASCEND_HOME_PATH: the installation path of the CANN toolkit.
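Both variables are normally exported by the CANN and ATB environment setup scripts. A hedged example, assuming the default installation locations (adjust the paths to your environment):

# assumed default install paths; adjust to your environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh   # exports ASCEND_HOME_PATH, among others
source /usr/local/Ascend/nnal/atb/set_env.sh         # exports ATB_HOME_PATH, among others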

执行:

./demo # run the executable

3 Complete code files

3.1 Complete single-operator example

Name the file atb_mul_operation.cpp

// step1: include the ACL and acceleration library interface headers
#include <iostream>
#include <vector>
#include <acl/acl.h>
#include <atb/atb_infer.h>
#include <atb/types.h>
#include <atb/utils.h>
#include "atb/infer_op_params.h"


void CreateInTensorDescs(atb::SVector<atb::TensorDesc> &intensorDescs) 
{
    for (size_t i = 0; i < intensorDescs.size(); i++) {
        intensorDescs.at(i).dtype = ACL_FLOAT16;
        intensorDescs.at(i).format = ACL_FORMAT_ND;
        intensorDescs.at(i).shape.dimNum = 2;
        intensorDescs.at(i).shape.dims[0] = 2;
        intensorDescs.at(i).shape.dims[1] = 2;
    }
}

// Configure each input tensor and allocate memory for it. The tensors here are set up by hand; in a real project you could convert from a torch Tensor or another simple data structure.
void CreateInTensors(atb::SVector<atb::Tensor> &inTensors, atb::SVector<atb::TensorDesc> &intensorDescs)
{
    std::vector<char> zeroData(8, 0); // a host buffer filled with zeros
    for (size_t i = 0; i < inTensors.size(); i++) {
        inTensors.at(i).desc = intensorDescs.at(i);
        inTensors.at(i).dataSize = atb::Utils::GetTensorSize(inTensors.at(i));
        int ret = aclrtMalloc(&inTensors.at(i).deviceData, inTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST); // allocate NPU memory
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
        ret = aclrtMemcpy(inTensors.at(i).deviceData, inTensors.at(i).dataSize, zeroData.data(), zeroData.size(), ACL_MEMCPY_HOST_TO_DEVICE); // copy host memory to the NPU side
    }
}

// Configure each output tensor and allocate memory for it, in the same way as the input tensors
void CreateOutTensors(atb::SVector<atb::Tensor> &outTensors, atb::SVector<atb::TensorDesc> &outtensorDescs)
{
    for (size_t i = 0; i < outTensors.size(); i++) {
        outTensors.at(i).desc = outtensorDescs.at(i);
        outTensors.at(i).dataSize = atb::Utils::GetTensorSize(outTensors.at(i));
        int ret = aclrtMalloc(&outTensors.at(i).deviceData, outTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }
}

int main() {
    // step2: configure the deviceId
    uint32_t deviceId = 0;
    aclError status = aclrtSetDevice(deviceId);

    // step3: create the operator object instance; using Mul (an Elewise operator) as the example, construct its parameters as follows
    // Step 1: construct the Operation parameters
    atb::infer::ElewiseParam mulParam;
    mulParam.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_MUL;

    // Step 2: create the operator object instance
    atb::Operation *op = nullptr;
    atb::Status st = atb::CreateOperation(mulParam, &op);

    // step4: create the input/output tensors and store them in the VariantPack
    atb::VariantPack pack;
    atb::SVector<atb::TensorDesc> intensorDescs;
    atb::SVector<atb::TensorDesc> outtensorDescs;

    uint32_t inTensorNum = op->GetInputNum();
    uint32_t outTensorNum = op->GetOutputNum();
    pack.inTensors.resize(inTensorNum);
    intensorDescs.resize(inTensorNum);

    CreateInTensorDescs(intensorDescs);
    CreateInTensors(pack.inTensors, intensorDescs);
        
    outtensorDescs.resize(outTensorNum);
    pack.outTensors.resize(outTensorNum);
    op->InferShape(intensorDescs, outtensorDescs);
    CreateOutTensors(pack.outTensors, outtensorDescs);

    // step5: create the context and configure the stream
    atb::Context *context = nullptr;
    st = atb::CreateContext(&context);

    aclrtStream stream = nullptr;
    status = aclrtCreateStream(&stream);
    context->SetExecuteStream(stream);

    // step6: call the Setup interface to compute the workspace size
    uint64_t workspaceSize = 0;
    st = op->Setup(pack, workspaceSize, context);

    // step7: allocate NPU memory according to the workspace size
    void *workspace = nullptr;
    if (workspaceSize != 0) {
        status = aclrtMalloc(&workspace, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (status != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }

    // step8: call the Execute interface to run the operator
    st = op->Execute(pack, (uint8_t *)workspace, workspaceSize, context);

    // step9: destroy the created objects and free the memory
    // synchronize the stream, i.e. wait for the device-side computation to finish
    auto ret = aclrtSynchronizeStream(stream);
    if (ret != 0) {
        std::cout << "sync error!";
        exit(0);
    }

    status = aclrtDestroyStream(stream); // destroy the stream
    st = atb::DestroyOperation(op);      // destroy the op object
    st = atb::DestroyContext(context);   // destroy the context
    // free the input tensors
    for (size_t i = 0; i < pack.inTensors.size(); i++) {
        aclrtFree(pack.inTensors.at(i).deviceData);
    }
    // free the output tensors
    for (size_t i = 0; i < pack.outTensors.size(); i++) {
        aclrtFree(pack.outTensors.at(i).deviceData);
    }
    status = aclrtFree(workspace);       // free the workspace
    aclrtResetDevice(deviceId);          // reset the device

    return 0;
}

You can also refer to:

single_op_demo/single_op_demo.cpp · Si1verBul1et623548/atb-op-demo - Gitee (gitee.com)
Compile and run:

# compile the demo project with g++
g++ -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "${ATB_HOME_PATH}/lib" -L "${ASCEND_HOME_PATH}/lib64" atb_mul_operation.cpp -l atb -l ascendcl -o atb_mul_operation


# run the executable
./atb_mul_operation

3.2 Complete plugin-mechanism (plugin operator) example

Reference:

Si1verBul1et623548/atb-op-demo
gitee.com/geyunqi/atb-op-demo/tree/master/plugin_op_demo

Enter the plugin_op_demo directory and run:

bash run.sh

The output lands in plugin_op_demo/build:

total 68
drwxr-xr-x. 3 root root  4096 Sep 29 20:02 ./
drwxr-xr-x. 5 root root  4096 Sep 29 20:02 ../
-rw-r--r--. 1 root root 14543 Sep 29 20:02 CMakeCache.txt
drwxr-xr-x. 6 root root  4096 Sep 29 20:02 CMakeFiles/
-rw-r--r--. 1 root root  5773 Sep 29 20:02 Makefile
-rw-r--r--. 1 root root  1664 Sep 29 20:02 cmake_install.cmake
-rwxr-xr-x. 1 root root 27720 Sep 29 20:02 libplugin_add.so*

As you can see, the build currently produces a shared library (.so). The process, however, fully illustrates how to write a plugin-based single operator.

3.3 Graph Frame (graph operators)

3.3.1 Implementation with graph-building method 1: configure TensorIds

[Figure: graph structure - add1(a, b) and add2(c, d) feed add3, which produces the graph output]

Name the file atb_add_graph_by_tensor_id.cpp

// step1: include the ACL and acceleration library interface headers
#include <iostream>
#include <vector>
#include <acl/acl.h>
#include <atb/atb_infer.h>
#include <atb/types.h>
#include <atb/utils.h>
#include "atb/infer_op_params.h"


void CreateInTensorDescs(atb::SVector<atb::TensorDesc> &intensorDescs) 
{
    for (size_t i = 0; i < intensorDescs.size(); i++) {
        intensorDescs.at(i).dtype = ACL_FLOAT16;
        intensorDescs.at(i).format = ACL_FORMAT_ND;
        intensorDescs.at(i).shape.dimNum = 2;
        intensorDescs.at(i).shape.dims[0] = 2;
        intensorDescs.at(i).shape.dims[1] = 2;
    }
}

// Configure each input tensor and allocate memory for it. The tensors here are set up by hand; in a real project you could convert from a torch Tensor or another simple data structure.
void CreateInTensors(atb::SVector<atb::Tensor> &inTensors, atb::SVector<atb::TensorDesc> &intensorDescs)
{
    for (size_t i = 0; i < inTensors.size(); i++) {
        inTensors.at(i).desc = intensorDescs.at(i);
        inTensors.at(i).dataSize = atb::Utils::GetTensorSize(inTensors.at(i));
        std::vector<uint16_t> hostData(atb::Utils::GetTensorNumel(inTensors.at(i)), 2);   // a host buffer with every element set to the raw uint16 value 2
        int ret = aclrtMalloc(&inTensors.at(i).deviceData, inTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST); // allocate NPU memory
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
        ret = aclrtMemcpy(inTensors.at(i).deviceData, inTensors.at(i).dataSize, hostData.data(), hostData.size() * sizeof(uint16_t), ACL_MEMCPY_HOST_TO_DEVICE); // copy host memory to the NPU side
    }
}

// Configure each output tensor and allocate memory for it, in the same way as the input tensors
void CreateOutTensors(atb::SVector<atb::Tensor> &outTensors, atb::SVector<atb::TensorDesc> &outtensorDescs)
{
    for (size_t i = 0; i < outTensors.size(); i++) {
        outTensors.at(i).desc = outtensorDescs.at(i);
        outTensors.at(i).dataSize = atb::Utils::GetTensorSize(outTensors.at(i));
        int ret = aclrtMalloc(&outTensors.at(i).deviceData, outTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }
}


// Two points deserve attention when constructing the graph parameters.
// 1. Tensor ids: the ATB graph interface distinguishes input, output, and intermediate tensors.
//    Input and output tensors are the whole graph's inputs and outputs, while intermediate tensors
//    exist only inside the graph. Ids must be assigned in ascending order as inputs first, then
//    outputs, then intermediates, and the count of each kind must match the parameter settings.
// 2. Node order: lay the nodes out as a topologically sorted sequence of the computation graph,
//    and keep the tensor-to-node relationships consistent with the graph.
void CreateGraphOperation(atb::GraphParam &opGraph, atb::Operation **operation)
{
    // graph construction
    opGraph.inTensorNum = 4;
    opGraph.outTensorNum = 1;
    opGraph.internalTensorNum = 2;
    opGraph.nodes.resize(3);

    enum InTensorId {               // define the tensor ids
        IN_TENSOR_A = 0,
        IN_TENSOR_B,
        IN_TENSOR_C,
        IN_TENSOR_D,
        ADD3_OUT,
        ADD1_OUT,
        ADD2_OUT
    };

    size_t nodeId = 0;
    atb::Node &addNode = opGraph.nodes.at(nodeId++);
    atb::Node &addNode2 = opGraph.nodes.at(nodeId++);
    atb::Node &addNode3 = opGraph.nodes.at(nodeId++);

    atb::infer::ElewiseParam addParam;
    addParam.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_ADD;
    atb::Status status = atb::CreateOperation(addParam, &addNode.operation);
    addNode.inTensorIds = {IN_TENSOR_A, IN_TENSOR_B};
    addNode.outTensorIds = {ADD1_OUT};

    atb::infer::ElewiseParam addParam2;
    addParam2.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_ADD;
    status = atb::CreateOperation(addParam2, &addNode2.operation);
    addNode2.inTensorIds = {IN_TENSOR_C, IN_TENSOR_D};
    addNode2.outTensorIds = {ADD2_OUT};

    atb::infer::ElewiseParam addParam3;
    addParam3.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_ADD;
    status = atb::CreateOperation(addParam3, &addNode3.operation);
    addNode3.inTensorIds = {ADD1_OUT, ADD2_OUT};
    addNode3.outTensorIds = {ADD3_OUT};

    status = atb::CreateOperation(opGraph, operation);
}

void PrintOutTensorValue(atb::Tensor &outTensor)
{
    // copy the output tensor back to the host and print it (as raw uint16 values)
    std::vector<uint16_t> outBuffer(atb::Utils::GetTensorNumel(outTensor));
    int ret = aclrtMemcpy(outBuffer.data(), outBuffer.size() * sizeof(uint16_t), outTensor.deviceData, outTensor.dataSize, ACL_MEMCPY_DEVICE_TO_HOST);
    if (ret != 0) {
        std::cout << "copy error!";
        exit(0);
    }
    for (size_t i = 0; i < outBuffer.size(); i = i + 1) {
        std::cout << "out[" << i << "] = " << (uint32_t)outBuffer.at(i) << std::endl;
    }
}

int main() {
    // step2: configure the deviceId
    uint32_t deviceId = 0;
    aclError status = aclrtSetDevice(deviceId);

    // step3: create the graph operator object instance
    // Step 1: construct the Operation parameters
    atb::Operation *op = nullptr;
    atb::GraphParam opGraph;

    // Step 2: build the graph operation
    CreateGraphOperation(opGraph, &op);

    // step4: create the input/output tensors and store them in the VariantPack
    atb::VariantPack pack;
    atb::SVector<atb::TensorDesc> intensorDescs;
    atb::SVector<atb::TensorDesc> outtensorDescs;

    uint32_t inTensorNum = op->GetInputNum();
    uint32_t outTensorNum = op->GetOutputNum();
    pack.inTensors.resize(inTensorNum);
    intensorDescs.resize(inTensorNum);

    CreateInTensorDescs(intensorDescs);
    CreateInTensors(pack.inTensors, intensorDescs);
        
    outtensorDescs.resize(outTensorNum);
    pack.outTensors.resize(outTensorNum);
    op->InferShape(intensorDescs, outtensorDescs);
    CreateOutTensors(pack.outTensors, outtensorDescs);

    // step5: create the context and configure the stream
    atb::Context *context = nullptr;
    auto st = atb::CreateContext(&context);

    aclrtStream stream = nullptr;
    status = aclrtCreateStream(&stream);
    context->SetExecuteStream(stream);

    // step6: call the Setup interface to compute the workspace size
    uint64_t workspaceSize = 0;
    st = op->Setup(pack, workspaceSize, context);

    // step7: allocate NPU memory according to the workspace size
    void *workspace = nullptr;
    if (workspaceSize != 0) {
        status = aclrtMalloc(&workspace, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (status != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }

    // step8: call the Execute interface to run the operator
    st = op->Execute(pack, (uint8_t *)workspace, workspaceSize, context);

    // step9: destroy the created objects and free the memory
    // synchronize the stream, i.e. wait for the device-side computation to finish
    auto ret = aclrtSynchronizeStream(stream);
    if (ret != 0) {
        std::cout << "sync error!";
        exit(0);
    }

    // print the output tensor values
    PrintOutTensorValue(pack.outTensors.at(0));

    status = aclrtDestroyStream(stream); // destroy the stream
    st = atb::DestroyOperation(op);      // destroy the op object
    st = atb::DestroyContext(context);   // destroy the context
    // free the input tensors
    for (size_t i = 0; i < pack.inTensors.size(); i++) {
        aclrtFree(pack.inTensors.at(i).deviceData);
    }
    // free the output tensors
    for (size_t i = 0; i < pack.outTensors.size(); i++) {
        aclrtFree(pack.outTensors.at(i).deviceData);
    }
    status = aclrtFree(workspace);       // free the workspace
    aclrtResetDevice(deviceId);          // reset the device

    return 0;
}

Compile and run:

# compile the demo project with g++
g++ -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "${ATB_HOME_PATH}/lib" -L "${ASCEND_HOME_PATH}/lib64" atb_add_graph_by_tensor_id.cpp -l atb -l ascendcl -o atb_add_graph_by_tensor_id

# run the executable
./atb_add_graph_by_tensor_id

# If the run crashes with a coredump, try adding -D_GLIBCXX_USE_CXX11_ABI=0 to the g++ command, i.e.:
#g++ -D_GLIBCXX_USE_CXX11_ABI=0 -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "${ATB_HOME_PATH}/lib" -L "${ASCEND_HOME_PATH}/lib64" atb_add_graph_by_tensor_id.cpp -l atb -l ascendcl -o atb_add_graph_by_tensor_id

3.3.2 Implementation with graph-building method 2: configure TensorNames

[Figure: LlamaMlp graph structure - Linear -> Split -> Swish -> Mul]

Name the file atb_add_graph_by_tensor_name.cpp

// step1: include the ACL and acceleration library interface headers
#include <iostream>
#include <vector>
#include <acl/acl.h>
#include <atb/atb_infer.h>
#include <atb/types.h>
#include <atb/utils.h>
#include "atb/infer_op_params.h"


void CreateInTensorDescs(atb::SVector<atb::TensorDesc> &intensorDescs) 
{
    for (size_t i = 0; i < intensorDescs.size(); i++) {
        intensorDescs.at(i).dtype = ACL_FLOAT16;
        intensorDescs.at(i).format = ACL_FORMAT_ND;
        intensorDescs.at(i).shape.dimNum = 2;
        intensorDescs.at(i).shape.dims[0] = 2;
        intensorDescs.at(i).shape.dims[1] = 2;
    }
}

// Configure each input tensor and allocate memory for it. The tensors here are set up by hand; in a real project you could convert from a torch Tensor or another simple data structure.
void CreateInTensors(atb::SVector<atb::Tensor> &inTensors, atb::SVector<atb::TensorDesc> &intensorDescs)
{
    for (size_t i = 0; i < inTensors.size(); i++) {
        inTensors.at(i).desc = intensorDescs.at(i);
        inTensors.at(i).dataSize = atb::Utils::GetTensorSize(inTensors.at(i));
        std::vector<uint16_t> hostData(atb::Utils::GetTensorNumel(inTensors.at(i)), 2);   // a host buffer with every element set to the raw uint16 value 2
        int ret = aclrtMalloc(&inTensors.at(i).deviceData, inTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST); // allocate NPU memory
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
        ret = aclrtMemcpy(inTensors.at(i).deviceData, inTensors.at(i).dataSize, hostData.data(), hostData.size() * sizeof(uint16_t), ACL_MEMCPY_HOST_TO_DEVICE); // copy host memory to the NPU side
    }
}

// Configure each output tensor and allocate memory for it, in the same way as the input tensors
void CreateOutTensors(atb::SVector<atb::Tensor> &outTensors, atb::SVector<atb::TensorDesc> &outtensorDescs)
{
    for (size_t i = 0; i < outTensors.size(); i++) {
        outTensors.at(i).desc = outtensorDescs.at(i);
        outTensors.at(i).dataSize = atb::Utils::GetTensorSize(outTensors.at(i));
        int ret = aclrtMalloc(&outTensors.at(i).deviceData, outTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }
}

static uint64_t DIM3 = 3;

struct LlamaMlpParamGb {
    bool transpose = true;
};

atb::Operation* Linear(const LlamaMlpParamGb &param)
{
    atb::Operation* op = nullptr;
    atb::infer::LinearParam linearParam;
    linearParam.hasBias = false;
    linearParam.transposeB = param.transpose;
    CreateOperation(linearParam, &op);
    return op;
}

atb::Operation* Split(const LlamaMlpParamGb &param)
{
    atb::Operation* op = nullptr;
    atb::infer::SplitParam splitParam = {2, 2}; // split along axis 2 into 2 output tensors
    CreateOperation(splitParam, &op);
    return op;
}

atb::Operation* Swish(const LlamaMlpParamGb &param)
{
    atb::Operation* op = nullptr;
    atb::infer::ActivationParam activationParam;
    activationParam.activationType = atb::infer::ActivationType::ACTIVATION_SWISH;
    CreateOperation(activationParam, &op);
    return op;
}

atb::Operation* Mul(const LlamaMlpParamGb &param)
{
    atb::Operation* op = nullptr;
    atb::infer::ElewiseParam elewiseParam;
    elewiseParam.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_MUL;
    CreateOperation(elewiseParam, &op);
    return op;
}

atb::Status CreateLlamaMlpOperationByGraphOpBuilder(const LlamaMlpParamGb &param, atb::Operation **operation)
{
    atb::InferShapeFunc inferShapeFunc = [=](const atb::SVector<atb::TensorDesc> &inTensorDescs,
                                atb::SVector<atb::TensorDesc> &outTensorDescs) {
        outTensorDescs.at(0) = inTensorDescs.at(0);
        if (param.transpose == true) {
            outTensorDescs.at(0).shape.dimNum = DIM3;
            outTensorDescs.at(0).shape.dims[0] = inTensorDescs.at(0).shape.dims[0];
            outTensorDescs.at(0).shape.dims[1] = inTensorDescs.at(0).shape.dims[1];
            outTensorDescs.at(0).shape.dims[2] = inTensorDescs.at(1).shape.dims[0] / 2;
        } else {
            outTensorDescs.at(0).shape.dimNum = DIM3;
            outTensorDescs.at(0).shape.dims[0] = inTensorDescs.at(0).shape.dims[0];
            outTensorDescs.at(0).shape.dims[1] = inTensorDescs.at(0).shape.dims[1];
            outTensorDescs.at(0).shape.dims[2] = inTensorDescs.at(1).shape.dims[1] / 2;
        }
        return atb::NO_ERROR;
    };

    atb::ReshapeFunc reshape_01_2 = [](const atb::Dims &oldShape, atb::Dims &newShape) {
        newShape.dimNum = 2; // merge the first two axes: [d0, d1, d2] -> [d0 * d1, d2]
        newShape.dims[0] = oldShape.dims[0] * oldShape.dims[1];
        newShape.dims[1] = oldShape.dims[2]; // keep the last axis so the element count is preserved
    };
    atb::ReshapeFunc unsqueueze_0 = [](const atb::Dims &oldShape, atb::Dims &newShape) {
        newShape.dimNum = 3; // prepend a leading axis of size 1: [d0, d1] -> [1, d0, d1]
        newShape.dims[0] = 1;
        newShape.dims[1] = oldShape.dims[0];
        newShape.dims[2] = oldShape.dims[1];
    };
    atb::GraphOpBuilder* graphOpBuilder;
    CreateGraphOpBuilder(&graphOpBuilder);

    graphOpBuilder->Init(
        "LlamaMlpGraphOp",
        inferShapeFunc,
        {"hidden_states", "weight"},
        {"mlp_out"}
    );

    graphOpBuilder->Reshape("hidden_states", reshape_01_2, "hidden_states_");
    graphOpBuilder->AddOperation(Linear(param), {"hidden_states_", "weight"}, {"linear_out"});
    graphOpBuilder->Reshape("linear_out", unsqueueze_0, "linear_out_");
    graphOpBuilder->AddOperation(Split(param), {"linear_out_"}, {"gate_out", "up_out"});
    graphOpBuilder->AddOperation(Swish(param), {"gate_out"}, {"swish_out"});
    graphOpBuilder->AddOperation(Mul(param), {"swish_out", "up_out"}, {"mlp_out"});

    *operation = graphOpBuilder->Build();
    DestroyGraphOpBuilder(graphOpBuilder);
    return atb::NO_ERROR;
}

void PrintOutTensorValue(atb::Tensor &outTensor)
{
    // copy the output tensor back to the host and print it (as raw uint16 values)
    std::vector<uint16_t> outBuffer(atb::Utils::GetTensorNumel(outTensor));
    int ret = aclrtMemcpy(outBuffer.data(), outBuffer.size() * sizeof(uint16_t), outTensor.deviceData, outTensor.dataSize, ACL_MEMCPY_DEVICE_TO_HOST);
    if (ret != 0) {
        std::cout << "copy error!";
        exit(0);
    }
    for (size_t i = 0; i < outBuffer.size(); i = i + 1) {
        std::cout << "out[" << i << "] = " << (uint32_t)outBuffer.at(i) << std::endl;
    }
}

int main() {
    // step2: configure the deviceId
    uint32_t deviceId = 0;
    aclError status = aclrtSetDevice(deviceId);

    // step3: create the graph operator object instance
    // Step 1: construct the Operation parameters
    atb::Operation *op = nullptr;
    LlamaMlpParamGb mlpParam;

    // Step 2: build the graph operation with the GraphOpBuilder
    CreateLlamaMlpOperationByGraphOpBuilder(mlpParam, &op);

    // step4: create the input/output tensors and store them in the VariantPack
    atb::VariantPack pack;
    atb::SVector<atb::TensorDesc> intensorDescs;
    atb::SVector<atb::TensorDesc> outtensorDescs;

    uint32_t inTensorNum = op->GetInputNum();
    uint32_t outTensorNum = op->GetOutputNum();
    pack.inTensors.resize(inTensorNum);
    intensorDescs.resize(inTensorNum);

    CreateInTensorDescs(intensorDescs);
    CreateInTensors(pack.inTensors, intensorDescs);
        
    outtensorDescs.resize(outTensorNum);
    pack.outTensors.resize(outTensorNum);
    op->InferShape(intensorDescs, outtensorDescs);
    CreateOutTensors(pack.outTensors, outtensorDescs);

    // step5: create the context and configure the stream
    atb::Context *context = nullptr;
    auto st = atb::CreateContext(&context);

    aclrtStream stream = nullptr;
    status = aclrtCreateStream(&stream);
    context->SetExecuteStream(stream);

    // step6: call the Setup interface to compute the workspace size
    uint64_t workspaceSize = 0;
    st = op->Setup(pack, workspaceSize, context);

    // step7: allocate NPU memory according to the workspace size
    void *workspace = nullptr;
    if (workspaceSize != 0) {
        status = aclrtMalloc(&workspace, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (status != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }

    // step8: call the Execute interface to run the operator
    st = op->Execute(pack, (uint8_t *)workspace, workspaceSize, context);

    // step9: destroy the created objects and free the memory
    // synchronize the stream, i.e. wait for the device-side computation to finish
    auto ret = aclrtSynchronizeStream(stream);
    if (ret != 0) {
        std::cout << "sync error!";
        exit(0);
    }

    // print the output tensor values
    PrintOutTensorValue(pack.outTensors.at(0));

    status = aclrtDestroyStream(stream); // destroy the stream
    st = atb::DestroyOperation(op);      // destroy the op object
    st = atb::DestroyContext(context);   // destroy the context
    // free the input tensors
    for (size_t i = 0; i < pack.inTensors.size(); i++) {
        aclrtFree(pack.inTensors.at(i).deviceData);
    }
    // free the output tensors
    for (size_t i = 0; i < pack.outTensors.size(); i++) {
        aclrtFree(pack.outTensors.at(i).deviceData);
    }
    status = aclrtFree(workspace);       // free the workspace
    aclrtResetDevice(deviceId);          // reset the device

    return 0;
}

Compile and run:

# compile the demo project with g++
g++ -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "${ATB_HOME_PATH}/lib" -L "${ASCEND_HOME_PATH}/lib64" atb_add_graph_by_tensor_name.cpp -l atb -l ascendcl -o atb_add_graph_by_tensor_name

# run the executable
./atb_add_graph_by_tensor_name

# If the run crashes with a coredump, try adding -D_GLIBCXX_USE_CXX11_ABI=0 to the g++ command, i.e.:
#g++ -D_GLIBCXX_USE_CXX11_ABI=0 -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "${ATB_HOME_PATH}/lib" -L "${ASCEND_HOME_PATH}/lib64" atb_add_graph_by_tensor_name.cpp -l atb -l ascendcl -o atb_add_graph_by_tensor_name