mmaction2 Deployment
We first deploy and test on Windows.
conda create -n mmaction2 --clone openmmlab
pip install -r requirements/build.txt
pip install -v -e .
Note: the mmcv-full version must be lower than 1.4.2.
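To double-check the environment, the installed versions can be printed (a minimal sketch; it assumes both packages expose a __version__ attribute, which they normally do):
# Quick sanity check of the installed versions.
import mmcv
import mmaction

print('mmcv-full:', mmcv.__version__)      # expected to be < 1.4.2 here
print('mmaction2:', mmaction.__version__)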
Test
import torch
from mmaction.apis import init_recognizer, inference_recognizer
config_file = 'configs/recognition/tsn/tsn_r50_video_inference_1x1x3_100e_kinetics400_rgb.py'
device = 'cuda:0' # or 'cpu'
device = torch.device(device)
model = init_recognizer(config_file, device=device)
# inference the demo video
inference_recognizer(model, 'demo/demo.mp4')
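The returned results can be mapped to readable class names with a label map, in the same (label index, score) format used later in this note. A minimal sketch, assuming the Kinetics-400 label map shipped with the repo; note that without a checkpoint passed to init_recognizer the scores are not meaningful:
# Map the returned (label index, score) pairs to class names.
labels = [x.strip() for x in open('tools/data/kinetics/label_map_k400.txt')]
results = inference_recognizer(model, 'demo/demo.mp4')
for idx, score in results:
    print(f'{labels[idx]}: {score:.4f}')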
Dataset Preparation
Something-Something is a large labeled dataset of humans performing actions with everyday objects, covering 174 action classes. The main difference between Something-V1 and Something-V2 is that V2 contains more videos, growing from 108,499 in V1 to 220,847. V2 download link: https://pan.baidu.com/s/1NCqL7JVoFZO6D131zGls-A
Extraction code: 07ka
For splitting the dataset, the split scripts released by the TSM authors are recommended; they make it easy to generate the training, validation, and test splits from the original csv/label files:
https://github.com/mit-han-lab/temporal-shift-module/tree/master/tools
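If you prefer not to use those scripts, equivalent mmaction2-style lists can be generated directly from the official annotation files. A minimal sketch, assuming the official V2 JSON annotations (something-something-v2-train.json / -validation.json with 'id' and 'template' fields, and labels.json mapping each template to a class index) and one <id>.webm per entry; if your copy ships csv files instead, adapt the parsing accordingly:
# Hedged sketch: build "name label" video lists from the SSv2 JSON annotations.
import json

with open('labels.json') as f:
    label2idx = json.load(f)  # assumed: {"Approaching something with your camera": "0", ...}

for split, out in [('something-something-v2-train.json', 'sthv2_train_list_videos.txt'),
                   ('something-something-v2-validation.json', 'sthv2_val_list_videos.txt')]:
    with open(split) as f, open(out, 'w') as w:
        for item in json.load(f):
            template = item['template'].replace('[', '').replace(']', '')
            w.write(f"{item['id']}.webm {label2idx[template]}\n")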
Concatenate and extract the dataset archives
cat 20bn-something-something-v2-?? | tar zx
Install ffmpeg
Download a static build locally
wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz
Extract it
tar -xvf ffmpeg-git-amd64-static.tar.xz
cd ffmpeg-git-20220302-amd64-static/
Use a script to batch-convert the videos into frames. It calls ffmpeg internally; the parameters can be adjusted in the cmd string.
from __future__ import print_function, division
import os
import sys
import subprocess
def class_process(dir_path, dst_dir_path):
    class_path = dir_path
    if not os.path.isdir(class_path):
        return
    dst_class_path = dst_dir_path
    if not os.path.exists(dst_class_path):
        os.mkdir(dst_class_path)

    for file_name in os.listdir(class_path):
        if '.webm' not in file_name:
            continue
        name, ext = os.path.splitext(file_name)
        dst_directory_path = os.path.join(dst_class_path, name)
        video_file_path = os.path.join(class_path, file_name)
        try:
            if os.path.exists(dst_directory_path):
                # incomplete extraction: remove and redo it
                if not os.path.exists(os.path.join(dst_directory_path, '000001.jpg')):
                    subprocess.call('rm -r \"{}\"'.format(dst_directory_path), shell=True)
                    print('remove {}'.format(dst_directory_path))
                    os.mkdir(dst_directory_path)
                else:
                    continue
            else:
                os.mkdir(dst_directory_path)
        except:
            print(dst_directory_path)
            continue
        # call ffmpeg to split the video into frames
        cmd = 'ffmpeg -i \"{}\" -vf scale=-1:240 \"{}/%06d.jpg\"'.format(video_file_path, dst_directory_path)
        print(cmd)
        # run the command
        subprocess.call(cmd, shell=True)
        print('\n')


if __name__ == "__main__":
    print("HELLO")
    dir_path = sys.argv[1]
    dst_dir_path = sys.argv[2]
    count = 0
    # note: since all SSv2 videos sit in one flat folder, class_process already
    # walks every file, so one call would suffice; this loop simply repeats it
    for class_name in os.listdir(dir_path):
        print(count)
        count = count + 1
        class_process(dir_path, dst_dir_path)
python video_jpg_ucf101_hmdb51.py /mnt/e/BaiduNetdiskDownload/somethingV2/20bn-something-something-v2/ /mnt/e/workspace/mmaction2/data/somethingv2/
This took roughly 4 to 5 days to run.
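The loop above extracts one video at a time. If that is too slow, the per-video ffmpeg calls can be run in parallel with a process pool; a hedged sketch (extract_one is a hypothetical helper mirroring the per-video logic above, not part of the original script):
# Hypothetical speed-up: run several ffmpeg extractions in parallel.
import os
import subprocess
from concurrent.futures import ProcessPoolExecutor


def extract_one(video_file_path, dst_directory_path):
    os.makedirs(dst_directory_path, exist_ok=True)
    cmd = 'ffmpeg -i "{}" -vf scale=-1:240 "{}/%06d.jpg"'.format(
        video_file_path, dst_directory_path)
    subprocess.call(cmd, shell=True)


def extract_all(src_dir, dst_dir, workers=8):
    tasks = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for file_name in os.listdir(src_dir):
            if not file_name.endswith('.webm'):
                continue
            name = os.path.splitext(file_name)[0]
            tasks.append(pool.submit(extract_one,
                                     os.path.join(src_dir, file_name),
                                     os.path.join(dst_dir, name)))
        for t in tasks:
            t.result()  # re-raise any worker error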
Training on the sthv2 data
_base_ = ['../../_base_/default_runtime.py']
# model settings
model = dict(
    type='Recognizer3D',
    backbone=dict(
        type='TimeSformer',
        pretrained=  # noqa: E251
        'https://download.openmmlab.com/mmaction/recognition/timesformer/vit_base_patch16_224.pth',  # noqa: E501
        num_frames=8,
        img_size=224,
        patch_size=16,
        embed_dims=768,
        in_channels=3,
        dropout_ratio=0.,
        transformer_layers=None,
        attention_type='divided_space_time',
        norm_cfg=dict(type='LN', eps=1e-6)),
    cls_head=dict(type='TimeSformerHead', num_classes=174, in_channels=768),
    # model training and testing settings
    train_cfg=None,
    test_cfg=dict(average_clips='prob'))
# dataset settings
# use the video format directly
dataset_type = 'VideoDataset'
data_root = 'data/sthv2/videos'
data_root_val = 'data/sthv2/videos'
ann_file_train = 'data/sthv2/sthv2_train_list_videos.txt'
ann_file_val = 'data/sthv2/sthv2_val_list_videos.txt'
ann_file_test = 'data/sthv2/sthv2_val_list_videos.txt'
# alternative: use the extracted rawframe format instead
# dataset_type = 'RawframeDataset'
# data_root = 'data/sthv2/rawframes'
# data_root_val = 'data/sthv2/rawframes'
# ann_file_train = 'data/sthv2/sthv2_train_list_rawframes.txt'
# ann_file_val = 'data/sthv2/sthv2_val_list_rawframes.txt'
# ann_file_test = 'data/sthv2/sthv2_val_list_rawframes.txt'
img_norm_cfg = dict(
    mean=[127.5, 127.5, 127.5], std=[127.5, 127.5, 127.5], to_bgr=False)
train_pipeline = [
    dict(type='DecordInit'),  # video input, so a decoder has to be initialised first
    # frame sampling: take one clip of 8 frames along the temporal axis,
    # with an interval of 30 frames between sampled frames
    dict(type='SampleFrames', clip_len=8, frame_interval=30, num_clips=1),
    # num_clips=N means the video is sampled N times and the N results are
    # ensembled at test time; here a single clip is used
    dict(type='DecordDecode'),  # decode the sampled video frames
    dict(type='RandomRescale', scale_range=(256, 320)),
    dict(type='RandomCrop', size=224),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),  # adjust the output shape
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),  # unify the data format
    dict(type='ToTensor', keys=['imgs', 'label'])  # convert to PyTorch tensors
]
val_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=8,
        frame_interval=30,
        num_clips=1,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='CenterCrop', crop_size=224),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=8,
        frame_interval=30,
        num_clips=1,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='ThreeCrop', crop_size=224),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
data = dict(
    videos_per_gpu=2,  # 2 videos per GPU, i.e. the per-GPU batch size
    workers_per_gpu=2,  # 2 dataloader workers per GPU
    test_dataloader=dict(videos_per_gpu=1),
    # train / val / test dataset configurations
    train=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        data_prefix=data_root,
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        data_prefix=data_root_val,
        pipeline=val_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=ann_file_test,
        data_prefix=data_root_val,
        pipeline=test_pipeline))
# evaluation metrics
evaluation = dict(
    interval=1, metrics=['top_k_accuracy', 'mean_class_accuracy'])
# optimizer used for training
optimizer = dict(
    type='SGD',
    lr=0.005 / 8 / 4,
    momentum=0.9,
    paramwise_cfg=dict(
        custom_keys={
            # decay_mult=0.0 disables weight decay for the class token and the
            # positional / temporal embeddings; the backbone itself is still trained
            '.backbone.cls_token': dict(decay_mult=0.0),
            '.backbone.pos_embed': dict(decay_mult=0.0),
            '.backbone.time_embed': dict(decay_mult=0.0)
        }),
    weight_decay=1e-4,
    nesterov=True)  # this lr is used for 8 gpus
optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
# learning policy (learning rate decay schedule)
# lr_config = dict(policy='CosineAnnealing', min_lr=0)
lr_config = dict(policy='step', step=[5, 10])
total_epochs = 15
# runtime settings
checkpoint_config = dict(interval=1)  # save a checkpoint every epoch
work_dir = './work_dirs/timesformer_divST_8x32x1_ssv2'
The learning rate is scaled with the number of GPUs and the batch size: the original setting was 8 GPUs with a batch size of 8 per GPU, while here it is 1 GPU with a batch size of 2, hence lr = 0.005 / 8 / 4.
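The same arithmetic written out, as a trivial sketch of the linear lr scaling rule:
# Linear scaling of the learning rate with the total batch size.
base_lr = 0.005      # reference lr for 8 GPUs x 8 videos per GPU = 64 videos per step
base_batch = 8 * 8
my_batch = 1 * 2     # 1 GPU x videos_per_gpu=2
lr = base_lr * my_batch / base_batch
print(lr)            # 0.005 / 8 / 4 = 0.00015625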
Here the video-format data is used and trained on directly.
Train and validate from scratch
python tools/train.py configs/recognition/timesformer/timesformer_divST_8x32x1_ssv2.py --work-dir work_dirs/timesformer_divST_8x32x1_ssv2 --gpus 0
If training on Windows without a GPU, set --gpus to 0.
A random seed can also be fixed:
python tools/train.py configs/recognition/timesformer/timesformer_divST_8x32x1_ssv2.py --work-dir work_dirs/timesformer_divST_8x32x1_ssv2 --validate --seed 0 --deterministic
Resume training from a checkpoint
python tools/train.py work_dirs/timesformer_divST_8x32x1_ssv2/timesformer_divST_8x32x1_ssv2.py --work-dir work_dirs/timesformer_divST_8x32x1_ssv2 --gpus 0 --resume-from work_dirs/timesformer_divST_8x32x1_ssv2/epoch_9.pth
Validation / testing
python tools/test.py configs/recognition/timesformer/timesformer_divST_8x32x1_ssv2.py work_dirs/timesformer_divST_8x32x1_ssv2/epoch_6.pth --eval top_k_accuracy mean_class_accuracy --out result6.json
The predictions are saved to result6.json via --out.
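The dumped file can then be inspected offline; a hedged sketch, assuming result6.json stores one score vector per test video in annotation order:
# Hedged sketch: load the dumped test scores and print each video's top-1 class index.
import json
import numpy as np

scores = json.load(open('result6.json'))  # assumed: list of per-video score vectors
for i, s in enumerate(scores[:5]):
    print(i, int(np.argmax(s)), max(s))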
Real-time inference from a webcam
python .\demo\webcam_demo.py .\work_dirs\timesformer_divST_8x32x1_ssv2\timesformer_divST_8x32x1_ssv2.py .\work_dirs\timesformer_divST_8x32x1_ssv2\epoch_15.pth .\tools\data\sthv2\label_map.txt --average-size 5 --threshold 0.2
Custom dataset
Using the tiny dataset as an example: there are only two classes, with 30 videos for training and 10 for validation/testing. Again, the videos are used directly for training.
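For the VideoDataset format used here, each line of the annotation file is simply "<relative video path> <label index>". A minimal sketch for generating such a list (the file names below are illustrative placeholders, not taken from the actual dataset):
# Build a VideoDataset annotation list, one "<video> <label>" per line.
video_label_pairs = [('video_a.mp4', 0), ('video_b.mp4', 1)]  # illustrative entries
with open('data/kinetics400_tiny/kinetics_tiny_train_video.txt', 'w') as f:
    for video, label in video_label_pairs:
        f.write(f'{video} {label}\n')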
TSN test
import os.path as osp
from mmaction.datasets import build_dataset
from mmaction.models import build_model
from mmaction.apis import train_model
import mmcv
from mmcv import Config
cfg = Config.fromfile('./configs/recognition/tsn/tsn_r50_video_1x1x8_100e_kinetics400_rgb.py')
from mmcv.runner import set_random_seed
# Modify dataset type and path
cfg.dataset_type = 'VideoDataset'
cfg.data_root = 'data/kinetics400_tiny/train/'
cfg.data_root_val = 'data/kinetics400_tiny/val/'
cfg.ann_file_train = 'data/kinetics400_tiny/kinetics_tiny_train_video.txt'
cfg.ann_file_val = 'data/kinetics400_tiny/kinetics_tiny_val_video.txt'
cfg.ann_file_test = 'data/kinetics400_tiny/kinetics_tiny_val_video.txt'
#cfg.data.videos_per_gpu=1
#cfg.data.workers_per_gpu=1
cfg.data.test.type = 'VideoDataset'
cfg.data.test.ann_file = 'data/kinetics400_tiny/kinetics_tiny_val_video.txt'
cfg.data.test.data_prefix = 'data/kinetics400_tiny/val/'
cfg.data.train.type = 'VideoDataset'
cfg.data.train.ann_file = 'data/kinetics400_tiny/kinetics_tiny_train_video.txt'
cfg.data.train.data_prefix = 'data/kinetics400_tiny/train/'
cfg.data.val.type = 'VideoDataset'
cfg.data.val.ann_file = 'data/kinetics400_tiny/kinetics_tiny_val_video.txt'
cfg.data.val.data_prefix = 'data/kinetics400_tiny/val/'
# The flag is used to determine whether it is omnisource training
cfg.setdefault('omnisource', False)
# Modify num classes of the model in cls_head
cfg.model.cls_head.num_classes = 2
# We can use the pre-trained TSN model
cfg.load_from = './checkpoints/tsn_r50_1x1x3_100e_kinetics400_rgb_20200614-e508be42.pth'
# Set up working dir to save files and logs.
cfg.work_dir = './test'
# The original learning rate (LR) is set for 8-GPU training with a large batch size.
# Divide it by 8 for the single GPU and by another 16 to match the reduced batch size.
cfg.data.videos_per_gpu = cfg.data.videos_per_gpu // 16
cfg.optimizer.lr = cfg.optimizer.lr / 8 / 16
cfg.total_epochs = 10
# We can set the checkpoint saving interval to reduce the storage cost
cfg.checkpoint_config.interval = 5
# We can set the log print interval to reduce the number of times the log is printed
cfg.log_config.interval = 5
# Set seed thus the results are more reproducible
cfg.seed = 0
set_random_seed(0, deterministic=False)
cfg.gpu_ids = range(1)
# Save the best
cfg.evaluation.save_best='auto'
# Build the dataset
datasets = [build_dataset(cfg.data.train)]
# Build the recognizer
model = build_model(cfg.model, train_cfg=cfg.get('train_cfg'), test_cfg=cfg.get('test_cfg'))
# Create work_dir
mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
train_model(model, datasets, cfg, distributed=False, validate=True)
from mmaction.apis import single_gpu_test
from mmaction.datasets import build_dataloader
from mmcv.parallel import MMDataParallel
# Build a test dataloader
dataset = build_dataset(cfg.data.test, dict(test_mode=True))
data_loader = build_dataloader(
    dataset,
    videos_per_gpu=1,
    workers_per_gpu=cfg.data.workers_per_gpu,
    dist=False,
    shuffle=False)
model = MMDataParallel(model, device_ids=[0])
outputs = single_gpu_test(model, data_loader)
eval_config = cfg.evaluation
eval_config.pop('interval')
eval_res = dataset.evaluate(outputs, **eval_config)
for name, val in eval_res.items():
    print(f'{name}: {val:.04f}')
Again, adjust the lr according to the number of GPUs and videos_per_gpu.
The point here was mainly to verify that the TSN pipeline runs end to end.
Next comes the main part: training with TimeSformer.
timesformer_divST_8x32x1_15e_kinetics_tiny.py
_base_ = ['../../_base_/runtimetiny.py']
# model settings
model = dict(
    type='Recognizer3D',
    backbone=dict(
        type='TimeSformer',
        pretrained=  # noqa: E251
        'https://download.openmmlab.com/mmaction/recognition/timesformer/vit_base_patch16_224.pth',  # noqa: E501
        num_frames=8,
        img_size=224,
        patch_size=16,
        embed_dims=768,
        in_channels=3,
        dropout_ratio=0.,
        transformer_layers=None,
        attention_type='divided_space_time',
        norm_cfg=dict(type='LN', eps=1e-6)),
    cls_head=dict(type='TimeSformerHead', num_classes=2, in_channels=768),
    # model training and testing settings
    train_cfg=None,
    test_cfg=dict(average_clips='prob'))
# dataset settings
dataset_type = 'VideoDataset'
data_root = 'data/kinetics400_tiny/train'
data_root_val = 'data/kinetics400_tiny/val'
ann_file_train = 'data/kinetics400_tiny/kinetics_tiny_train_video.txt'
ann_file_val = 'data/kinetics400_tiny/kinetics_tiny_val_video.txt'
ann_file_test = 'data/kinetics400_tiny/kinetics_tiny_val_video.txt'
img_norm_cfg = dict(
    mean=[127.5, 127.5, 127.5], std=[127.5, 127.5, 127.5], to_bgr=False)
train_pipeline = [
    dict(type='DecordInit'),
    dict(type='SampleFrames', clip_len=8, frame_interval=32, num_clips=1),
    dict(type='DecordDecode'),
    dict(type='RandomRescale', scale_range=(256, 320)),
    dict(type='RandomCrop', size=224),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
val_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=8,
        frame_interval=32,
        num_clips=1,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='CenterCrop', crop_size=224),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=8,
        frame_interval=32,
        num_clips=1,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='ThreeCrop', crop_size=224),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
data = dict(
    videos_per_gpu=2,
    workers_per_gpu=2,
    test_dataloader=dict(videos_per_gpu=1),
    train=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        data_prefix=data_root,
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        data_prefix=data_root_val,
        pipeline=val_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=ann_file_test,
        data_prefix=data_root_val,
        pipeline=test_pipeline))
evaluation = dict(
    interval=1, metrics=['top_k_accuracy', 'mean_class_accuracy'])
# optimizer
optimizer = dict(
    type='SGD',
    lr=0.005 / 8,
    momentum=0.9,
    paramwise_cfg=dict(
        custom_keys={
            '.backbone.cls_token': dict(decay_mult=0.0),
            '.backbone.pos_embed': dict(decay_mult=0.0),
            '.backbone.time_embed': dict(decay_mult=0.0)
        }),
    weight_decay=1e-4,
    nesterov=True)  # this lr is used for 8 gpus
optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
# learning policy
lr_config = dict(policy='step', step=[5, 8])
total_epochs = 10
# runtime settings
checkpoint_config = dict(interval=1)
work_dir = './work_dirs/timesformer_divST_8x32x1_15e_kinetics_tiny'
python tools/train.py configs/recognition/timesformer/timesformer_divST_8x32x1_15e_kinetics_tiny.py --gpus 0
Inference test
cat tinyinfer.py
from mmaction.apis import inference_recognizer, init_recognizer
import os
# Choose to use a config and initialize the recognizer
config = 'configs/recognition/timesformer/timesformer_divST_8x32x1_15e_kinetics_tiny.py'
# Setup a checkpoint file to load
checkpoint = 'work_dirs/timesformer_divST_8x32x1_15e_kinetics_tiny/epoch_10.pth'
# Initialize the recognizer
model = init_recognizer(config, checkpoint, device='cuda:0')
# Use the recognizer to do inference
label = 'tools/data/kinetics/label_map_k2.txt'
labels = open(label).readlines()
labels = [x.strip() for x in labels]
path = 'data/kinetics400_tiny/val'
for root, dirs, names in os.walk(path):
    for name in names:
        ext = os.path.splitext(name)[1]
        if ext == '.mp4':
            video = os.path.join(root, name)
            results = inference_recognizer(model, video)
            # labels = open(label).readlines()
            # labels = [x.strip() for x in labels]
            results = [(labels[k[0]], k[1]) for k in results]
            print(name)
            for result in results:
                print(f'{result[0]}: ', result[1])
A self-defined label file: 0 is climbing a rope, 1 is blowing glass.
cat tools/data/kinetics/label_map_k2.txt
climbing a rope
blowing glass
This prints, for each video, the predicted probability of the two classes.
Logs
python tools/analysis/analyze_logs.py plot_curve work_dirs/timesformer_divST_8x32x1_15e_kinetics_tiny/20220403_010309.log.json --keys top1_acc --out acc1.pdf
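If finer control over the plot is needed, the .log.json file can also be parsed directly; each line is a JSON dict, and the hedged sketch below assumes validation records carry a 'mode' of 'val' plus 'epoch' and 'top1_acc' fields:
# Hedged sketch: parse the training log and plot validation top-1 accuracy.
import json
import matplotlib.pyplot as plt

epochs, acc = [], []
with open('work_dirs/timesformer_divST_8x32x1_15e_kinetics_tiny/20220403_010309.log.json') as f:
    for line in f:
        rec = json.loads(line)
        if rec.get('mode') == 'val' and 'top1_acc' in rec:
            epochs.append(rec['epoch'])
            acc.append(rec['top1_acc'])
plt.plot(epochs, acc, marker='o')
plt.xlabel('epoch')
plt.ylabel('top1_acc')
plt.savefig('acc1_manual.pdf')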
Log analysis
root@83c3d6970b59:/workspace# python tools/analysis/analyze_logs.py cal_train_time work_dirs/timesformer_divST_8x32x1_15e_kinetics_tiny/20220403_010309.log.json
-----Analyze train time of work_dirs/timesformer_divST_8x32x1_15e_kinetics_tiny/20220403_010309.log.json-----
slowest epoch 5, average time is 0.8540
fastest epoch 4, average time is 0.8354
time std over epochs is 0.0063
average iter time: 0.8425 s/iter
Model complexity analysis
tools/analysis/get_flops.py is a script adapted from the flops-counter.pytorch library; it computes the FLOPs and parameter count of a model for a given input shape.
python tools/analysis/get_flops.py ${CONFIG_FILE} [--shape ${INPUT_SHAPE}]
Other deployment-related tools
Model conversion
- Export MMAction2 models to ONNX format (experimental)
The tools/deployment/pytorch2onnx.py script converts a model to ONNX format. It can also compare the outputs of the PyTorch model and the ONNX model to verify that they match. This feature depends on onnx and onnxruntime; install them first with pip install onnx onnxruntime. Note that the --softmax option appends a Softmax layer to the recognizer so that the predictions fall within [0, 1].
For action recognition models, run:
python tools/deployment/pytorch2onnx.py $CONFIG_PATH $CHECKPOINT_PATH --shape $SHAPE --verify
For temporal action detection (localization) models, run:
python tools/deployment/pytorch2onnx.py $CONFIG_PATH $CHECKPOINT_PATH --is-localizer --shape $SHAPE --verify
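After export, the ONNX model can be loaded with onnxruntime to sanity-check its input and output shapes; a minimal sketch, where recognizer.onnx is just a placeholder for whatever output path you passed to the export script:
# Inspect and run the exported ONNX recognizer with onnxruntime.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession('recognizer.onnx')  # placeholder path
inp = sess.get_inputs()[0]
print(inp.name, inp.shape)  # check the expected input layout first

# Replace dynamic dimensions with 1 just to build a dummy tensor of the right rank.
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
dummy = np.random.rand(*shape).astype(np.float32)
scores = sess.run(None, {inp.name: dummy})[0]
print(scores.shape)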
- Publish a model
tools/deployment/publish_model.py
The script prepares a model for publishing. It mainly:
(1) converts the model weights to CPU tensors, (2) removes the optimizer state, and (3) computes the hash of the checkpoint file and appends it to the filename.
python tools/deployment/publish_model.py ${INPUT_FILENAME} ${OUTPUT_FILENAME}
For example,
python tools/deployment/publish_model.py work_dirs/tsn_r50_1x1x3_100e_kinetics400_rgb/latest.pth tsn_r50_1x1x3_100e_kinetics400_rgb.pth
The final output filename will be tsn_r50_1x1x3_100e_kinetics400_rgb-{hash id}.pth.
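Conceptually those three steps look roughly like the following (a hedged sketch, not the actual script):
# Rough sketch of what publishing does: strip optimizer state, save CPU weights,
# and append a short hash of the resulting file to its name.
import hashlib
import shutil
import torch

ckpt = torch.load('latest.pth', map_location='cpu')  # (1) load weights on CPU
ckpt.pop('optimizer', None)                          # (2) drop optimizer state
torch.save(ckpt, 'published.pth')

sha = hashlib.sha256(open('published.pth', 'rb').read()).hexdigest()[:8]  # (3) hash
shutil.move('published.pth', f'tsn_r50_1x1x3_100e_kinetics400_rgb-{sha}.pth')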
- Metric evaluation
tools/analysis/eval_metric.py
The script computes a given evaluation metric from a config file and the corresponding result file. The result file is generated by tools/test.py (via --out ${RESULT_FILE}) and stores the model's predictions on the specified dataset.
python tools/analysis/eval_metric.py ${CONFIG_FILE} ${RESULT_FILE} [--eval ${EVAL_METRICS}] [--cfg-options ${CFG_OPTIONS}] [--eval-options ${EVAL_OPTIONS}]
- Print the full config
tools/analysis/print_config.py
The script parses all input arguments and prints the complete config.
python tools/print_config.py ${CONFIG} [-h] [--options ${OPTIONS [OPTIONS...]}]
- Check videos
The tools/analysis/check_videos.py script uses the specified video decoder to iterate over every sample of the video dataset in the given config, looking for invalid video files (corrupted or missing) and writing their paths to an output file. Note that after removing invalid videos, the video file list needs to be regenerated.
python tools/analysis/check_videos.py ${CONFIG} [-h] [--options OPTIONS [OPTIONS ...]] [--cfg-options CFG_OPTIONS [CFG_OPTIONS ...]] [--output-file OUTPUT_FILE] [--split SPLIT] [--decoder ]