Counting Out Time: Class Agnostic Video Repetition Counting in the Wild

Introduction

RepNet estimates the period of repeated actions in a video.

Contributions:

RepNet, a network that leverages a temporal self-similarity matrix.
A method to generate and augment synthetic repetition videos from unlabeled videos.
We use a combination of real and synthetic data to develop our model.

Method

Network: RepNet

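The input is a sequence of N frames:

$V = [v_1, v_2, \ldots, v_N]$

The encoder $\varphi$ maps the frames to per-frame embeddings, $X = \varphi(V)$:

$X = [x_1, x_2, \ldots, x_N]^T$

A self-similarity matrix $S$ is computed from $X$ and fed to the period prediction module, which produces two outputs per frame. The period length

$l=\psi(S)$

is the length of the repetition period in frames. The periodicity score

$p=\tau(S)$

indicates whether the action in the frame belongs to a repetition that should be counted.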

Encoder ($\varphi$)

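ResNet-50 extracts the image features. Each input frame is 112x112x3, and the output of the conv4_block3 layer is used as the 2D feature map (7x7x1024). The 2D feature maps of all frames are passed through one 3D convolution layer with channels=512, kernel_size=[3,3,3], activation_fn=ReLU, followed by a MaxPooling2D layer that yields the per-frame embeddings $\{x_{i}\mid i=1,2,\ldots,N\}$.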
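As a shape-level illustration only (not the actual network), the NumPy sketch below shows how a max-pool over the spatial grid collapses each 7x7 feature map into one 512-dimensional vector per frame. The random array is a stand-in for the 3D-convolution output, the function name `spatial_max_pool` is hypothetical, and pooling over the full 7x7 grid is an assumption inferred from the [64, 512] embedding shape mentioned in these notes.

```python
import numpy as np

def spatial_max_pool(features):
    """Collapse [N, H, W, C] feature maps to [N, C] by taking the max over H and W."""
    return features.max(axis=(1, 2))

# Stand-in for the 3D-convolution output: 64 frames, 7x7 grid, 512 channels.
rng = np.random.default_rng(0)
conv3d_out = rng.standard_normal((64, 7, 7, 512)).astype(np.float32)

embeddings = spatial_max_pool(conv3d_out)
print(embeddings.shape)  # (64, 512)
```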

TSM (Temporal Self-similarity Matrix)

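Similarity: $S_{ij}=f(x_i,x_j)$ with $f(a,b)=-{\|a-b\|}^2$, followed by a row-wise softmax. Because the TSM is a single-channel matrix, it is the "bottleneck" of the network that the paper refers to.
Implementation:
The encoder outputs the embedding features $X$ with shape [64, 512]: 64 frames, each represented by a one-dimensional feature vector. Let $a=\|x_i\|^2$ and $b=\|x_j\|^2$ be the sums of squares of the two vectors. Then $dist_{ij}=-(a-2x_i^{T}x_j+b)$ and $S=\mathrm{softmax}(dist/\rho)$, computed with tf.nn.softmax along each row, where $\rho$ is the softmax temperature.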
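The computation above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the repository's code: the function name `self_similarity` is hypothetical, the temperature value is arbitrary, and the embeddings are random placeholders.

```python
import numpy as np

def self_similarity(embs, temperature=1.0):
    """Return the [N, N] self-similarity matrix for embeddings of shape [N, D]."""
    sq_norms = np.sum(embs ** 2, axis=1)  # ||x_i||^2 for every frame, shape [N]
    # Squared Euclidean distance via ||a||^2 - 2 a.b + ||b||^2.
    dist = sq_norms[:, None] - 2.0 * (embs @ embs.T) + sq_norms[None, :]
    dist = np.maximum(dist, 0.0)  # clip tiny negatives caused by rounding
    sims = -dist / temperature
    sims = sims - sims.max(axis=1, keepdims=True)  # numerically stable softmax
    exp = np.exp(sims)
    return exp / exp.sum(axis=1, keepdims=True)  # each row sums to 1

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 512))     # placeholder embeddings
S = self_similarity(X, temperature=10.0)  # arbitrary temperature for the demo
print(S.shape)  # (64, 64)
```

Expanding the squared distance this way needs only one matrix product instead of an explicit loop over all frame pairs.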

Period Predictor

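Input: $S=[s_1,s_2,\ldots,s_N]^T$. The predictors $\psi$ and $\tau$ share part of the architecture: a 2D convolution layer (32 filters of size 3x3) followed by a transformer layer. Each of the two branches then uses two fully connected layers of size 512.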

Loss

$p=\tau(S)$: binary cross-entropy (a binary classification task).

$l=\psi(S)$: softmax cross-entropy (a multi-class classification task).

Classes: $L=\{2,3,\ldots,\frac{N}{2}\}$, where $N=64$ is the number of input frames.
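To make the two losses concrete, here is a small NumPy sketch. It is only an illustration: the function names are hypothetical, the logits are dummy values, and the convention that class index k stands for a period of k + 1 frames is an assumption chosen to match the `argmax(...) + 1` step in the scoring code of these notes.

```python
import numpy as np

def binary_cross_entropy(logits, labels):
    """Mean sigmoid cross-entropy over frames; labels are 0 or 1."""
    # Stable form: max(x, 0) - x * z + log(1 + exp(-|x|)).
    return np.mean(np.maximum(logits, 0) - logits * labels
                   + np.log1p(np.exp(-np.abs(logits))))

def softmax_cross_entropy(logits, class_ids):
    """Mean softmax cross-entropy; logits are [N, C], class_ids are [N] integers."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(class_ids)), class_ids])

N, num_classes = 64, 32
true_period = np.full(N, 8)        # every frame lies in a period of 8 frames
period_labels = true_period - 1    # assumed index convention: class k <-> period k + 1
periodicity_labels = np.ones(N)    # every frame is part of a repetition

# Untrained "predictions": all-zero logits.
period_loss = softmax_cross_entropy(np.zeros((N, num_classes)), period_labels)
periodicity_loss = binary_cross_entropy(np.zeros(N), periodicity_labels)
print(period_loss, periodicity_loss)
```

With all-zero logits the period loss equals log(32) and the periodicity loss equals log(2), the values expected from uniform predictions.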

Inference

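The continuous video is split into clips of N frames, and each clip is fed to RepNet. This gives a count value for every frame, $\frac{p_{i}}{l_{i}}$, and the overall count is $\sum _{i=1}^{N}\frac{p_{i}}{l_{i}}$.
(Tip: to improve robustness, the original video is sampled at several strides (x1, x2, x3, x4), and the stride with the highest prediction score is used for counting.)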
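A short worked example of the counting formula, using made-up per-frame predictions (the function name `count_repetitions` is hypothetical):

```python
def count_repetitions(periodicity, period_lengths):
    """Sum p_i / l_i over all frames."""
    return sum(p / l for p, l in zip(periodicity, period_lengths))

# 64 frames, all periodic (p_i = 1), period of 8 frames: 64 * (1 / 8) repetitions.
full = count_repetitions([1.0] * 64, [8] * 64)
print(full)  # 8.0

# Same clip, but the first 32 frames are not periodic (p_i = 0).
half = count_repetitions([0.0] * 32 + [1.0] * 32, [8] * 64)
print(half)  # 4.0
```

Each frame inside a repetition of length l contributes 1/l, so l consecutive frames add up to exactly one repetition.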
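Implementation:
The clips are fed to the network in batches, which gives the network outputs for every sampling stride. Each stride's result then has to be scored.
First, the periodicity score is obtained by applying a sigmoid to the output of the periodicity branch. Period-length prediction is a multi-class classification problem (2, 3, ..., 32), so a softmax over the period logits gives the class probabilities; the most likely class is the predicted period, and its probability is the prediction confidence. If the predicted period is less than 3, the confidence is set to 0.0. The periodicity score is then multiplied by the per-frame confidence and the square root of the product becomes the new periodicity score. The mean of these scores over all frames is the overall confidence for that sampling stride.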

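import tensorflow as tf

def get_score(period_score, within_period_score):
    """Combine the period and periodicity scores."""
    # Periodicity branch: logits [N, 1] -> per-frame probabilities [N].
    within_period_score = tf.nn.sigmoid(within_period_score)[:, 0]
    # Period branch: class index k corresponds to a period of k + 1 frames.
    per_frame_periods = tf.argmax(period_score, axis=-1) + 1
    # Confidence of the predicted period is its softmax probability.
    pred_period_conf = tf.reduce_max(
        tf.nn.softmax(period_score, axis=-1), axis=-1)
    # Periods shorter than 3 frames get zero confidence.
    pred_period_conf = tf.where(
        tf.math.less(per_frame_periods, 3), 0.0, pred_period_conf)
    # Per-frame score: square root of periodicity times period confidence.
    within_period_score = tf.sqrt(within_period_score * pred_period_conf)
    # The mean over frames is the overall score for this sampling stride.
    pred_score = tf.reduce_mean(within_period_score)
    return pred_score, within_period_score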
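To show how this score can drive the choice of sampling stride, the sketch below restates the scoring rule in NumPy and picks the stride with the highest overall score. It is an illustration under assumptions: `score_stride` and `pick_best_stride` are hypothetical names, the periodicity logits are taken as a flat [N] array, and the logits in the demo are invented.

```python
import numpy as np

def score_stride(period_logits, periodicity_logits):
    """NumPy restatement of the scoring rule for one sampling stride."""
    periodicity = 1.0 / (1.0 + np.exp(-periodicity_logits))  # sigmoid
    shifted = period_logits - period_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    periods = probs.argmax(axis=-1) + 1
    conf = np.where(periods < 3, 0.0, probs.max(axis=-1))
    per_frame = np.sqrt(periodicity * conf)
    return per_frame.mean(), per_frame

def pick_best_stride(outputs):
    """outputs maps stride -> (period_logits [N, C], periodicity_logits [N])."""
    scores = {s: score_stride(*logits)[0] for s, logits in outputs.items()}
    return max(scores, key=scores.get), scores

N, C = 64, 32
# Stride 1: flat period logits, so the predicted period is 1 and confidence is 0.
flat = (np.zeros((N, C)), np.full(N, 5.0))
# Stride 2: a clear peak at class index 7, i.e. a period of 8 sampled frames.
peaked_logits = np.zeros((N, C))
peaked_logits[:, 7] = 10.0
peaked = (peaked_logits, np.full(N, 5.0))

best, scores = pick_best_stride({1: flat, 2: peaked})
print(best, scores)
```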

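The result from the stride with the highest overall confidence is selected. Repeating the network outputs by the sampling stride restores the original video frame rate. Frames whose periodicity score passes the chosen threshold (0.5) are treated as belonging to the repeated action and are used for counting. In the implementation, a median filter is applied to the periodicity score after thresholding.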
Per-frame count:

$Counter_{i}=\frac{1}{stride\times l_{i}}$ if $l_{i}\ge 3$

$Counter_{i}=0$ if $l_{i}<3$

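A median filter is also applied to the per-frame counts.
The final prediction confidence is the mean over frames of the thresholded periodicity score. If this final confidence is below 0.2, the input is considered to contain no periodic action.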

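# Spread the total count uniformly over all frames, then accumulate it
# so that each frame carries the running count.
counts = len(frames) * [count / len(frames)]
sum_counts = np.cumsum(counts)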
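The post-processing steps described above (restore the frame rate, threshold, median filter, per-frame count, final confidence check) can be sketched in NumPy as below. This is a simplified illustration, not the repository's code: the function names are hypothetical, the window size of 5 and the edge padding of the median filter are assumptions, and the predictions in the demo are invented.

```python
import numpy as np

def median_filter(values, size=5):
    """Sliding-window median with edge padding (odd window size)."""
    half = size // 2
    padded = np.pad(np.asarray(values, dtype=float), half, mode="edge")
    return np.array([np.median(padded[i:i + size]) for i in range(len(values))])

def per_frame_counts(periods, periodicity, stride, threshold=0.5):
    """Turn the predictions of one stride into per-frame counts."""
    # Repeat every prediction `stride` times to restore the original frame rate.
    periods = np.repeat(np.asarray(periods), stride)
    periodicity = np.repeat(np.asarray(periodicity, dtype=float), stride)
    # Threshold the periodicity score, then smooth the binary mask.
    periodic = median_filter(periodicity > threshold) > 0.5
    # Each frame contributes 1 / (stride * l); periods below 3 contribute 0.
    counts = np.where(periods < 3, 0.0, 1.0 / (stride * np.maximum(periods, 1)))
    counts = median_filter(counts) * periodic
    return counts, periodic

# Demo: 32 predictions at stride 2, period of 4 sampled frames, confident periodicity.
counts, periodic = per_frame_counts([4] * 32, [0.9] * 32, stride=2)
confidence = periodic.mean()
total = counts.sum() if confidence >= 0.2 else 0.0
print(len(counts), total)

# Periods shorter than 3 frames never contribute to the count.
short_counts, _ = per_frame_counts([2] * 10, [0.9] * 10, stride=1)
```

In the demo every original frame contributes 1 / (2 * 4), so the 64 restored frames add up to 8 repetitions.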

Data generation

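A segment of a video is selected at random and repeated K times. Randomly reversed copies of the segment can be added along the way to imitate periodic motion. The pieces are then concatenated into one video, and data augmentation operations are applied at random during the process.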
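A minimal sketch of this generation procedure, assuming the video is already decoded into a NumPy array of frames. The function name, the parameter names, and their ranges are hypothetical, the 50% reversal probability is arbitrary, and the data augmentation step is left out.

```python
import numpy as np

def make_synthetic_repetition(frames, rng, min_len=4, max_len=16,
                              min_reps=2, max_reps=6):
    """Build a repeating clip from an unlabeled frame array of shape [T, ...]."""
    # Pick a random segment of the video.
    seg_len = int(rng.integers(min_len, max_len + 1))
    start = int(rng.integers(0, len(frames) - seg_len + 1))
    segment = frames[start:start + seg_len]
    if rng.random() < 0.5:
        # Append the time-reversed segment to imitate back-and-forth motion.
        segment = np.concatenate([segment, segment[::-1]], axis=0)
    # Repeat the segment K times and concatenate the copies.
    k = int(rng.integers(min_reps, max_reps + 1))
    clip = np.concatenate([segment] * k, axis=0)
    return clip, len(segment), k

rng = np.random.default_rng(0)
video = rng.standard_normal((100, 8, 8, 3))  # placeholder for decoded frames
clip, period, reps = make_synthetic_repetition(video, rng)
print(clip.shape, period, reps)
```

The returned period length and repetition count serve as free labels for the generated clip, which is what makes unlabeled videos usable for training.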

Implementation

Added: an option to remove the feature-extraction backbone and build the TSM from other features.
https://github.com/CvHadesSun/RepNet
