YOLOv3: An Incremental Improvement
Abstract
We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that’s pretty swell. It’s a little bigger than last time but more accurate. It’s still fast though, don’t worry. At 320 × 320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 AP50 in 51 ms on a Titan X, compared to 57.5 AP50 in 198 ms by RetinaNet, similar performance but 3.8× faster. As always, all the code is online at https://pjreddie.com/yolo/.
1. Introduction
Sometimes you just kinda phone it in for a year, you know? I didn’t do a whole lot of research this year. Spent a lot of time on Twitter. Played around with GANs a little. I had a little momentum left over from last year [10] [1]; I managed to make some improvements to YOLO. But, honestly, nothing like super interesting, just a bunch of small changes that make it better. I also helped out with other people’s research a little.
Actually, that’s what brings us here today. We have a camera-ready deadline and we need to cite some of the random updates I made to YOLO but we don’t have a source. So get ready for a TECH REPORT!
The great thing about tech reports is that they don’t need intros, y’all know why we’re here. So the end of this introduction will signpost for the rest of the paper. First we’ll tell you what the deal is with YOLOv3. Then we’ll tell you how we do. We’ll also tell you about some things we tried that didn’t work. Finally we’ll contemplate what this all means.
2. The Deal
So here’s the deal with YOLOv3: We mostly took good ideas from other people. We also trained a new classifier network that’s better than the other ones. We’ll just take you through the whole system from scratch so you can understand it all.
2.1 Bounding Box Prediction
Following YOLO9000 our system predicts bounding boxes using dimension clusters as anchor boxes [13]. The network predicts 4 coordinates for each bounding box, $t_x$, $t_y$, $t_w$, $t_h$. If the cell is offset from the top left corner of the image by $(c_x, c_y)$ and the bounding box prior has width and height $p_w$, $p_h$, then the predictions correspond to:
$$b_x = \sigma(t_x) + c_x$$
$$b_y = \sigma(t_y) + c_y$$
$$b_w = p_w e^{t_w}$$
$$b_h = p_h e^{t_h}$$
During training we use sum of squared error loss. If the ground truth for some coordinate prediction is $\hat{t}_*$, our gradient is the ground truth value (computed from the ground truth box) minus our prediction: $\hat{t}_* - t_*$. This ground truth value can be easily computed by inverting the equations above.
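As a minimal sketch of the box equations and their inversion, the snippet below decodes raw outputs into a box and then recovers the targets from that box. The function names `decode_box` and `encode_box` are hypothetical, and the example assumes coordinates are measured in grid-cell units (the prior 116 × 90 divided by an assumed stride of 32); neither detail is stated in the text above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(t, cell, prior):
    """Map raw outputs (tx, ty, tw, th) to a box (bx, by, bw, bh)."""
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = prior
    return (sigmoid(tx) + cx, sigmoid(ty) + cy,
            pw * math.exp(tw), ph * math.exp(th))

def encode_box(b, cell, prior):
    """Invert decode_box: recover the target values from a ground truth box."""
    bx, by, bw, bh = b
    cx, cy = cell
    pw, ph = prior
    # The logit is the inverse of the sigmoid, so the center offsets
    # must lie strictly inside (0, 1), i.e. inside the cell.
    ox, oy = bx - cx, by - cy
    return (math.log(ox / (1.0 - ox)), math.log(oy / (1.0 - oy)),
            math.log(bw / pw), math.log(bh / ph))

# Assumed example values: cell (6, 4), prior 116x90 pixels at stride 32.
t = (0.2, -0.4, 0.1, -0.3)
cell, prior = (6, 4), (116 / 32, 90 / 32)
box = decode_box(t, cell, prior)
recovered = encode_box(box, cell, prior)
```

Because the sigmoid keeps the predicted center inside its cell, `recovered` matches `t` up to floating point error.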
YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold we ignore the prediction, following [15]. We use the threshold of .5. Unlike [15] our system only assigns one bounding box prior for each ground truth object. If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness.
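The assignment rule can be sketched roughly as follows. This is a simplification under stated assumptions: overlap is computed between box shapes only, as if the prior and the ground truth box shared a center, and the helper names `shape_iou` and `assign_objectness` are hypothetical rather than taken from the Darknet code.

```python
def shape_iou(a, b):
    """IoU of two boxes given as (w, h), assuming they share a center."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def assign_objectness(gt_wh, priors, ignore_thresh=0.5):
    """Label each prior for a single ground truth box."""
    ious = [shape_iou(gt_wh, p) for p in priors]
    best = max(range(len(priors)), key=lambda i: ious[i])
    labels = []
    for i, iou in enumerate(ious):
        if i == best:
            labels.append("positive")  # objectness target 1, plus coordinate and class loss
        elif iou > ignore_thresh:
            labels.append("ignore")    # overlaps well but is not the best: no loss
        else:
            labels.append("negative")  # objectness loss only, target 0
    return labels

priors = [(116, 90), (156, 198), (373, 326)]
labels = assign_objectness((130, 130), priors)
```

For the 130 × 130 ground truth box, the first prior overlaps most and is the single positive, the second exceeds the .5 threshold and is ignored, and the third is a negative.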
2.2 Class Prediction
Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax as we have found it is unnecessary for good performance, instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions.
This formulation helps when we move to more complex domains like the Open Images Dataset [5]. In this dataset there are many overlapping labels (e.g. Woman and Person). Using a softmax imposes the assumption that each box has exactly one class which is often not the case. A multilabel approach better models the data.
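To illustrate why independent logistic classifiers suit overlapping labels, here is a small sketch comparing them with a softmax on the same logits. The class names and logit values are made up for illustration, and `bce_loss` is a hypothetical helper, not the Darknet implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bce_loss(logits, targets):
    """Sum of per-class binary cross-entropy over independent logistic outputs."""
    loss = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss

classes = ["Person", "Woman", "Car"]       # illustrative labels
logits = [4.0, 3.5, -5.0]                  # illustrative raw scores for one box
independent = [sigmoid(z) for z in logits]  # each class scored on its own
competing = softmax(logits)                 # classes forced to share probability mass
loss = bce_loss(logits, [1, 1, 0])          # both Person and Woman are true
```

With independent sigmoids both Person and Woman can score above 0.9, whereas the softmax must split the probability mass between them.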
2.3 Predictions Across Scale
YOLOv3 predicts boxes at 3 different scales. Our system extracts features from those scales using a similar concept to feature pyramid networks [6]. From our base feature extractor we add several convolutional layers. The last of these predicts a 3-d tensor encoding bounding box, objectness, and class predictions. In our experiments with COCO [8] we predict 3 boxes at each scale so the tensor is $N \times N \times [3 \cdot (4 + 1 + 80)]$ for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.
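The output tensor size can be worked out with a few lines of arithmetic. The sketch below assumes a 416 × 416 input and strides of 32, 16, and 8 for the three scales; those values are assumptions for illustration and are not given in the text above, while the depth formula comes directly from it.

```python
def output_shape(input_size, stride, boxes_per_scale=3, num_classes=80):
    """Shape of the prediction tensor at one scale: N x N x [boxes * (4 + 1 + classes)]."""
    n = input_size // stride
    depth = boxes_per_scale * (4 + 1 + num_classes)
    return (n, n, depth)

# Assumed input size and strides (not stated in the text above).
shapes = [output_shape(416, stride) for stride in (32, 16, 8)]
total_boxes = sum(n * m * 3 for n, m, _ in shapes)
```

Under these assumptions each scale has depth 255 and the grids are 13, 26, and 52 cells wide, so each scale is twice the size of the previous one.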
Next we take the feature map from 2 layers previous and upsample it by 2×. We also take a feature map from earlier in the network and merge it with our upsampled features using concatenation. This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size.
We perform the same design one more time to predict boxes for the final scale. Thus our predictions for the 3rd scale benefit from all the prior computation as well as fine-grained features from early on in the network.
We still use k-means clustering to determine our bounding box priors. We just sort of chose 9 clusters and 3 scales arbitrarily and then divided up the clusters evenly across scales. On the COCO dataset the 9 clusters were: (10 × 13), (16 × 30), (33 × 23), (30 × 61), (62 × 45), (59 × 119), (116 × 90), (156 × 198), (373 × 326).
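Dividing the nine clusters evenly across three scales might look like the sketch below. Sorting by area and giving the smallest priors to the first group is an assumption made for illustration; the text only says the clusters are divided evenly. The function name `split_across_scales` is hypothetical.

```python
# The nine COCO priors listed above, as (width, height) in pixels.
priors = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
          (59, 119), (116, 90), (156, 198), (373, 326)]

def split_across_scales(priors, num_scales=3):
    """Sort priors by area and cut them into equally sized groups."""
    ordered = sorted(priors, key=lambda wh: wh[0] * wh[1])
    per_scale = len(ordered) // num_scales
    return [ordered[i * per_scale:(i + 1) * per_scale]
            for i in range(num_scales)]

groups = split_across_scales(priors)
small, medium, large = groups
```

A natural pairing, again as an assumption, is to use the smallest group on the finest grid and the largest group on the coarsest grid.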
2.4 Feature Extractor
We use a new network for performing feature extraction. Our new network is a hybrid approach between the network used in YOLOv2, Darknet-19, and that newfangled residual network stuff. Our network uses successive 3 × 3 and 1 × 1 convolutional layers but now has some shortcut connections as well and is significantly larger. It has 53 convolutional layers so we call it.... wait for it..... Darknet-53!
This new network is much more powerful than Darknet-19 but still more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:
Each network is trained with identical settings and tested at 256×256, single crop accuracy. Run times are measured on a Titan X at 256 × 256. Thus Darknet-53 performs on par with state-of-the-art classifiers but with fewer floating point operations and more speed. Darknet-53 is better than ResNet-101 and 1.5× faster. Darknet-53 has similar performance to ResNet-152 and is 2× faster.
Darknet-53 also achieves the highest measured floating point operations per second. This means the network structure better utilizes the GPU, making it more efficient to evaluate and thus faster. That’s mostly because ResNets have just way too many layers and aren’t very efficient.
2.5 Training
We still train on full images with no hard negative mining or any of that stuff. We use multi-scale training, lots of data augmentation, batch normalization, all the standard stuff. We use the Darknet neural network framework for training and testing [12].
3. How We Do
YOLOv3 is pretty good! See table 3. In terms of COCO’s weird average mean AP metric it is on par with the SSD variants but is 3× faster. It is still quite a bit behind other models like RetinaNet in this metric though.
However, when we look at the “old” detection metric of mAP at IOU= .5 (or AP50 in the chart) YOLOv3 is very strong. It is almost on par with RetinaNet and far above the SSD variants. This indicates that YOLOv3 is a very strong detector that excels at producing decent boxes for objects. However, performance drops significantly as the IOU threshold increases indicating YOLOv3 struggles to get the boxes perfectly aligned with the object.
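The difference between the two metrics comes down to how strict the IOU threshold is. As a rough sketch with made-up boxes, the snippet below computes the IOU of a slightly misaligned prediction and checks it against the .5 threshold and a stricter .75 threshold; the box coordinates are illustrative assumptions.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

truth = (0, 0, 100, 100)   # illustrative ground truth box
pred = (10, 10, 110, 110)  # same size, shifted by 10 pixels in x and y
overlap = iou(truth, pred)
matches = {thr: overlap >= thr for thr in (0.5, 0.75)}
```

A 10 pixel shift on a 100 pixel box already drops the IOU to about 0.68, so this "decent" box counts as a detection at .5 but as a miss at .75.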
In the past YOLO struggled with small objects. However, now we see a reversal in that trend. With the new multi-scale predictions we see YOLOv3 has relatively high APS performance. However, it has comparatively worse performance on medium and larger size objects. More investigation is needed to get to the bottom of this.
When we plot accuracy vs speed on the AP50 metric (see figure 3) we see YOLOv3 has significant benefits over other detection systems. Namely, it’s faster and better.
4. Things We Tried That Didn't Work
We tried lots of stuff while we were working on YOLOv3. A lot of it didn’t work. Here’s the stuff we can remember.
Anchor box x, y offset predictions. We tried using the normal anchor box prediction mechanism where you predict the x, y offset as a multiple of the box width or height using a linear activation. We found this formulation decreased model stability and didn’t work very well.
Linear x, y predictions instead of logistic. We tried using a linear activation to directly predict the x, y offset instead of the logistic activation. This led to a couple point drop in mAP.
Focal loss. We tried using focal loss. It dropped our mAP about 2 points. YOLOv3 may already be robust to the problem focal loss is trying to solve because it has separate objectness predictions and conditional class predictions. Thus for most examples there is no loss from the class predictions? Or something? We aren’t totally sure.
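For reference, focal loss scales the cross-entropy of each example by $(1 - p_t)^\gamma$, so confident, easy examples contribute very little. The sketch below compares it with plain binary cross-entropy; the values $\gamma = 2$ and $\alpha = 0.25$ are the commonly used defaults and are assumptions here, since the text does not say which settings were tried.

```python
import math

def bce(p, y):
    """Binary cross-entropy for predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal(p, y, gamma=2.0, alpha=0.25):
    """Focal loss: cross-entropy down-weighted by (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_negative = (0.01, 0)  # background scored confidently as background
hard_positive = (0.10, 1)  # object that the model mostly missed

easy_ratio = focal(*easy_negative) / bce(*easy_negative)
hard_ratio = focal(*hard_positive) / bce(*hard_positive)
```

The easy negative keeps only a tiny fraction of its cross-entropy loss while the hard positive keeps far more, which is the rebalancing that a separate objectness prediction may already provide.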
Dual IOU thresholds and truth assignment. Faster R-CNN uses two IOU thresholds during training. If a prediction overlaps the ground truth by .7 it is a positive example, by [.3−.7] it is ignored, and if it overlaps all ground truth objects by less than .3 it is a negative example. We tried a similar strategy but couldn’t get good results.
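The dual-threshold rule described above can be sketched as a small labeling function. The name `label_prediction` is hypothetical, and treating a prediction with no ground truth objects as a negative is an assumption of this sketch.

```python
def label_prediction(ious, hi=0.7, lo=0.3):
    """Label one prediction from its IoUs with every ground truth object."""
    best = max(ious) if ious else 0.0
    if best >= hi:
        return "positive"
    if best >= lo:
        return "ignore"    # between the two thresholds: contributes no loss
    return "negative"      # below lo for all ground truth objects

examples = {
    "good match": label_prediction([0.1, 0.8]),
    "borderline": label_prediction([0.5, 0.2]),
    "background": label_prediction([0.1, 0.2]),
}
```

Unlike the single-prior assignment used in YOLOv3, this rule can mark several predictions as positive for the same object.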
We quite like our current formulation; it seems to be at a local optimum at least. It is possible that some of these techniques could eventually produce good results; perhaps they just need some tuning to stabilize the training.
5. What This All Means
YOLOv3 is a good detector. It’s fast, it’s accurate. It’s not as great on the COCO average AP between .5 and .95 IOU metric. But it’s very good on the old detection metric of .5 IOU.
Why did we switch metrics anyway? The original COCO paper just has this cryptic sentence: “A full discussion of evaluation metrics will be added once the evaluation server is complete”. Russakovsky et al. report that humans have a hard time distinguishing an IOU of .3 from .5! “Training humans to visually inspect a bounding box with IOU of 0.3 and distinguish it from one with IOU 0.5 is surprisingly difficult.” [16] If humans have a hard time telling the difference, how much does it matter?
But maybe a better question is: “What are we going to do with these detectors now that we have them?” A lot of the people doing this research are at Google and Facebook. I guess at least we know the technology is in good hands and definitely won’t be used to harvest your personal information and sell it to.... wait, you’re saying that’s exactly what it will be used for?? Oh.
Well the other people heavily funding vision research are the military and they’ve never done anything horrible like killing lots of people with new technology oh wait.....
I have a lot of hope that most of the people using computer vision are just doing happy, good stuff with it, like counting the number of zebras in a national park [11], or tracking their cat as it wanders around their house [17]. But computer vision is already being put to questionable use and as researchers we have a responsibility to at least consider the harm our work might be doing and think of ways to mitigate it. We owe the world that much. In closing, do not @ me. (Because I finally quit Twitter).