Paper | Open-Vocabulary Object Detection Using Captions

1 basic

  • github.com/alirezazareian/ovr-cnn
  • the first paper to propose the task of "open-vocabulary object detection" (OVD)

2 introduction

OD (standard object detection): each category needs thousands of annotated bounding boxes;

stage 1: use {image, caption} pairs to learn a visual semantic space;
stage 2: use annotated boxes for several classes to train object detection;
stage 3: inference which can detect objects beyond the base classes;

to summarize, we train a model that takes an image and detects any object within a given target vocabulary V_{T}.

Task Definition:

  1. test on a target vocabulary V_{T};
  2. train on an image-caption dataset whose vocabulary is V_{C};
  3. train on an annotated object detection dataset whose vocabulary is V_{B} (the base classes);
  4. V_{T} is not known during training and can be any subset of the entire vocabulary V_{\Omega} (a toy example follows below).
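As a toy illustration of how these vocabularies relate (all class/word names below are hypothetical placeholders, not the paper's actual splits):

```python
# Hypothetical vocabularies illustrating the OVD setup.
V_omega = {"person", "car", "dog", "umbrella", "zebra", "beach", "street"}  # word-embedding vocabulary
V_B = {"person", "car", "dog"}                           # base classes with box annotations
V_C = {"person", "dog", "umbrella", "beach", "street"}   # words appearing in captions
V_T = {"umbrella", "zebra"}                              # target classes, unknown at training time

assert V_B <= V_omega and V_T <= V_omega                 # both are subsets of the full vocabulary
print(V_T - V_B)                                         # novel classes never seen with boxes
```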

**compare with ZSD (zero-shot detection) and WSD (weakly supervised detection):**

  • ZSD: no V_{C};
  • WSD: no V_{B}, and V_{T} must be known before training;
  • OVD is a generalization of ZSD and WSD.

outcome:

  • significantly outperforms the ZSD and WSD methods;

3 Method

OVD framework:

  • meaning of open: the words in the captions are not limited, but in practice it is not literally "open", since it is limited to the vocabulary of the pretrained word embeddings. (However, word embeddings are typically trained on very large text corpora such as Wikipedia that cover nearly every word.)

3.1 Learning visual semantic space
  • resembles PixelBERT
  • use ResNet-50 (RN50) as the visual encoder and BERT as the text encoder;
  • design a V2L (vision-to-language) module that maps vision patch features into the text embedding space;
  • use the grounding (main) task to train the RN50 & V2L module.

specifically,

  1. input image --> RN50 --> features of patches
  2. each patch feature (vision) --> V2L --> patch feature (language) e^{I}_{i}
  3. caption --> embedding e^{C}_{j} --> BERT --> features of words f^{C}_{j}
  4. patch features (language) and word features --> multimodal transformer --> new features for patches and words m^{I}_{i}, m^{C}_{j}.
  5. task: perform weakly supervised grounding using {e^{I}_{i}, e^{C}_{j}}: the paired {image, caption} is the positive, unpaired {image, caption} combinations are the negatives, and the similarity between an image and a caption is computed by averaging over all e^{I}_{i} and e^{C}_{j}; see the sketches after this list.
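A rough sketch, under assumptions, of how steps 1-3 could be wired in PyTorch (torchvision ResNet-50 + Hugging Face BERT); the V2L module is shown as a simple linear projection, the multimodal transformer of step 4 is omitted for brevity, and names like `VisualSemanticModel` are illustrative, not the paper's code.

```python
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel

class VisualSemanticModel(nn.Module):
    """Illustrative wiring of the visual encoder, V2L projection, and text encoder."""
    def __init__(self, word_dim=768):
        super().__init__()
        backbone = resnet50()                       # load ImageNet weights as appropriate
        # keep only the convolutional layers -> a spatial grid of "patch" features
        self.visual = nn.Sequential(*list(backbone.children())[:-2])
        self.v2l = nn.Linear(2048, word_dim)        # V2L: vision features -> word-embedding space
        self.text = BertModel.from_pretrained("bert-base-uncased")

    def forward(self, images, input_ids, attention_mask):
        feats = self.visual(images)                 # (B, 2048, H', W')
        patches = feats.flatten(2).transpose(1, 2)  # (B, R, 2048), R = H'*W' patches
        e_img = self.v2l(patches)                   # e^I_i : (B, R, word_dim)
        f_cap = self.text(input_ids, attention_mask=attention_mask).last_hidden_state  # f^C_j
        return e_img, f_cap
```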

the grounding objective results in a learned visual backbone and V2L layer that can map regions of the image to the words that best describe them.
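A minimal sketch of such a grounding-based contrastive objective, assuming a plain average of region-word similarities (the paper's exact weighting may differ) and batch-wise negatives:

```python
import torch
import torch.nn.functional as F

def grounding_scores(region_embs, word_embs):
    """Image-caption similarity from embeddings in the shared word space.

    region_embs: (B, R, D) e^I_i, image regions mapped by V2L
    word_embs:   (B, W, D) e^C_j, caption word embeddings
    Returns a (B, B) score matrix for every image-caption pairing in the batch.
    """
    sim = torch.einsum("brd,cwd->bcrw", region_embs, word_embs)  # all region-word dot products
    return sim.mean(dim=(2, 3))                                  # simple average (an assumption)

def grounding_loss(region_embs, word_embs, temperature=0.1):
    """Paired {image, caption} is the positive; unpaired combinations are negatives."""
    scores = grounding_scores(region_embs, word_embs) / temperature
    targets = torch.arange(scores.size(0), device=scores.device)
    # symmetric cross-entropy over captions (rows) and images (columns)
    return 0.5 * (F.cross_entropy(scores, targets) + F.cross_entropy(scores.t(), targets))
```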

besides, to teach the model to 1) extract all objects that might be described in the captions and 2) determine which word best completes the caption, the paper further introduces an image-text matching (ITM) subtask and a masked language modeling (MLM) subtask.
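A rough sketch of what such auxiliary heads could look like, assuming ITM is a binary match/no-match classifier over a pooled multimodal feature and MLM predicts each masked token from its multimodal feature; the actual head designs in the paper may differ.

```python
import torch
import torch.nn as nn

class ITMHead(nn.Module):
    """Binary image-text matching: does this caption describe this image?"""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, 2)

    def forward(self, multimodal_feats):           # (B, R + W, D): all m^I_i and m^C_j
        pooled = multimodal_feats.mean(dim=1)      # simple mean pooling (an assumption)
        return self.fc(pooled)                     # (B, 2) match / no-match logits

class MLMHead(nn.Module):
    """Masked language modeling: recover the masked caption words."""
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.fc = nn.Linear(dim, vocab_size)

    def forward(self, word_feats, masked_positions):               # word_feats: (B, W, D) m^C_j
        batch_idx = torch.arange(word_feats.size(0)).unsqueeze(1)  # (B, 1)
        masked = word_feats[batch_idx, masked_positions]           # (B, n_masked, D)
        return self.fc(masked)                                     # (B, n_masked, vocab_size)
```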

3.2 Learning open-vocabulary detection
  • use Faster R-CNN
  1. blocks 1-3 of the ResNet backbone to extract features
  2. RPN --> predict objectness & bounding-box coordinates;
  3. non-maximum suppression (NMS)
  4. region-of-interest (ROI) pooling to get a feature map for each candidate object, which in the fully supervised setting would be fed to a classification head;

However, in the open-vocabulary (zero-shot-like) setting, the pooled features are instead projected into the word-embedding space by the pretrained V2L layer and classified by comparing them (dot product + softmax) against the embeddings of the base-class words plus a background class, rather than by a learned classifier layer, so novel classes can later be handled simply by swapping in their embeddings.
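A minimal sketch of this embedding-based classification step; `classify_rois`, the single background embedding, and the temperature are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def classify_rois(roi_feats, class_embs, bg_emb, temperature=1.0):
    """Classify each region by similarity to class-name embeddings.

    roi_feats:  (N, D) ROI-pooled features mapped into the word-embedding space by V2L
    class_embs: (K, D) embeddings of the class names (base classes at train time)
    bg_emb:     (1, D) embedding standing in for "background" (assumed learnable here)
    Returns (N, K + 1) logits; index K is background.
    """
    all_embs = torch.cat([class_embs, bg_emb], dim=0)   # (K + 1, D)
    return roi_feats @ all_embs.t() / temperature       # dot-product similarity as logits

# Training on the base classes then uses standard cross-entropy, e.g.:
# loss = F.cross_entropy(classify_rois(roi_feats, base_class_embs, bg_emb), labels)
```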

3.3 Testing

basically the same as training, except that in the last step the box features (after V2L) are compared to the embeddings of the target classes V_{T} instead of the base classes.
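Continuing the sketch above, test time only changes which class-name embeddings are compared against (`embed_class_names` is a hypothetical helper that maps class names into the pretrained word-embedding space):

```python
# Reuse the trained detector and the V2L projection; only the class embeddings change at test time.
target_class_embs = embed_class_names(V_T)            # hypothetical: class names -> embeddings
logits = classify_rois(roi_feats, target_class_embs, bg_emb)
pred_classes = logits.argmax(dim=-1)                  # per-box target class (or background)
```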
