Paper | Open-Vocabulary Object Detection Using Captions

1 basic

  • github.com/alirezazareian/ovr-cnn
  • the first paper to propose the task of "open-vocabulary object detection" (OVD)

2 introduction

OD (standard object detection): each category needs thousands of annotated bounding boxes;

stage 1: use {image, caption} pairs to learn a visual semantic space;
stage 2: use annotated boxes for several classes to train object detection;
stage 3: inference which can detect objects beyond the base classes;

to summarize, we train a model that takes an image and detects any object within a given target vocabulary V_{T}.

Task Definition:

  1. test on a target vocabulary V_{T};
  2. train on an image-caption dataset whose vocabulary is V_{C};
  3. train on an annotated object detection dataset whose vocabulary is V_{B} (the base classes);
  4. V_{T} is not known during training and can be any subset of the entire vocabulary V_{\Omega} (a toy example follows below).
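As a toy illustration of how these vocabularies relate (all class/word names below are hypothetical placeholders, not the paper's actual splits):

```python
# Hypothetical vocabularies illustrating the OVD setup.
V_omega = {"person", "car", "dog", "umbrella", "zebra", "beach", "street"}  # word-embedding vocabulary
V_B = {"person", "car", "dog"}                           # base classes with box annotations
V_C = {"person", "dog", "umbrella", "beach", "street"}   # words appearing in captions
V_T = {"umbrella", "zebra"}                              # target classes, unknown at training time

assert V_B <= V_omega and V_T <= V_omega                 # both are subsets of the full vocabulary
print(V_T - V_B)                                         # novel classes never seen with boxes
```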

**compare with ZSD (zero-shot detection) and WSD (weakly supervised detection):**

  • ZSD: no V_{C};
  • WSD: no V_{B}, and V_{T} must be known before training;
  • OVD is a generalization of ZSD and WSD.

outcome:

  • significantly outperforms the ZSD and WSD methods;

3 Method

OVD framework:

  • meaning of open: the words in the captions are not limited, but in practice it is not literally "open", since it is limited to the vocabulary of the pretrained word embeddings. (However, word embeddings are typically trained on very large text corpora such as Wikipedia that cover nearly every word.)

3.1 Learning visual semantic space
  • resembles PixelBERT
  • use ResNet-50 (RN50) as the visual encoder and BERT as the text encoder;
  • design a V2L (vision-to-language) module that maps vision patch features into the text embedding space;
  • use the grounding (main) task to train the RN50 & V2L module.

specifically,

  1. input image --> RN50 --> features of patches
  2. each patch feature (vision) --> V2L --> patch feature (language) e^{I}_{i}
  3. caption --> embedding e^{C}_{j} --> BERT --> features of words f^{C}_{j}
  4. patch features (language) and word features --> multimodal transformer --> new features for patches and words m^{I}_{i}, m^{C}_{j}.
  5. task: perform weakly supervised grounding using {e^{I}_{i}, e^{C}_{j}}: the paired {image, caption} is the positive, unpaired {image, caption} combinations are the negatives, and the similarity between an image and a caption is computed by averaging over all e^{I}_{i} and e^{C}_{j}; see the sketches after this list.
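A rough sketch, under assumptions, of how steps 1-3 could be wired in PyTorch (torchvision ResNet-50 + Hugging Face BERT); the V2L module is shown as a simple linear projection, the multimodal transformer of step 4 is omitted for brevity, and names like `VisualSemanticModel` are illustrative, not the paper's code.

```python
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel

class VisualSemanticModel(nn.Module):
    """Illustrative wiring of the visual encoder, V2L projection, and text encoder."""
    def __init__(self, word_dim=768):
        super().__init__()
        backbone = resnet50()                       # load ImageNet weights as appropriate
        # keep only the convolutional layers -> a spatial grid of "patch" features
        self.visual = nn.Sequential(*list(backbone.children())[:-2])
        self.v2l = nn.Linear(2048, word_dim)        # V2L: vision features -> word-embedding space
        self.text = BertModel.from_pretrained("bert-base-uncased")

    def forward(self, images, input_ids, attention_mask):
        feats = self.visual(images)                 # (B, 2048, H', W')
        patches = feats.flatten(2).transpose(1, 2)  # (B, R, 2048), R = H'*W' patches
        e_img = self.v2l(patches)                   # e^I_i : (B, R, word_dim)
        f_cap = self.text(input_ids, attention_mask=attention_mask).last_hidden_state  # f^C_j
        return e_img, f_cap
```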

the grounding objective results in a learned visual backbone and V2L layer that can map regions of the image to the words that best describe them.
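A minimal sketch of such a grounding-based contrastive objective, assuming a plain average of region-word similarities (the paper's exact weighting may differ) and batch-wise negatives:

```python
import torch
import torch.nn.functional as F

def grounding_scores(region_embs, word_embs):
    """Image-caption similarity from embeddings in the shared word space.

    region_embs: (B, R, D) e^I_i, image regions mapped by V2L
    word_embs:   (B, W, D) e^C_j, caption word embeddings
    Returns a (B, B) score matrix for every image-caption pairing in the batch.
    """
    sim = torch.einsum("brd,cwd->bcrw", region_embs, word_embs)  # all region-word dot products
    return sim.mean(dim=(2, 3))                                  # simple average (an assumption)

def grounding_loss(region_embs, word_embs, temperature=0.1):
    """Paired {image, caption} is the positive; unpaired combinations are negatives."""
    scores = grounding_scores(region_embs, word_embs) / temperature
    targets = torch.arange(scores.size(0), device=scores.device)
    # symmetric cross-entropy over captions (rows) and images (columns)
    return 0.5 * (F.cross_entropy(scores, targets) + F.cross_entropy(scores.t(), targets))
```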

besides, to teach the model to 1) extract all objects that might be described in the captions and 2) determine which word best completes the caption, the paper further introduces an image-text matching (ITM) subtask and a masked language modeling (MLM) subtask.
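A rough sketch of what such auxiliary heads could look like, assuming ITM is a binary match/no-match classifier over a pooled multimodal feature and MLM predicts each masked token from its multimodal feature; the actual head designs in the paper may differ.

```python
import torch
import torch.nn as nn

class ITMHead(nn.Module):
    """Binary image-text matching: does this caption describe this image?"""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, 2)

    def forward(self, multimodal_feats):           # (B, R + W, D): all m^I_i and m^C_j
        pooled = multimodal_feats.mean(dim=1)      # simple mean pooling (an assumption)
        return self.fc(pooled)                     # (B, 2) match / no-match logits

class MLMHead(nn.Module):
    """Masked language modeling: recover the masked caption words."""
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.fc = nn.Linear(dim, vocab_size)

    def forward(self, word_feats, masked_positions):               # word_feats: (B, W, D) m^C_j
        batch_idx = torch.arange(word_feats.size(0)).unsqueeze(1)  # (B, 1)
        masked = word_feats[batch_idx, masked_positions]           # (B, n_masked, D)
        return self.fc(masked)                                     # (B, n_masked, vocab_size)
```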

3.2 Learning open-vocabulary detection
  • use Faster R-CNN
  1. blocks 1-3 of the ResNet backbone to extract features
  2. RPN --> predict objectness & bounding-box coordinates;
  3. non-maximum suppression (NMS)
  4. region-of-interest (ROI) pooling to get a feature map for each candidate object, which in the fully supervised setting would be fed to a classification head;

However, in the open-vocabulary (zero-shot-like) setting, the pooled features are instead projected into the word-embedding space by the pretrained V2L layer and classified by comparing them (dot product + softmax) against the embeddings of the base-class words plus a background class, rather than by a learned classifier layer, so novel classes can later be handled simply by swapping in their embeddings.
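A minimal sketch of this embedding-based classification step; `classify_rois`, the single background embedding, and the temperature are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def classify_rois(roi_feats, class_embs, bg_emb, temperature=1.0):
    """Classify each region by similarity to class-name embeddings.

    roi_feats:  (N, D) ROI-pooled features mapped into the word-embedding space by V2L
    class_embs: (K, D) embeddings of the class names (base classes at train time)
    bg_emb:     (1, D) embedding standing in for "background" (assumed learnable here)
    Returns (N, K + 1) logits; index K is background.
    """
    all_embs = torch.cat([class_embs, bg_emb], dim=0)   # (K + 1, D)
    return roi_feats @ all_embs.t() / temperature       # dot-product similarity as logits

# Training on the base classes then uses standard cross-entropy, e.g.:
# loss = F.cross_entropy(classify_rois(roi_feats, base_class_embs, bg_emb), labels)
```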

3.3 Testing

basically the same as training, except that in the last step the box features (after V2L) are compared to the embeddings of the target classes V_{T} instead of the base classes.
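Continuing the sketch above, test time only changes which class-name embeddings are compared against (`embed_class_names` is a hypothetical helper that maps class names into the pretrained word-embedding space):

```python
# Reuse the trained detector and the V2L projection; only the class embeddings change at test time.
target_class_embs = embed_class_names(V_T)            # hypothetical: class names -> embeddings
logits = classify_rois(roi_feats, target_class_embs, bg_emb)
pred_classes = logits.argmax(dim=-1)                  # per-box target class (or background)
```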
