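[Paper] Learning to Summarize Web Image and Text Mutually

I. Tasks

1. Image summarization: image -> sentence description, using an image classification model; PIS -> ITJS

2. Text visualization: text -> image, using a text categorization model (?); images semantically close to the text are chosen as its visual representation; PTS -> ITJS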

II. Related Work

1. Events in still images

1) Event classification: little studied, and mostly confined to specific domains such as human activity/action classification.

2) Sentence and summary generation for images: AND-OR graph, etc. These methods generate a direct representation of what objects exist and what is happening in a scene, and then decode it into a sentence. Text generation depends on object recognition, which remains hard.

How this paper differs: We focus on the problem of summarizing images using high-level semantic sentences or short articles collected from the Internet, not just describing “what are there” or “what is happening” in images.

2、Cross-Media Retrieval

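1) First-generation multimodal retrieval engines, represented by [2,5,12,15], support text queries over images that have no text metadata, although those images usually carry keywords or class labels.

2) [3] is an exception: it learns a "latent space" from images and text.

3) More advanced systems either fuse features from different modalities into a single vector [22,28], or first learn a separate model per modality and then fuse the outputs [16,27].

4) The methods above generally require both image and text features as input.

5) There are now methods that map images and text into a shared space so that relevance can be computed directly [25].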

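III. Mutual-Summarization

1. Image Summarization

Simplified to a retrieval problem: Image -> Sentence

Further introducing the ITJS data: I->S ≈ I->D + D->S ≈ I->ID + ID->TD + TD->S, where D =

Key assumption: in the ITJS space, an I and a T that belong to the same D are semantically related. The problem therefore reduces to two subproblems: image classification and automatic text summarization.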

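1) Automatic Text Summarization

The MEAD [24] system is used to generate the text summaries. MEAD implements several summarization algorithms, e.g. position-based, Centroid, TF*IDF, and query-based. It also provides two baseline methods, lead-based and random-based: the former takes the first sentence of every document in the cluster in order, then the following sentences in turn; the random method picks sentences from the cluster at random. This paper uses lead-based single-document summarization with a compression ratio of 25%.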
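The lead-based baseline for a single document can be sketched in a few lines of Python. This is only an illustration, not MEAD's code: the sentence splitter is a naive regular expression, the function names are made up, and it assumes the 25% compression ratio is counted in sentences rather than words.

```python
import math
import re


def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace (assumption: MEAD uses its own)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def lead_based_summary(text, compression=0.25):
    """Keep the leading ceil(compression * n) sentences of one document."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    keep = max(1, math.ceil(compression * len(sentences)))
    return sentences[:keep]


# Made-up 8-sentence document; 25% of 8 sentences keeps the first 2.
doc = (
    "A flood hit the town on Monday. Roads were closed. "
    "Rescue teams arrived by boat. Schools stayed shut. "
    "Power was cut in several districts. Residents moved to shelters. "
    "Rain is forecast to ease on Friday. Officials will assess damage."
)
summary = lead_based_summary(doc)
print(summary)
```

The random-based baseline would differ only in drawing the same number of sentences at random instead of taking the leading ones.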

2) Image Classification

Simplified to a 6-class image classification task. Classification uses a multiple kernel SVM (MK-SVM) [8,26], which is compared against MK-KNN and SCM [25].

a) MK-SVM

Features: optimal combination of state-of-art features and spatial pyramid levels, by using the MKL technique

Model: $H(I, Y, \Theta) = \sum_i \Theta_i \, [K(\alpha(I), \alpha(I_i)), Y_i]$

Here α is the feature description of an image, $Y_i$ is the class label, and K is a positive definite kernel, obtained as a linear combination of histogram kernels.

Kernel: $K(\alpha(I), \alpha(I_i)) = \sum_{k=1}^{\#\alpha} \beta_k \, k(\alpha_k(I), \alpha_k(I_i))$

subject to $\sum_{k=1}^{\#\alpha} \beta_k = 1$

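Here #α is the number of features describing an image; for example, with two feature types, Color Histogram and Pyramid SIFT, #α = 2.

MKL learns the coefficients $\Theta_i$ and the histogram combination weights $\beta_k \in [0,1]$.

For the base kernel k, three types are considered (they differ in discriminative power and computational cost). The paper's first choice is the histogram intersection kernel (The Histogram Intersection Kernel is also known as the Min Kernel and has been proven useful in image classification).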

$k(x, y) = \sum_i \min(x_i, y_i)$

The paper also compares classification performance against the RBF kernel and the linear kernel.
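A small Python sketch of the three base kernels and the weighted combination K may help make the formulas concrete. The feature values, the `gamma` parameter, and the function names are illustrative assumptions, and the weights β are fixed by hand here, whereas in the paper they are learned by MKL.

```python
import math


def histogram_intersection(x, y):
    """k(x, y) = sum_i min(x_i, y_i), also known as the Min Kernel."""
    return sum(min(a, b) for a, b in zip(x, y))


def linear_kernel(x, y):
    """k(x, y) = sum_i x_i * y_i."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2); gamma is an assumed parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def combined_kernel(feats_a, feats_b, betas, base=histogram_intersection):
    """K = sum_k beta_k * k(alpha_k(I), alpha_k(I')), with the betas summing to 1."""
    if not (len(feats_a) == len(feats_b) == len(betas)):
        raise ValueError("one weight per feature channel is required")
    if any(b < 0 or b > 1 for b in betas) or abs(sum(betas) - 1.0) > 1e-9:
        raise ValueError("betas must lie in [0, 1] and sum to 1")
    return sum(b * base(fa, fb) for b, fa, fb in zip(betas, feats_a, feats_b))


# Two feature channels (#alpha = 2); the histogram values are made up.
img1 = [[0.5, 0.25, 0.25], [0.75, 0.25]]
img2 = [[0.25, 0.5, 0.25], [0.5, 0.5]]
betas = [0.5, 0.5]  # hand-picked here, not learned

k_value = combined_kernel(img1, img2, betas)
print(k_value)  # 0.75
```

Passing `base=linear_kernel` or `base=rbf_kernel` swaps the base kernel while leaving the combination unchanged, which is roughly the comparison the notes describe.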

b) Multiple Kernel KNN (MK-KNN)

Standard KNN, with the similarity metric replaced by the kernel: s(x,y) = K(x,y)
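As a rough illustration of s(x,y) = K(x,y), the sketch below runs a majority-vote KNN in which neighbours are ranked by a combined histogram intersection kernel. The class labels "A" and "B", the feature values, the weights, and k = 3 are all made up for the example; the paper's actual settings are not given in these notes.

```python
from collections import Counter


def histogram_intersection(x, y):
    """Base kernel: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))


def mk_similarity(feats_a, feats_b, betas):
    """s(x, y) = K(x, y): weighted sum of per-channel histogram intersections."""
    return sum(
        b * histogram_intersection(fa, fb)
        for b, fa, fb in zip(betas, feats_a, feats_b)
    )


def mk_knn_predict(query, train, betas, k=3):
    """Return the majority label among the k training items most similar to query."""
    ranked = sorted(
        train,
        key=lambda item: mk_similarity(query, item[0], betas),
        reverse=True,  # larger kernel value means more similar
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: (two-channel features, label).
train = [
    ([[0.75, 0.25], [0.5, 0.5]], "A"),
    ([[1.0, 0.0], [0.75, 0.25]], "A"),
    ([[0.0, 1.0], [0.25, 0.75]], "B"),
    ([[0.25, 0.75], [0.0, 1.0]], "B"),
]
query = [[0.75, 0.25], [0.75, 0.25]]
betas = [0.5, 0.5]

predicted = mk_knn_predict(query, train, betas, k=3)
print(predicted)  # A
```

Unlike a distance-based KNN, the neighbours are sorted in descending order, because a kernel value is a similarity rather than a distance.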

c) Semantic Correlation Matching (SCM)

[25] uses canonical correlation analysis (CCA) to learn canonical components for images and for text, $w_i$ and $w_t$ respectively.

3) Sentence Selection

After image classification, a new image I can be mapped to an ID. Using M(ID-TD) and M(TD-S), we obtain a list of sentences S, ranked by their confidence with respect to I.

Conf(x,y) = K(x,y), Conf(I,S) ≈ Conf(I,ID)
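The lookup chain I -> ID -> TD -> S can be sketched as two dictionary lookups followed by a sort. Everything below is hypothetical: ID and TD are treated as opaque identifiers, the mappings M(ID-TD) and M(TD-S) are plain dictionaries, and the scores stand in for Conf(I,ID) as produced by whichever classifier is used.

```python
def rank_sentences(class_scores, id_to_td, td_to_sentences, top_n=3):
    """Rank candidate sentences for an image, using Conf(I, S) ~ Conf(I, ID).

    class_scores: {ID: confidence} from the image classifier (assumed given)
    id_to_td: stand-in for M(ID-TD), {ID: [TD, ...]}
    td_to_sentences: stand-in for M(TD-S), {TD: [sentence, ...]}
    """
    ranked = []
    # Visit IDs from most to least confident; sentences inherit the ID's score.
    for doc_id, conf in sorted(
        class_scores.items(), key=lambda kv: kv[1], reverse=True
    ):
        for td in id_to_td.get(doc_id, []):
            for sentence in td_to_sentences.get(td, []):
                ranked.append((conf, sentence))
    return ranked[:top_n]


# Made-up identifiers, scores, and sentences.
class_scores = {"id_a": 0.8, "id_b": 0.2}
id_to_td = {"id_a": ["td_1"], "id_b": ["td_2"]}
td_to_sentences = {
    "td_1": ["summary sentence 1", "summary sentence 2"],
    "td_2": ["summary sentence 3"],
}

result = rank_sentences(class_scores, id_to_td, td_to_sentences)
print(result)
```

Because the confidence is approximated at the ID level, all sentences reached through the same ID tie on score; in this sketch they keep the order in which the summarizer emitted them.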

2. Text Visualization
