GRU and an Image Captioning System Using Keras

  1. **Image caption prediction system design**: merge the output of a ConvNet with the output of a Gated Recurrent Unit (GRU), then follow the merge with another GRU that does not return a sequence; the output of the first GRU is sequence-based.
  2. **Attention**: understand why the merging procedure is followed by a GRU. Since we used:
language_model.add(GRU(output_dim=128, return_sequences=True))

the output of this GRU is also a sequence, so the output of the ConvNet must also be turned into a sequence:

image_model.add(RepeatVector(max_caption_len))

The merging procedure is then followed by another GRU because its input is a sequence.
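As a sanity check, here is a minimal numpy sketch (an illustration under assumed shapes, not part of the model code below) of what RepeatVector does to the 128-dimensional image vectors:

import numpy as np
max_caption_len = 16
image_vectors = np.random.rand(4, 128)  # (samples, 128) ConvNet encodings
# RepeatVector copies each vector max_caption_len times along a new time axis
repeated = np.repeat(image_vectors[:, np.newaxis, :], max_caption_len, axis=1)
print(repeated.shape)  # (4, 16, 128), i.e. (samples, max_caption_len, 128)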

  3. max_caption_len represents the length of the sequence. The GRU or LSTM (the recurrent module) comes after the embedding module:
Embedding(vocab_size, 256, input_length=max_caption_len)

Here vocab_size is the size of the vocabulary (the number of distinct word indices), 256 is the dimensionality of each word embedding, and input_length is the sequence length.
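A minimal numpy sketch (an illustration of the lookup semantics, not Keras internals) of what this embedding layer computes:

import numpy as np
vocab_size, embed_dim, max_caption_len = 10000, 256, 16
weights = np.random.rand(vocab_size, embed_dim)  # the learned lookup table
caption = np.random.randint(0, vocab_size, size=max_caption_len)  # word indices
embedded = weights[caption]  # each index is replaced by its 256-dim row
print(embedded.shape)  # (16, 256), i.e. (input_length, embedding dim)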

  4. Questions:
    (1) What if we want to apply different learning hyper-parameters to these two sub-models?
    (2) Why a GRU instead of an LSTM? What is the advantage of a GRU? (A rough parameter-count comparison follows below.)
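One common answer to question (2) is parameter count: an LSTM layer has four gate weight matrices while a GRU has only three, so a GRU of the same size is roughly 25% smaller and correspondingly faster to train. A back-of-the-envelope sketch:

def rnn_param_count(input_dim, hidden_dim, n_gates):
    # per gate: input weights + recurrent weights + bias
    return n_gates * (input_dim * hidden_dim + hidden_dim * hidden_dim + hidden_dim)

print(rnn_param_count(256, 128, n_gates=3))  # GRU:  147840 parameters
print(rnn_param_count(256, 128, n_gates=4))  # LSTM: 197120 parameters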
  5. Conclusion: an architecture for learning image captions with a ConvNet and a Gated Recurrent Unit
    (word-level embedding, captions of maximum length 16 words).
    Note that getting this to work well will require a bigger ConvNet, initialized with pre-trained weights. (Yes, we can use pre-trained weights here!)
# imports for the legacy Keras (~0.x/1.x) API that this example targets;
# exact module paths may differ slightly between those versions.
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Flatten, RepeatVector, Merge, TimeDistributedDense
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import GRU

max_caption_len = 16
vocab_size = 10000
# first, let's define an image model that
# will encode pictures into 128-dimensional vectors.
# it should be initialized with pre-trained weights.
image_model = Sequential()
image_model.add(Convolution2D(32, 3, 3, border_mode='valid', input_shape=(3, 100, 100)))
image_model.add(Activation('relu'))
image_model.add(Convolution2D(32, 3, 3))
image_model.add(Activation('relu'))
image_model.add(MaxPooling2D(pool_size=(2, 2)))
image_model.add(Convolution2D(64, 3, 3, border_mode='valid'))
image_model.add(Activation('relu'))
image_model.add(Convolution2D(64, 3, 3))
image_model.add(Activation('relu'))
image_model.add(MaxPooling2D(pool_size=(2, 2)))
image_model.add(Flatten())
image_model.add(Dense(128))
# let's load the weights from a save file.
image_model.load_weights('weight_file.h5')
# next, let's define a RNN model that encodes sequences of words
# into sequences of 128-dimensional word vectors.
language_model = Sequential()
language_model.add(Embedding(vocab_size, 256, input_length=max_caption_len))
language_model.add(GRU(output_dim=128, return_sequences=True))
language_model.add(TimeDistributedDense(128))
# let's repeat the image vector to turn it into a sequence.
image_model.add(RepeatVector(max_caption_len))
# the output of both models will be tensors of shape (samples, max_caption_len, 128).
# let's concatenate these 2 vector sequences.
model = Sequential()
model.add(Merge([image_model, language_model], mode='concat', concat_axis=-1))
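# after concatenation along the last axis, the merged tensor has shape
# (samples, max_caption_len, 256): 128 image dims + 128 word dims per timestep.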
# let's encode this vector sequence into a single vector
model.add(GRU(256, return_sequences=False))
# which will be used to compute a probability
# distribution over what the next word in the caption should be!
model.add(Dense(vocab_size))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
# "images" is a numpy float array of shape (nb_samples, nb_channels=3, width, height).
# "captions" is a numpy integer array of shape (nb_samples, max_caption_len)
# containing word index sequences representing partial captions.
# "next_words" is a numpy float array of shape (nb_samples, vocab_size)
# containing a categorical encoding (0s and 1s) of the next word in the corresponding
# partial caption.
model.fit([images, partial_captions], next_words, batch_size=16, nb_epoch=100)
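At inference time, captions are generated one word at a time. The sketch below shows greedy decoding against the trained model; the start-token index, the right-zero-padding of the partial caption, and the helper name greedy_caption are assumptions for illustration, not part of the original example:

import numpy as np

def greedy_caption(image, start_token=1):
    # image: array of shape (3, 100, 100); returns a list of word indices.
    caption = [start_token]  # assumed start-of-sequence index
    for _ in range(max_caption_len - 1):
        padded = np.zeros((1, max_caption_len), dtype='int32')
        padded[0, :len(caption)] = caption  # assumed zero padding on the right
        probs = model.predict([image[np.newaxis], padded])[0]  # (vocab_size,)
        caption.append(int(np.argmax(probs)))  # pick the most likely next word
    return caption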