Implementing Seq2seq + Attention in TensorFlow

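Implementing Seq2seq + Attention takes the following steps:

1. Define the RNNCell (wrapped with the attention mechanism)
2. Define the TrainingHelper
3. Define the BasicDecoder, passing in the cell and helper from steps 1 and 2
4. Pass the decoder from step 3 to dynamic_decode and run it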

1. The attention mechanism

tf.contrib.seq2seq.LuongAttention

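Parameter	Meaning
num_units	The dimension (hidden_size) of the output vectors of the encoder stage
memory	All feature vectors produced by the encoder stage for one batch, with shape [batch_size, max_time, num_units]
memory_sequence_length	The true length of each sequence in memory, with shape [batch_size]; values in memory beyond memory_sequence_length are set to 0
probability_fn	Converts the scores into probabilities; the default is softmax
score_mask_value	The value used to mask the scores before they are passed to probability_fn (default -inf); only used when memory_sequence_length is provided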
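To make the parameters above concrete, here is a minimal NumPy sketch of one dot-product (Luong-style) attention step. It is an illustration, not the library's implementation: it skips the learned projection that LuongAttention applies to memory, and the function name luong_attention is made up for this example.

```python
import numpy as np

def luong_attention(query, memory, memory_sequence_length):
    """Dot-product attention for a single decoder step (illustrative only).

    query:  [batch_size, num_units]            current decoder output
    memory: [batch_size, max_time, num_units]  encoder outputs
    memory_sequence_length: [batch_size]       true length of each memory row
    """
    # score[b, t] = memory[b, t] . query[b]
    scores = np.einsum("btd,bd->bt", memory, query)
    # Positions past the true length get the mask value (-inf) before softmax.
    max_time = memory.shape[1]
    lengths = np.asarray(memory_sequence_length)
    mask = np.arange(max_time)[None, :] < lengths[:, None]
    scores = np.where(mask, scores, -np.inf)
    # probability_fn: softmax over the time axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    # Context vector: attention-weighted sum of the memory.
    context = np.einsum("bt,btd->bd", weights, memory)
    return weights, context

rng = np.random.default_rng(0)
memory = rng.normal(size=(2, 4, 3))
query = rng.normal(size=(2, 3))
weights, context = luong_attention(query, memory, [2, 4])
print(weights.round(3))
```

The first row of weights is zero after position 2, because that sequence has a true length of 2.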

tf.contrib.seq2seq.AttentionWrapper

# Three steps:
# Step 1: define the attention mechanism
# Step 2: define the base RNNCell to use
# Step 3: wrap the cell with AttentionWrapper

# Define the attention mechanism to use.
attention_mechanism = tf.contrib.seq2seq.LuongAttention(
    num_units=self.rnn_size,
    memory=encoder_outputs,
    memory_sequence_length=encoder_inputs_length)
# Define the LSTMCell used in the decoder stage, then wrap it with the attention wrapper
decoder_cell = self._create_rnn_cell()
decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
    cell=decoder_cell,
    attention_mechanism=attention_mechanism,
    attention_layer_size=self.rnn_size,
    name='Attention_Wrapper')

2. helper

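A helper determines the input to the decoder. During training, the ground-truth value of the previous time step should be used directly as the input of the next time step. During prediction, a greedy strategy can be used: the value with the highest probability becomes the input of the next time step.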

tf.contrib.seq2seq.TrainingHelper

helper = tf.contrib.seq2seq.TrainingHelper(
    inputs=input_vectors,
    sequence_length=input_lengths)
inputs has shape [batch_size, step, size of the input at each time step].
sequence_length has length batch_size and gives the true length of each input, since some positions are padding.

GreedyEmbeddingHelper(embedding, start_tokens, end_token)

https://blog.csdn.net/tudaodiaozhale/article/details/99335220
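TrainingHelper does not need the embedding argument, because during training we feed it inputs of shape [batch_size, seq_len, embed_size], which are already word vectors. During inference we only supply a start token and the length of the sentence we want, so every time a word is emitted it still has to go through embedding_lookup to become the word vector used as the input of the next time step.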
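The difference between the two helpers can be sketched in NumPy. The function names training_next_inputs and greedy_next_inputs are invented for this illustration and are not part of tf.contrib.seq2seq. The sketch only shows where the next input comes from in each mode.

```python
import numpy as np

def training_next_inputs(decoder_inputs, t):
    """Teacher forcing: the input at step t is the ground-truth vector at step t."""
    return decoder_inputs[:, t, :]

def greedy_next_inputs(logits, embedding):
    """Greedy decoding: pick the highest-scoring id, then look up its embedding."""
    sample_ids = np.argmax(logits, axis=-1)       # [batch_size]
    return sample_ids, embedding[sample_ids]      # [batch_size, embed_size]

# Toy embedding table: vocab_size=4, embed_size=2
embedding = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
# Training inputs are already word vectors: [batch_size=2, seq_len=3, embed_size=2]
decoder_inputs = embedding[np.array([[1, 2, 3], [1, 3, 0]])]
# Decoder scores for one step: [batch_size=2, vocab_size=4]
logits = np.array([[0.1, 0.2, 3.0, 0.3], [2.0, 0.1, 0.1, 0.1]])

print(training_next_inputs(decoder_inputs, 1))
ids, next_inputs = greedy_next_inputs(logits, embedding)
print(ids, next_inputs)
```

In the training case no lookup happens at decode time. In the greedy case the embedding table is needed at every step, which is why GreedyEmbeddingHelper takes it as an argument.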

3. basic_decoder

tf.contrib.seq2seq.BasicDecoder(cell, helper, initial_state, output_layer)

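The basic_decoder file defines a basic Decoder class, BasicDecoder.
cell: an RNNCell, the recurrent cell used in the decode stage, usually an RNNCell wrapped with AttentionWrapper
e.g. decoder_cell = tf.nn.rnn_cell.LSTMCell(config.hidden_size)
helper: a Helper instance that determines what the decoder receives as input during training and inference
initial_state: the initial state, usually the encoder's final hidden state, which is the standard Seq2seq practice. When cell is an AttentionWrapper, the encoder state first has to be placed into the wrapper's own state, e.g. decoder_cell.zero_state(batch_size, tf.float32).clone(cell_state=encoder_state)
output_layer: the output layer, an instance of tf.layers.Layer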

4. tf.contrib.seq2seq.dynamic_decode

final_outputs, final_state, final_sequence_lengths = tf.contrib.seq2seq.dynamic_decode(
   decoder,
   output_time_major=False,
   impute_finished=False,
   maximum_iterations=None,
   parallel_iterations=32,
   swap_memory=False,
   scope=None
   )

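decoder: usually a BasicDecoder, a BeamSearchDecoder, or an instance of your own decoder class
impute_finished: Boolean. When True, the state of every finished sequence is copied through and its outputs are zeroed, so the final state and outputs hold the correct values and backpropagation ignores the time steps marked as finished. This is more robust but slows the program down.
maximum_iterations: the maximum number of decoding steps. For training it is usually set to the decoder input length, e.g. maximum_iterations=self.max_target_sequence_length; for prediction, set whatever maximum sequence length you want. Decoding stops when <eos> is produced or the maximum number of steps is reached.

final_outputs is a namedtuple of type tf.contrib.seq2seq.BasicDecoderOutput with two fields (rnn_output, sample_id)
rnn_output: [batch_size, decoder_targets_length, vocab_size], the score (logit) of every word at every decoding step; it can be used to compute the loss
sample_id: [batch_size, decoder_targets_length], tf.int32, the decoded token ids; this is the final answer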
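The two stopping conditions (an <eos> token or maximum_iterations) can be sketched as a plain Python loop for a single sequence. This is a simplified illustration and not how dynamic_decode is implemented. step_fn and toy_step are hypothetical stand-ins for one decoder step (cell, output layer and argmax).

```python
def greedy_decode(step_fn, start_token, end_token, maximum_iterations):
    """Run step_fn until it emits end_token or maximum_iterations is reached.

    step_fn(token, state) -> (next_token, next_state) is a hypothetical
    stand-in for one decoder step.
    Returns (outputs, final_state, sequence_length).
    """
    token, state = start_token, None
    outputs = []
    for _ in range(maximum_iterations):
        token, state = step_fn(token, state)
        outputs.append(token)
        if token == end_token:
            break
    return outputs, state, len(outputs)

def toy_step(token, state):
    # Toy decoder step: always emits the previous token plus one.
    return token + 1, state

# Stops early because token 3 plays the role of <eos>.
stopped_by_eos = greedy_decode(toy_step, start_token=0, end_token=3, maximum_iterations=10)
# Never sees the end token, so it stops at maximum_iterations.
stopped_by_limit = greedy_decode(toy_step, start_token=0, end_token=99, maximum_iterations=4)
print(stopped_by_eos)
print(stopped_by_limit)
```

Without maximum_iterations, a model that never emits <eos> would decode forever, which is why a limit is normally set at prediction time.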

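Mask matrix: tf.sequence_mask

Builds a mask from sequence lengths, returning a tensor of True and False values.

A common problem in NLP is that input sequences have different lengths, that is, the sentences in a text are not all equally long; a mask helps handle this. Although models such as RNNs can process variable-length input, in practice the inputs have to be batched into fixed-size tensors so that matrix operations are convenient.
s1: I like cats.
s2: He does not like cats.
Suppose the default seq_len is 5. s1 is then usually padded to length 5 and becomes:
I like cats <PAD> <PAD>
After the example is encoded as numbers, embedding begins, and the pad token also gets an embedding vector. But the pad itself carries no meaning, and letting it take part in training may even be harmful. It is therefore necessary to record which values are real with a mask tensor. The two masks for the example above are: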
1 1 1 0 0
1 1 1 1 1
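The two masks above can be reproduced with a small NumPy function that mirrors what tf.sequence_mask computes from the sequence lengths. This is a sketch of the idea, not the TensorFlow source.

```python
import numpy as np

def sequence_mask(lengths, maxlen=None):
    """Return a boolean mask of shape [len(lengths), maxlen]."""
    lengths = np.asarray(lengths)
    if maxlen is None:
        maxlen = int(lengths.max())
    # Position j of row i is True when j < lengths[i].
    return np.arange(maxlen)[None, :] < lengths[:, None]

# s1 has 3 real tokens, s2 has 5
mask = sequence_mask([3, 5], maxlen=5)
print(mask.astype(int))
```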

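tf.contrib.seq2seq.sequence_loss(training_logits, targets, masks)

training_logits: the result of the output layer
targets: the target values
masks: the result computed with tf.sequence_mask (as floats, e.g. dtype=tf.float32), used here as weights, so that <PAD> positions are not counted when computing the cross entropy.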
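How the mask acts as a weight can be shown with a NumPy sketch of a masked cross entropy. The function masked_sequence_loss is written for this illustration. It sums the weighted per-token cross entropy and divides by the sum of the weights, which is my understanding of the default averaging behaviour of sequence_loss.

```python
import numpy as np

def masked_sequence_loss(logits, targets, weights):
    """logits: [B, T, V] floats, targets: [B, T] ints, weights: [B, T] floats."""
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Cross entropy of the target id at every (batch, time) position.
    b, t = np.indices(targets.shape)
    crossent = -log_probs[b, t, targets]
    # Padded positions have weight 0 and drop out of the average.
    return (crossent * weights).sum() / weights.sum()

logits = np.zeros((2, 3, 4))                      # uniform predictions
targets = np.array([[1, 2, 0], [3, 1, 2]])        # last token of row 0 is <PAD>
weights = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
loss = masked_sequence_loss(logits, targets, weights)

# Changing the logits at the padded position must not change the loss.
changed = logits.copy()
changed[0, 2] = [9.0, 0.0, 0.0, 0.0]
loss_changed = masked_sequence_loss(changed, targets, weights)
print(loss, loss_changed)
```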

tf.argmax

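Returns the index of the largest value of a tensor along a given dimension; commonly used when computing metrics such as accuracy.
With axis=0, it returns the index of the largest value in each column.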

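import tensorflow as tf  # TensorFlow 1.x API (tf.Session)

decoder_logits=[
    [
        [1,2,3,4],
        [2,3,4,5]
    ],[
        [6,7,8,9],
        [7,8,9,0]
    ]
]
# Index of the largest value along the last axis (axis 2)
out = tf.argmax(decoder_logits, 2)
with tf.Session() as sess:
    print(sess.run(out))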

>>>
[[3 3]
 [3 2]]
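The effect of the axis argument, and the use of argmax for accuracy, can also be checked on a 2-D array. np.argmax is used here as a stand-in for tf.argmax, since both follow the same axis convention. The labels are made up for the example.

```python
import numpy as np

scores = np.array([[1, 9, 3],
                   [7, 2, 8]])

per_column = np.argmax(scores, axis=0)   # one index per column
per_row = np.argmax(scores, axis=1)      # one index per row
print(per_column)
print(per_row)

# Accuracy: treat each row as class scores and compare with made-up labels.
labels = np.array([1, 0])
accuracy = np.mean(per_row == labels)
print(accuracy)
```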

tf.nn.embedding_lookup(params, ids)

For each id in ids, looks up row id of params.
https://www.cnblogs.com/gaofighting/p/9625868.html
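For a single params tensor, the lookup behaves like integer-array indexing in NumPy. The sketch below uses that equivalence with a toy table; the values are made up.

```python
import numpy as np

# Toy embedding table: [vocab_size=4, embed_size=2]
params = np.array([[0.0, 0.1],
                   [1.0, 1.1],
                   [2.0, 2.1],
                   [3.0, 3.1]])
# A batch of token ids: [batch_size=2, seq_len=2]
ids = np.array([[3, 0],
                [1, 1]])

# Each id is replaced by the corresponding row of params.
embedded = params[ids]
print(embedded.shape)
print(embedded[0, 0])
```

The result has the shape of ids with embed_size appended, here [2, 2, 2].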

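tf.nn.bidirectional_dynamic_rnn: bidirectional RNN

# cell_fw, cell_bw, encoder_inputs and encoder_inputs_length are placeholder names
((encoder_fw_outputs, encoder_bw_outputs),
 (encoder_fw_final_state, encoder_bw_final_state)) = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, encoder_inputs,
    sequence_length=encoder_inputs_length, dtype=tf.float32)

outputs is (output_fw, output_bw), a tuple of the forward cell's output tensor and the backward cell's output tensor. With time_major=False, each tensor has shape [batch_size, max_time, depth]. In practice they are concatenated with tf.concat(outputs, 2).
output_states is (output_state_fw, output_state_bw), a tuple of the final hidden states of the forward and backward passes.
With LSTM cells, output_state_fw and output_state_bw are of type LSTMStateTuple.
An LSTMStateTuple consists of (c, h), the memory cell and the hidden state respectively.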
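The shapes involved can be checked with a NumPy sketch. LSTMState below is a hypothetical stand-in for LSTMStateTuple. Concatenating the forward and backward states is one common way to merge them, shown here as an assumption rather than the only option.

```python
from collections import namedtuple
import numpy as np

# Hypothetical stand-in for tf.nn.rnn_cell.LSTMStateTuple
LSTMState = namedtuple("LSTMState", ["c", "h"])

batch_size, max_time, depth = 2, 3, 4
output_fw = np.zeros((batch_size, max_time, depth))
output_bw = np.ones((batch_size, max_time, depth))
state_fw = LSTMState(c=np.zeros((batch_size, depth)), h=np.zeros((batch_size, depth)))
state_bw = LSTMState(c=np.ones((batch_size, depth)), h=np.ones((batch_size, depth)))

# Same role as tf.concat(outputs, 2): join along the feature axis.
encoder_outputs = np.concatenate([output_fw, output_bw], axis=2)

# One possible way to merge the final states: c with c, h with h.
encoder_state = LSTMState(
    c=np.concatenate([state_fw.c, state_bw.c], axis=1),
    h=np.concatenate([state_fw.h, state_bw.h], axis=1))

print(encoder_outputs.shape)
print(encoder_state.h.shape)
```

After concatenation the feature size doubles, so a decoder that consumes these tensors needs a matching hidden size.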

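Error: list indices must be integers or slices, not tuple

This happens when a nested Python list is indexed with a multi-dimensional slice such as a[:, :-1]. Convert the list to an array with np.array first.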
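A short reproduction of the error and the fix, using a made-up nested list:

```python
import numpy as np

seq_targets = [[3, 7, 2], [1, 2, 0], [9, 2, 0]]

# A nested Python list does not support [rows, columns] indexing.
try:
    seq_targets[:, :-1]
except TypeError as err:
    message = str(err)
    print(message)

# After conversion to a NumPy array the same slice works.
fixed = np.array(seq_targets)[:, :-1]
print(fixed)
```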

tf.concat(): tensor concatenation

https://blog.csdn.net/leviopku/article/details/82380118
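The role of the axis argument can be illustrated with np.concatenate, which follows the same axis convention as tf.concat. The arrays are made up for the example.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# axis=0 stacks the rows of b under a; axis=1 appends the columns of b to a.
rows = np.concatenate([a, b], axis=0)
cols = np.concatenate([a, b], axis=1)
print(rows.shape, cols.shape)
print(cols)
```

All dimensions other than the one being concatenated must match.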

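NumPy array slicing

This is NumPy's slicing operation. The general form is num[a:b, c:d]; read it by splitting at the comma.
The part before the comma is the range of row indices to take from num (a to b-1), and the part after the comma is the range of column indices (c to d-1).
In short, the row index comes first and the column index second.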
https://blog.csdn.net/qq_41375609/article/details/95027651

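import numpy as np
import tensorflow as tf

seq_targets = [
  [3,7,2],
  [1,2,0],
  [9,2,0]
]
seq_targets = np.array(seq_targets, dtype=np.int32)
# Drop the last column of every row
seq_targets = seq_targets[:, :-1]
print(seq_targets)
# Assumption: tokens_go holds the <GO> id (1 here, inferred from the output below)
tokens_go = tf.constant([1, 1, 1], dtype=tf.int32)
decoder_inputs = tf.concat([tf.reshape(tokens_go, [-1, 1]), seq_targets], 1)
with tf.Session() as sess:
    print(sess.run(decoder_inputs))

>>>
[[3 7]
 [1 2]
 [9 2]]

[[1 3 7]
 [1 1 2]
 [1 9 2]]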