LSTM原理、源码、Demo及习题

全面整理LSTM相关原理，源码，以及开发demo，设计习题。如转载请注明转载出处。

LSTM 框架

lstm.png

lstm 由3个门和一个当前细胞输出值也就是 $\tilde{C}_t$ 来控制输入

遗忘门 - 表示 $h_{t-1}$ 有多少被遗忘. $f_t = \sigma (W_{if}x_t + b_if + W_{hf}h_{(t-1)} + b_hf) = \sigma (W_{f}[x_t,h_{(t-1)}] + b_f)$
输入门 - 表示当前时刻有多少被保存下来。 $i_t = \sigma (W_{ii}x_t + b_ii + W_{hi}h_{(t-1)} + b_hi) = \sigma (W_{i}[x_t,h_{(t-1)}] + b_i)$
输出门 - 控制多少信息被输出到隐层。 $o_t = \sigma (W_{io}x_t + b_io + W_{ho}h_{(t-1)} + b_ho) = \sigma (W_{o}[x_t,h_{(t-1)}] + b_o)$
当前Cell输出 $\tilde{C}_t$ 用 $g_t 表示$ - 当前t时刻的实际 cell结果, $\tilde{C}_t = g_t = tanh(W_{ig}x_t + b_ig + W_{hg}h_{(t-1)} + b_hg) = tanh(W_{g}[x_t,h_{(t-1)}] + b_g)$

通过3个门以及 $\tilde{C}_t = g_t$ 计算如下:

$c_t$ - 当前细胞输出。 $c_t = f_{t}*c_{t-1} + i_{t}*g_t$
$h_t$ - 隐层输出。 $h_t = o_{t}*tanh(c_t)$

参数计算

一共3个门加一个 $g_t$ ，所以一共 4组参数 $(W_f, b_f), (W_i, b_i), (W_o, b_o), (W_g, b_g)$ , $[x_t,h_{(t-1)}]$ size是 input_size+hidden_size, 因为 $W * [x_t,h_{(t-1)}]$ 输出的维度是与 $h_t$ 一样的也就是 hidden_size, 所以 $W$ 的维度是 (input_size+hidden_size, hidden_size). $b$ 的size就是 hdden_size. 所以总共的参数量就是:

 4 * ((input_size+hidden_size) * hidden_size + hidden_size)

GRU

gru.png

GRU由2个门以及一个隐层输出值 $\tilde{h}_t$ 也叫做 $n_t$ 来控制最终的 $h_t$

$r_t$ - 重置门. $r_t = \sigma (W_{ir}x_t + b_ir + W_{hr}h_{(t-1)} + b_hr) = \sigma (W_r[x_t, h_{(t-1)}] + b_r)$
$z_t$ - 更新门. $z_t = \sigma (W_{iz}x_t + b_iz + W_{hz}h_{(t-1)} + b_hz) = \sigma (W_z[x_t, h_{(t-1)}] + b_z)$
$\tilde{h}_t$ - 隐层细胞输出. $\tilde{h}_t = n_t = tanh(W_{in}x_t + b_in + r_t*(W_{hn}h_(t-1) + b_hn)) = tanh(W_{n}[x_t, r_t*h_{(t-1)}] + b_n)$
$h_t$ - 最后输出. $h_t = (1-z_{t})*h_{t-1} + z_{t}*\tilde{h}_t = (1-z_{t})*h_{t-1} + z_{t}*n_t$

从上面来看，GRU 实际上比LSTM 少了一组参数。在数据量较大的时候使用lstm，而数据量较少使用GRU. 同时GRU不像lstm，有 $c_t, h_t$ 两个输出，GRU只有一个 $h_t$

pytorch lstm 解析

$\begin{array}{ll} \\ i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\ f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{(t-1)} + b_{hf}) \\ g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{(t-1)} + b_{hg}) \\ o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{(t-1)} + b_{ho}) \\ c_t = f_t * c_{(t-1)} + i_t * g_t \\ h_t = o_t * \tanh(c_t) \\ \end{array}$

Parameters

input_size – The number of expected features in the input x
hidden_size – The number of features in the hidden state h
num_layers – Number of recurrent layers. E.g., setting num_layers=2 would mean stacking two LSTMs together to form a stacked LSTM, with the second LSTM taking in outputs of the first LSTM and computing the final results. Default: 1
bias – If False, then the layer does not use bias weights b_ih and b_hh. Default: True
batch_first – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False
dropout – If non-zero, introduces a Dropout layer on the outputs of each LSTM layer except the last layer, with dropout probability equal to dropout. Default: 0
bidirectional – If True, becomes a bidirectional LSTM. Default: False

Inputs: input, (h_0, c_0)

input of shape (seq_len, batch, input_size): tensor containing the features of the input sequence. The input can also be a packed variable length sequence. See torch.nn.utils.rnn.pack_padded_sequence or torch.nn.utils.rnn.pack_sequence for details.
h_0 of shape (num_layers * num_directions, batch, hidden_size): tensor containing the initial hidden state for each element in the batch. If the LSTM is bidirectional, num_directions should be 2, else it should be 1.
c_0 of shape (num_layers * num_directions, batch, hidden_size): tensor containing the initial cell state for each element in the batch. If (h_0, c_0) is not provided, both h_0 and c_0 default to zero.

Outputs: output, (h_n, c_n)

output of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features (h_t) from the last layer of the LSTM, for each t. If a :class:torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence. For the unpacked case, the directions can be separated using output.view(seq_len, batch, num_directions, hidden_size), with forward and backward being direction 0 and 1 respectively. Similarly, the directions can be separated in the packed case. 从shape来看, 实际上这是所有 h_t 的输出. 所以对于 seq2seq 来说这个结果就够用了。

h_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t = seq_len. Like output, the layers can be separated using h_n.view(num_layers, num_directions, batch, hidden_size) and similarly for c_n. 从这个shaple num_layers * num_directions 来看，说明这是每一层的最后一个输出。对于最上面的一层来说，这个是包含最后一个输出的。也就是 -1, batch, hidden_size 就是最后一个输出。特别注意，即使设置了 batch _first=True, h_n 的维度依然是 num_layers * num_directions, batch, hidden_size 如果要转换需要 h_n.transpose(0, 1) 来转换到 batch, num_layers * num_directions, hidden_size

c_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the cell state for t = seq_len. 与 h_n 类似, 是每一层的 cell_state 输出。

变长序列处理

我们知道 lstm, 处理的是定长序列，那么，对于nlp来说，变长序列该如何处理?

对于变长序列处理，采用了:

pack_padded_sequence: 将变长序列打包
pad_packed_sequence: 将打包的结果解包

pack_padded_sequence

在理解这两个函数之前，我们看看lstm是如何进行运行的。

No.	$w_0$	$w_1$	$w_2$	$w_3$	$w_4$
0	I	love	mom	'	cooking
1	Yes	0	0	0	0
2	No	way	0	0	0
3	I	love	you	too	!
4	This	is	the	shit	0

一共有5个序列，长度各种各样。其中的 0 表示的是padding. 因为一个batch必须是有同样维度的才可以。

$shape=BatchSize \times SequeceLength \times HiddenDim=1 \times 5 \times *$

对于lstm的实际运行中会将长度一样的放在，这样能够批量运行同一批。所以会按照长度进行排序。重新排列的结果如下:

No.	$w_0$	$w_1$	$w_2$	$w_3$	$w_4$
0	I	love	mom	'	cooking
3	I	love	you	too	!
4	This	is	the	shit
2	No	way
1	Yes

这个过程就是pack的过程。pack之后，会重新将长度一样的放在一起，因为是长度一样的放在一起，那么，也就是将padding的0全部去掉后的排列结果。

示例图:

pack_pad.jpg

转载链接

batch_sizes: 看着其中不同的颜色。绿色:5, 橘色: 4 以此类推。那么, pytorch在实际运算的时候是如何运算的呢？

批量运算，会一次性将上面所有数据进行运算。大体流程:

设置循环步数，为 max_length=batch_sizes[0], 这里就是5
开始循环 i=0:
- 设置输入为 batch_size[i] 进入 lstm cell 批量运算. (i=0 时是绿色的5个, i=1时是橘色的4个，一次类推)
- i = i + 1

这样每一次处理实际上是运算递减的，同时，进行的也是序列实际长度的lstm运算。（PS: 早期，因为看到padding, 所以以为会将padding一起运算，这样就不用进行pack了，但是, 这样会增加运算量，同时, 对于最后一个输出 $h_n, c_n$ 不是实际序列最后一个输出，而是 padding后的输出。而看了pytorch源码后，理解，在每一次的 lstm cell 运算，会重新取batch, 而这个batch是变化，与实际sequence长度一致. 从这个角度来看，我觉得之所以pack，对长度排序，是为了方便每一次 lstm cell 取batch 方便运算; 如果不排序，每一次通过mask取会在lstm循环运算的时候效率较低)

pytorch c++ 源码 aten/src/ATen/native/RNN.cpp:

template<typename hidden_type, typename cell_params>

struct PackedLayer : Layer<PackedSequence, hidden_type, cell_params> {

using output_type = typename Layer<PackedSequence, hidden_type, cell_params>::output_type;

  PackedLayer(Cell<hidden_type, cell_params>& cell)
    : cell_(cell) {};

output_type operator()(

    const PackedSequence& input, 
    const hidden_type& input_hidden, 
    const cell_params& params) const override
{
    std::vector<at::Tensor> step_outputs;
    
    std::vector<hidden_type> hiddens;
    int64_t input_offset = 0;
    int64_t num_steps = input.batch_sizes.size(0);
    int64_t* batch_sizes = input.batch_sizes.data<int64_t>();
    int64_t last_batch_size = batch_sizes[0];

    // Batch sizes is a sequence of decreasing lengths, which are offsets
    // into a 1D list of inputs. At every step we slice out batch_size elements,
    // and possibly account for the decrease in the batch size since the last step,
    // which requires us to slice the hidden state (since some sequences
    // are completed now). The sliced parts are also saved, because we will need
    // to return a tensor of final hidden state.
    auto hidden = input_hidden;
    for (int64_t i = 0; i < num_steps; ++i) {
      int64_t batch_size = batch_sizes[i];
      auto step_input = input.data.narrow(0, input_offset, batch_size);
      input_offset += batch_size;

      int64_t dec = last_batch_size - batch_size;
      if (dec > 0) {
        hiddens.push_back(hidden_slice(hidden, last_batch_size - dec, last_batch_size));
        hidden = hidden_slice(hidden, 0, last_batch_size - dec);
      }

      last_batch_size = batch_size;
      hidden = cell_(step_input, hidden, params);
      step_outputs.push_back(hidden_as_output(hidden));
    }
    hiddens.push_back(hidden);
    std::reverse(hiddens.begin(), hiddens.end());

    return { PackedSequence{ at::cat(step_outputs, 0), input.batch_sizes }, hidden_concat(hiddens) };
  }

  Cell<hidden_type, cell_params>& cell_;
};

解释:

num_steps: 就是最长的batch 其实也就是最长的sequence length
input.data.narrow(0, input_offset, batch_size): 从 batch_size 中取每一步 lstm cell 要运算的 所有sequence $x_t$ , 对应到代码就是 $x_{input\_offset}$
dec = last_batch_size - batch_size: 对于其他没有参与到 lstm cell 运算的，用0来补上，保证lstm运算后，所有的 sequence hidden layer 长度是一样的。

结论: 在pack后的变长序列，运算每一步都是有效运算。所以在来看 lstm 的输出

output: shape (seq_len, batch, num_directions * hidden_size) 是包含padding的序列长度(padding的为0)
h_n: shape (num_layers * num_directions, batch, hidden_size), 实际的sequence 长度计算的lstm 最后一个隐层，与padding无关. 从shape看是所有layer的最后一个隐层.
c_n: shape (num_layers * num_directions, batch, hidden_size), 实际的sequence 长度计算的 lstm 最后一个 cell state 与padding无关. 从shape看是所有layer的最后一层

enforce_sorted 参数

这个参数特别说明一下, 默认是 True, 也就是说输入的batch sequence 必须是按照长度降序排好序的。

如果这个参数是 False, 那么，这个排序的工作会由 pack_padded_sequence 来做。

lstm demo

下面的demo包含了前面说的参数设置。

import torch
from torch.nn.modules.rnn import LSTM
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

X_SORTED = torch.tensor([
    [
        [1, 2], [3, 4], [0, 0]
    ],
    [
        [5, 6], [7, 8], [0, 0]
    ],
    [
        [9, 10], [0, 0], [0, 0]
    ]


], dtype=torch.float)

X_UNSORTED = torch.tensor([
    [
        [1, 2], [3, 4], [0, 0]
    ],

    [
        [9, 10], [0, 0], [0, 0]
    ],
    [
        [5, 6], [7, 8], [0, 0]
    ]
], dtype=torch.float)

sequence_length_sorted = torch.tensor([2, 2, 1], dtype=torch.long)
sequence_length_unsorted = torch.tensor([2, 1, 2], dtype=torch.long)


def demo_pack_padded_sequence(x, sequence_length, is_sorted):
    print(f"x: {x.numpy()}")
    print(f"Batch size: {x.shape[0]}, "
          f"Sequence length: {x.shape[1]}, "
          f"hidden dim: {x.shape[2]}")

    print(f"sequence length: {sequence_length.numpy()}")

    pack = pack_padded_sequence(input=x,
                                lengths=sequence_length,
                                batch_first=True,
                                enforce_sorted=is_sorted)

    print(f"pack: {pack}")

    pad = pad_packed_sequence(sequence=pack, batch_first=True)
    print(f"pad: {pad}")

    lstm = LSTM(input_size=x.shape[-1],
                hidden_size=4,
                num_layers=1,
                batch_first=True,
                bidirectional=False)
    output, (hn, cn) = lstm(pack)

    print("output", "-" * 80)
    print(output)

    pad_output = pad_packed_sequence(output, batch_first=True, padding_value=0.0)
    print("+" * 80)
    print(f"output: {pad_output}")
    print(f"hn: {hn}")
    print(f"cn: {cn}")
    print(f"output[:-1:]: {output[:-1:]}")

demo_pack_padded_sequence(x=X_SORTED,
                              sequence_length=sequence_length_sorted,
                              is_sorted=True)

x: [[[ 1.  2.]
  [ 3.  4.]
  [ 0.  0.]]

 [[ 5.  6.]
  [ 7.  8.]
  [ 0.  0.]]

 [[ 9. 10.]
  [ 0.  0.]
  [ 0.  0.]]]
Batch size: 3, Sequence length: 3, hidden dim: 2
sequence length: [2 2 1]
pack: PackedSequence(data=tensor([[ 1.,  2.],
        [ 5.,  6.],
        [ 9., 10.],
        [ 3.,  4.],
        [ 7.,  8.]]), batch_sizes=tensor([3, 2]), sorted_indices=None, unsorted_indices=None)
pad: (tensor([[[ 1.,  2.],
         [ 3.,  4.]],

        [[ 5.,  6.],
         [ 7.,  8.]],

        [[ 9., 10.],
         [ 0.,  0.]]]), tensor([2, 2, 1]))
output --------------------------------------------------------------------------------
PackedSequence(data=tensor([[-0.0644,  0.1670,  0.1466,  0.0274],
        [-0.0926,  0.1344,  0.4401,  0.0731],
        [-0.0940,  0.1009,  0.5912,  0.0238],
        [-0.1254,  0.2073,  0.3820,  0.0958],
        [-0.1485,  0.1642,  0.5456,  0.0616]], grad_fn=<CatBackward>), batch_sizes=tensor([3, 2]), sorted_indices=None, unsorted_indices=None)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
output: (tensor([[[-0.0644,  0.1670,  0.1466,  0.0274],
         [-0.1254,  0.2073,  0.3820,  0.0958]],

        [[-0.0926,  0.1344,  0.4401,  0.0731],
         [-0.1485,  0.1642,  0.5456,  0.0616]],

        [[-0.0940,  0.1009,  0.5912,  0.0238],
         [ 0.0000,  0.0000,  0.0000,  0.0000]]], grad_fn=<TransposeBackward0>), tensor([2, 2, 1]))
hn: tensor([[[-0.1254,  0.2073,  0.3820,  0.0958],
         [-0.1485,  0.1642,  0.5456,  0.0616],
         [-0.0940,  0.1009,  0.5912,  0.0238]]], grad_fn=<StackBackward>)
cn: tensor([[[-0.1698,  0.4718,  0.7097,  0.4341],
         [-0.1641,  0.3882,  0.8470,  1.2613],
         [-0.0996,  0.2737,  0.8373,  0.9169]]], grad_fn=<StackBackward>)
output[:-1:]: (tensor([[-0.0644,  0.1670,  0.1466,  0.0274],
        [-0.0926,  0.1344,  0.4401,  0.0731],
        [-0.0940,  0.1009,  0.5912,  0.0238],
        [-0.1254,  0.2073,  0.3820,  0.0958],
        [-0.1485,  0.1642,  0.5456,  0.0616]], grad_fn=<CatBackward>), tensor([3, 2]), None)

特别提示: hn 的结果是包含在 output 中的, 如何从 output 中提取出 hn 参考习题4

lstm 应用

从前面阐述，明白了lstm实际的原理和输出。那么，在实际应用的时候，如果是使用每一个时间步的隐层进行运算, 那么，要注意将mask进入运算，因为输出的隐层是包含padding部分的。当然可以利用,padding的值全是0，是有些便利计算方法的，但是不推荐，要使用mask运算。

如果是使用lstm最后一个隐层的输出，那么，直接使用就可以了。

习题

sequence 长度是 $100$ , $x_t$ embedding 维度是 $200$ , 隐层输出维度是 $300$ , 计算 lstm 参数是多少?
lstm 对变长序列padding，在实际计算lstm cell的时候 padding 部分是否参与计算? 如果不参与计算，lstm是如何进行变长计算的?
lstm 输出 output(也就是每个时间步的hidden输出) 是否包含 h_n 输出？如果不包含，请说明情况?
使用h_n, 如何提取最后一个最后的最后一层的 hidden 输出（最后的hidden输出常常作为整个句子的编码结果); 在不使用 h_n 的情况下, 使用 lstm 输出 output(也就是每个时间步的hidden输出), 如何提取出最后一个最后一层的 hidden 输出?
是否注意到了mask的使用？

最后编辑于：2020.04.26 19:59:10

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 199,393评论 5赞 467
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 83,790评论 2赞 376
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 146,391评论 0赞 330
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 53,703评论 1赞 270
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 62,613评论 5赞 359
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 48,003评论 1赞 275
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 37,507评论 3赞 390
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,158评论 0赞 254
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 40,300评论 1赞 294
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,256评论 2赞 317
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 37,274评论 1赞 328
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 32,984评论 3赞 316
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 38,569评论 3赞 303
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 29,662评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 30,899评论 1赞 255
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 42,268评论 2赞 345
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 41,840评论 2赞 339