How LSTM deals with variable-length sequences


Problem description


I found the following piece of code in Chapter 7, Section 1 of Deep Learning with Python:

from keras.models import Model
from keras import layers
from keras import Input

text_vocabulary_size = 10000
question_vocabulary_size = 10000
answer_vocabulary_size = 500

# Our text input is a variable-length sequence of integers.
# Note that we can optionally name our inputs!
text_input = Input(shape=(None,), dtype='int32', name='text')

# Which we embed into a sequence of vectors of size 64
embedded_text = layers.Embedding(text_vocabulary_size, 64)(text_input)

# Which we encoded in a single vector via a LSTM
encoded_text = layers.LSTM(32)(embedded_text)

# Same process (with different layer instances) for the question
question_input = Input(shape=(None,), dtype='int32', name='question')
embedded_question = layers.Embedding(question_vocabulary_size, 32)(question_input)
encoded_question = layers.LSTM(16)(embedded_question)

# We then concatenate the encoded question and encoded text
concatenated = layers.concatenate([encoded_text, encoded_question], axis=-1)

# And we add a softmax classifier on top
answer = layers.Dense(answer_vocabulary_size, activation='softmax')(concatenated)

# At model instantiation, we specify the two inputs and the output:
model = Model([text_input, question_input], answer)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])

As you can see, this model's inputs carry no shape information for the raw data; after the Embedding layer, the inputs to the LSTMs (i.e. the outputs of the Embedding layers) are variable-length sequences.

So I want to know:

  • in this model, how does Keras determine the number of lstm_unit in the LSTM layer?
  • how does it deal with variable-length sequences?

Additional information: to explain what lstm_unit is (I don't know what to call it, so I just show its image):

Solution

The provided recurrent layers inherit from a base implementation, keras.layers.Recurrent, which includes the option return_sequences, defaulting to False. This means that by default, a recurrent layer will consume variable-length inputs and ultimately produce only the layer's output at the final sequence step.

As a result, there is no problem using None to specify a variable-length input sequence dimension.

However, if you wanted the layer to return the full sequence of outputs, i.e. the tensor of outputs for each step of the input sequence, then you'd have to further deal with the variable size of that output.
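As an illustration, here is a minimal sketch contrasting the two behaviours (the vocabulary and layer sizes are made up):

from keras.models import Model
from keras import layers
from keras import Input

# Sizes below (10000, 64, 32) are arbitrary, for illustration only.
inp = Input(shape=(None,), dtype='int32')   # variable-length sequence of token ids
emb = layers.Embedding(10000, 64)(inp)      # -> (batch, None, 64)

last_step = layers.LSTM(32)(emb)                        # default return_sequences=False -> (batch, 32)
full_seq = layers.LSTM(32, return_sequences=True)(emb)  # -> (batch, None, 32), still variable-length

print(Model(inp, last_step).output_shape)   # (None, 32)
print(Model(inp, full_seq).output_shape)    # (None, None, 32)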

You could do this by having the next layer accept a variable-sized input as well, punting on the problem until later in your network, where eventually you either must calculate a loss function from some variable-length thing, or else calculate some fixed-length representation before continuing on to later layers, depending on your model.
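For the fixed-length-representation option, a global pooling layer over the time axis is one common choice. A sketch, again with made-up sizes:

from keras.models import Model
from keras import layers
from keras import Input

inp = Input(shape=(None,), dtype='int32')
emb = layers.Embedding(10000, 64)(inp)              # sizes are illustrative
seq = layers.LSTM(32, return_sequences=True)(emb)   # (batch, None, 32), variable length
fixed = layers.GlobalMaxPooling1D()(seq)            # (batch, 32), fixed length
model = Model(inp, fixed)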

Or you could do it by requiring fixed-length sequences, possibly padding the end of the sequences with special sentinel values that merely indicate an empty sequence item, purely to pad out the length.
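A sketch of that padding approach, assuming token id 0 is reserved as the padding sentinel:

from keras.preprocessing.sequence import pad_sequences
from keras import layers

sequences = [[34, 27, 5], [12, 8], [3, 19, 44, 7]]   # made-up token ids
padded = pad_sequences(sequences, maxlen=4, padding='post', value=0)
# [[34 27  5  0]
#  [12  8  0  0]
#  [ 3 19 44  7]]

# mask_zero=True tells mask-aware downstream layers such as LSTM
# to skip the padded steps.
embedding = layers.Embedding(10000, 64, mask_zero=True)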

Separately, the Embedding layer is a very special layer that is built to handle variable-length inputs as well. The output shape will have a different embedding vector for each token of the input sequence, so the shape will be (batch size, sequence length, embedding dimension). Since the next layer is an LSTM, this is no problem ... it will happily consume variable-length sequences as well.

But as it is mentioned in the documentation on Embedding:

input_length: Length of input sequences, when it is constant.
      This argument is required if you are going to connect
      `Flatten` then `Dense` layers upstream
      (without it, the shape of the dense outputs cannot be computed).

If you want to go directly from Embedding to a non-variable-length representation, then you must supply the fixed sequence length as part of the layer.
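For example, a sketch (with made-up sizes) that fixes the sequence length at 20 so that Flatten and Dense can compute their shapes:

from keras.models import Sequential
from keras import layers

model = Sequential([
    layers.Embedding(10000, 64, input_length=20),  # -> (batch, 20, 64)
    layers.Flatten(),                              # -> (batch, 20 * 64) = (batch, 1280)
    layers.Dense(1, activation='sigmoid'),
])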

Finally, note that when you express the dimensionality of the LSTM layer, such as LSTM(32), you are describing the dimensionality of the output space of that layer.

# example sequence of input, e.g. batch size is 1.
[
 [34], 
 [27], 
 ...
] 
--> # feed into embedding layer

[
  [64-d representation of token 34 ...],
  [64-d representation of token 27 ...],
  ...
] 
--> # feed into LSTM layer

[32-d output vector of the final sequence step of LSTM]
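The same flow as runnable code (sizes as in the diagram; the token ids are made up):

import numpy as np
from keras.models import Model
from keras import layers
from keras import Input

inp = Input(shape=(None,), dtype='int32')
emb = layers.Embedding(10000, 64)(inp)   # 64-d representation per token
out = layers.LSTM(32)(emb)               # 32-d output at the final step
model = Model(inp, out)

short = np.array([[34, 27, 5]])             # batch of 1, length 3
longer = np.array([[34, 27, 5, 9, 2, 18]])  # batch of 1, length 6
print(model.predict(short).shape)    # (1, 32)
print(model.predict(longer).shape)   # (1, 32) -- same output size either way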

In order to avoid the inefficiency of a batch size of 1, one tactic is to sort your input training data by the sequence length of each example, and then group it into batches of a common sequence length, such as with a custom Keras data generator.

This has the advantage of allowing large batch sizes, especially if your model needs something like batch normalization, involves GPU-intensive training, or simply benefits from a less noisy estimate of the gradient for batch updates. Yet it still lets you work with an input training data set that has different sequence lengths for different examples.

More importantly though, it also has the big advantage that you do not have to manage any padding to ensure common sequence lengths in the input.
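A minimal sketch of that bucketing tactic using keras.utils.Sequence (the class name, padding value, and batch size are made up):

import numpy as np
from keras.utils import Sequence

class BucketedGenerator(Sequence):
    def __init__(self, sequences, labels, batch_size=32):
        # Sort examples by length so each batch holds (nearly) equal-length sequences.
        order = np.argsort([len(s) for s in sequences])
        self.sequences = [sequences[i] for i in order]
        self.labels = [labels[i] for i in order]
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.sequences) / float(self.batch_size)))

    def __getitem__(self, idx):
        lo = idx * self.batch_size
        batch = self.sequences[lo:lo + self.batch_size]
        y = np.array(self.labels[lo:lo + self.batch_size])
        # Pad only to the longest sequence within this batch,
        # not to a global maximum.
        maxlen = max(len(s) for s in batch)
        x = np.array([s + [0] * (maxlen - len(s)) for s in batch])
        return x, y

# Usage (hypothetical data): model.fit_generator(BucketedGenerator(train_seqs, train_labels), ...)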
