Understanding stateful LSTM


Problem description


I'm going through this tutorial on RNNs/LSTMs and I'm having quite a hard time understanding stateful LSTMs. My questions are as follows:


In the Keras docs on RNNs, I found out that the hidden state of the sample in i-th position within the batch will be fed as input hidden state for the sample in i-th position in the next batch. Does that mean that if we want to pass the hidden state from sample to sample we have to use batches of size 1 and therefore perform online gradient descent? Is there a way to pass the hidden state within a batch of size >1 and perform gradient descent on that batch?


In the tutorial's paragraph 'Stateful LSTM for a One-Char to One-Char Mapping', code is given that uses batch_size = 1 and stateful = True to learn to predict the next letter of the alphabet given a letter of the alphabet. In the last part of the code (line 53 to the end of the complete code), the model is tested starting with a random letter ('K'): it predicts 'B', then given 'B' it predicts 'C', and so on. It seems to work well except for 'K'. However, I tried the following tweak to the code (the last part only; I kept lines 52 and above):

    # demonstrate a random starting point
    letter1 = "M"
    seed1 = [char_to_int[letter1]]
    x = numpy.reshape(seed1, (1, len(seed1), 1))  # build the input from the chosen starting letter
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    print(int_to_char[seed1[0]], "->", int_to_char[index])
    letter2 = "E"
    seed2 = [char_to_int[letter2]]
    seed = seed2
    print("New start: ", letter1, letter2)
    for i in range(0, 5):
        x = numpy.reshape(seed, (1, len(seed), 1))
        x = x / float(len(alphabet))
        prediction = model.predict(x, verbose=0)
        index = numpy.argmax(prediction)
        print(int_to_char[seed[0]], "->", int_to_char[index])
        seed = [index]
    model.reset_states()

and got this output:

    M -> B
    New start: M E
    E -> C
    C -> D
    D -> E
    E -> F

It looks like the LSTM did not learn the alphabet but just the positions of the letters: regardless of the first letter we feed in, it will always predict B (since it's the second letter), then C, and so on.


Therefore, how does keeping the previous hidden state as the initial hidden state for the current batch help with learning, given that at test time, if we start with the letter 'K' for example, the letters A to J will not have been fed in beforehand and the initial hidden state won't be the same as during training?


I want to train my LSTM on a whole book to learn how to generate sentences and perhaps also learn the author's style. How can I naturally train my LSTM on that text (input the whole text and let the LSTM figure out the dependencies between the words) instead of having to 'artificially' create batches of sentences from that book myself to train my LSTM on? I believe stateful LSTMs could help, but I'm not sure how.

Answer


  1. Having a stateful LSTM in Keras means that a Keras variable will be used to store and update the state, and in fact you can check the value of the state vector(s) at any time (that is, until you call reset_states()). A non-stateful model, on the other hand, will use an initial zero state every time it processes a batch, so it is as if you always called reset_states() after train_on_batch, test_on_batch and predict_on_batch. The explanation about the state being reused for the next batch on stateful models is just about that difference with non-stateful models; of course the state always flows within a batch, and you do not need batches of size 1 for that to happen. I see two scenarios where stateful models are useful:

  • You want to train on splits of long sequences of data, because training on their whole length would not be practical.
  • At prediction time, you want to retrieve the output for each time point in the sequence, not just at the end (either because you want to feed it back into the network or because your application needs it). I personally do that in the models I export for later integration (which are "copies" of the training model with a batch size of 1).
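
For illustration, here is a minimal sketch (a toy example of my own, not the tutorial's code; layer sizes and data are placeholders) of a stateful LSTM with a batch size larger than 1: the state set by one predict call is kept and used as the starting state of the next call, until reset_states() is called.

    # Minimal sketch of Keras stateful behaviour (toy sizes and random data are my own choices)
    import numpy as np
    from keras import backend as K
    from keras.models import Sequential
    from keras.layers import Dense, LSTM

    batch_size, timesteps, features = 4, 10, 1
    model = Sequential()
    model.add(LSTM(8, batch_input_shape=(batch_size, timesteps, features), stateful=True))
    model.add(Dense(1))
    model.compile(loss='mse', optimizer='adam')

    chunk1 = np.random.rand(batch_size, timesteps, features)  # first part of 4 long sequences
    chunk2 = np.random.rand(batch_size, timesteps, features)  # their continuations

    model.predict(chunk1, batch_size=batch_size)  # leaves one state per sample in the batch
    h = K.get_value(model.layers[0].states[0])    # inspect the hidden state, shape (batch_size, units); attribute names may differ across Keras versions
    model.predict(chunk2, batch_size=batch_size)  # sample i starts from sample i's previous state
    model.reset_states()                          # back to all-zero states, as a non-stateful model would be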


  2. I agree that the example of an RNN for the alphabet does not really seem very useful in practice; it will only work when you start with the letter A. If you want to learn to reproduce the alphabet starting at any letter, you would need to train the network with that kind of example (subsequences or rotations of the alphabet). But I think a regular feed-forward network could learn to predict the next letter of the alphabet when trained on pairs like (A, B), (B, C), etc. I think the example is meant for demonstrative purposes more than anything else.
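
For instance, a minimal sketch (my own, not from the tutorial) of building training examples so that every letter appears as a starting point:

    # Minimal sketch: (current, next) pairs and subsequences covering every starting letter
    import string

    alphabet = string.ascii_uppercase
    # every (letter, next letter) pair, as a feed-forward network could be trained on
    pairs = [(alphabet[i], alphabet[i + 1]) for i in range(len(alphabet) - 1)]
    # fixed-length subsequences starting at every position, for a sequence model
    subseqs = [alphabet[i:i + 3] for i in range(len(alphabet) - 2)]
    print(pairs[:3])    # [('A', 'B'), ('B', 'C'), ('C', 'D')]
    print(subseqs[:3])  # ['ABC', 'BCD', 'CDE']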


  3. You have probably already read it, but the popular post The Unreasonable Effectiveness of Recurrent Neural Networks shows some interesting results along the lines of what you want to do (although it does not really dive into implementation specifics). I don't have personal experience training RNNs on textual data, but there are a number of approaches you can research. You can build character-based models (like the ones in the post), where you input and receive one character at a time. A more advanced approach is to do some preprocessing on the texts and transform them into sequences of numbers; Keras includes some text preprocessing functions for that. Having one single number as the feature space is probably not going to work all that well, so you could simply turn each word into a vector with one-hot encoding or, more interestingly, have the network learn the best vector representation for each word, which is what they call an embedding. You can go even further with the preprocessing and look into something like NLTK, especially if you want to remove stop words, punctuation and things like that. Finally, if you have sequences of different sizes (e.g. you are using full texts instead of excerpts of a fixed size, which may or may not be important for you), you will need to be a bit more careful and use masking and/or sample weighting. Depending on the exact problem, you can set up the training accordingly. If you want to learn to generate similar text, the "Y" would be similar to the "X" (one-hot encoded), only shifted by one (or more) positions (in this case you may need to use return_sequences=True and TimeDistributed layers). If you want to determine the author, your output could be a softmax Dense layer.
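
As a rough illustration of that last setup, here is a minimal sketch (the vocabulary size, layer sizes and dummy data are placeholders of my own) of a word-level model where Y is X shifted by one position:

    # Minimal sketch of the "Y is X shifted by one" text-generation setup
    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense, Embedding, LSTM, TimeDistributed

    vocab_size = 2000   # placeholder vocabulary size
    seq_len = 40        # placeholder excerpt length, in tokens

    model = Sequential()
    model.add(Embedding(vocab_size, 64, input_length=seq_len))   # learned word vectors (the "embedding")
    model.add(LSTM(128, return_sequences=True))                  # one output per timestep
    model.add(TimeDistributed(Dense(vocab_size, activation='softmax')))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

    # X holds token ids for positions t..t+seq_len-1, Y the ids for t+1..t+seq_len
    X = np.random.randint(0, vocab_size, size=(32, seq_len))  # dummy data, just to show the shapes
    Y = np.expand_dims(np.roll(X, -1, axis=1), -1)
    model.fit(X, Y, epochs=1, batch_size=8)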

Hope that helps.

