PyTorch LSTM input dimension


Problem description

I'm trying to train a simple 2-layer neural network with PyTorch LSTMs and I'm having trouble interpreting the PyTorch documentation. Specifically, I'm not too sure how to shape my training data.

What I want to do is train my network on a very large dataset through mini-batches, where each batch is, say, 100 elements long. Each data element will have 5 features. The documentation states that the input to the layer should be of shape (seq_len, batch_size, input_size). How should I go about shaping the input?

I've been following this post: https://discuss.pytorch.org/t/understanding-lstm-input/31110/3 and if I'm interpreting it correctly, each mini-batch should be of shape (100, 100, 5). But in that case, what's the difference between seq_len and batch_size? Also, would this mean that the first LSTM layer should have 5 input units?
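
For concreteness, here is a minimal sketch of the shape in question; the hidden size of 32 is an arbitrary choice and is not specified anywhere above:

```python
import torch
import torch.nn as nn

# input_size=5 matches the 5 features per element; num_layers=2 gives a 2-layer LSTM.
# hidden_size=32 is an arbitrary choice for illustration only.
lstm = nn.LSTM(input_size=5, hidden_size=32, num_layers=2)

# One mini-batch: 100 time steps per sequence, 100 sequences, 5 features per step.
batch = torch.randn(100, 100, 5)  # (seq_len, batch_size, input_size)

output, (h_n, c_n) = lstm(batch)
print(output.shape)  # torch.Size([100, 100, 32]) -> (seq_len, batch_size, hidden_size)
```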

Thanks!

Recommended answer

This is an old question, but since it has been viewed 80+ times with no response, let me take a crack at it.

An LSTM network is used to predict a sequence. In NLP, that would be a sequence of words; in economics, a sequence of economic indicators; etc.

The first parameter is the length of those sequences. If your sequence data is made of sentences, then "Tom has a black and ugly cat" is a sequence of length 7 (seq_len), one element for each word, and maybe an 8th to indicate the end of the sentence.
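
As a rough sketch of what that sentence looks like as LSTM input (the toy vocabulary, the embedding size of 4 and the hidden size of 16 are all made up for illustration):

```python
import torch
import torch.nn as nn

# Toy vocabulary and sentence; the indices are arbitrary.
vocab = {"Tom": 0, "has": 1, "a": 2, "black": 3, "and": 4, "ugly": 5, "cat": 6, "[EOS]": 7}
sentence = ["Tom", "has", "a", "black", "and", "ugly", "cat", "[EOS]"]
token_ids = torch.tensor([vocab[w] for w in sentence])            # shape: (8,)

embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)  # 4 features per word
lstm = nn.LSTM(input_size=4, hidden_size=16)

x = embed(token_ids).unsqueeze(1)  # (seq_len=8, batch_size=1, input_size=4)
output, (h_n, c_n) = lstm(x)
print(output.shape)                # torch.Size([8, 1, 16])
```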

Of course, you might object "what if my sequences are of varying length?" which is a common situation.

The two most common solutions are:

  1. Pad your sequences with empty elements. For instance, if the longest sentence you have has 15 words, then encode the sentence above as "[Tom] [has] [a] [black] [and] [ugly] [cat] [EOS] [] [] [] [] [] [] []", where EOS stands for end of sentence. Suddenly, all your sequences are of length 15, which solves your issue. As soon as the [EOS] token is found, the model quickly learns that it is followed by an unlimited sequence of empty tokens [], so this approach will barely tax your network (a minimal sketch of this padding follows this list).

  2. Send mini-batches of equal length. For instance, train the network on all sentences with 2 words, then with 3, then with 4. Of course, seq_len will increase with each mini-batch, and the size of each mini-batch will vary based on how many sequences of length N you have in your data.
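
A minimal sketch of the padding in option 1, using torch.nn.utils.rnn.pad_sequence (the 5 features per step come from the question; the sequence lengths are made up for this sketch):

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Three variable-length sequences, each time step carrying 5 features
# (the lengths 7, 4 and 2 are made up for this sketch).
seqs = [torch.randn(7, 5), torch.randn(4, 5), torch.randn(2, 5)]

# Pad with zeros so every sequence matches the longest one.
# The default batch_first=False yields (max_seq_len, batch_size, input_size).
padded = pad_sequence(seqs, padding_value=0.0)
print(padded.shape)  # torch.Size([7, 3, 5])

# Optionally pack the padded batch so the LSTM skips the padded steps entirely.
lengths = torch.tensor([7, 4, 2])
packed = pack_padded_sequence(padded, lengths, enforce_sorted=False)
```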

A best-of-both-worlds approach is to divide your data into mini-batches of roughly equal size, grouping them by approximate length and adding only the necessary padding. For instance, if you mini-batch together sentences of length 6, 7 and 8, then the sequences of length 8 require no padding, whereas the sequences of length 6 require only 2 padding tokens. If you have a large dataset with sequences of widely varying length, this is the best approach.
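
One possible way to implement that grouping, sketched in plain Python (the data and the batch size of 4 are made up for illustration):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def length_bucketed_batches(sequences, batch_size):
    """Sort sequences by length, then cut consecutive chunks into mini-batches,
    so each batch is only padded up to the length of its own longest member."""
    ordered = sorted(sequences, key=len)
    for i in range(0, len(ordered), batch_size):
        chunk = ordered[i:i + batch_size]
        yield pad_sequence(chunk)  # (max_len_in_chunk, batch_size, input_size)

# Made-up data: 10 sequences of length 2..9, each step carrying 5 features.
data = [torch.randn(int(torch.randint(2, 10, (1,))), 5) for _ in range(10)]
for batch in length_bucketed_batches(data, batch_size=4):
    print(batch.shape)
```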

Option 1 is the easiest (and laziest) approach, though, and will work great on small datasets.

One last thing... Always pad your data at the end, not at the beginning.

Hope it helps.
