PyTorch: Relation between Dynamic Computational Graphs - Padding - DataLoader


Question


As far as I understand, the strength of PyTorch is supposed to be that it works with dynamic computational graphs. In the context of NLP, that means that sequences with variable lengths do not necessarily need to be padded to the same length. But if I want to use the PyTorch DataLoader, I need to pad my sequences anyway, because the DataLoader only takes tensors - and I, as a total beginner, do not want to build a customized collate_fn.


Now this makes me wonder - doesn’t this wash away the whole advantage of dynamic computational graphs in this context? Also, if I pad my sequences to feed them into the DataLoader as a tensor, with many zeros as padding tokens at the end (in the case of word ids), will it have any negative effect on my training, since PyTorch may not be optimized for computations with padded sequences (the whole premise being that it can work with variable sequence lengths in dynamic graphs), or does it simply not make any difference?


I will also post this question in the PyTorch Forum...

Thanks!

Answer


In the context of NLP, that means that sequences with variable lengths do not necessarily need to be padded to the same length.


This means that you don't need to pad sequences unless you are doing data batching, which is currently the only way to add parallelism in PyTorch. DyNet has a method called autobatching (described in detail in this paper) that does batching on the graph operations instead of the data, so that might be what you want to look into.


But if I want to use the PyTorch DataLoader, I need to pad my sequences anyway, because the DataLoader only takes tensors - and I, as a total beginner, do not want to build a customized collate_fn.


You can use the DataLoader provided you write your own Dataset class and use batch_size=1. The twist is to use numpy arrays for your variable-length sequences (otherwise default_collate will give you a hard time):

import numpy as np
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader

class FooDataset(Dataset):
    def __init__(self, data, target):
        assert len(data) == len(target)
        self.data = data
        self.target = target
    def __getitem__(self, index):
        return self.data[index], self.target[index]
    def __len__(self):
        return len(self.data)

# Variable-length sequences stored as numpy arrays so that
# default_collate turns each sample into its own tensor.
data = [[1,2,3], [4,5,6,7,8]]
data = [np.array(n) for n in data]
targets = ['a', 'b']

ds = FooDataset(data, targets)
dl = DataLoader(ds, batch_size=1)

print(list(enumerate(dl)))
# [(0, [
#  1  2  3
# [torch.LongTensor of size 1x3]
# , ('a',)]), (1, [
#  4  5  6  7  8
# [torch.LongTensor of size 1x5]
# , ('b',)])]
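
If you later do want a batch size greater than 1 without padding the whole dataset up front, one option is a collate_fn that pads per batch. The following is a minimal sketch of my own, not part of the original answer; the function name pad_collate and the choice of 0 as the padding id are assumptions:

import torch

def pad_collate(batch):
    # batch is a list of (sequence, target) pairs as returned by FooDataset
    seqs, targets = zip(*batch)
    lengths = [len(s) for s in seqs]
    # Pad every sequence with 0 (assumed padding id) up to the longest in the batch.
    padded = torch.zeros(len(seqs), max(lengths), dtype=torch.long)
    for i, s in enumerate(seqs):
        padded[i, :lengths[i]] = torch.as_tensor(s, dtype=torch.long)
    return padded, torch.tensor(lengths), list(targets)

dl = DataLoader(ds, batch_size=2, collate_fn=pad_collate)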


Now this makes me wonder - doesn’t this wash away the whole advantage of dynamic computational graphs in this context?


Fair point, but the main strength of dynamic computational graphs is (at least currently) the possibility of using debugging tools like pdb, which rapidly decreases your development time. Debugging is much harder with static computation graphs. There is also no reason why PyTorch would not implement further just-in-time optimizations or a concept similar to DyNet's auto-batching in the future.
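
For illustration (this example is mine, not from the original answer, and the model TinyModel is purely hypothetical), a breakpoint dropped into the middle of forward pauses execution while the graph is being built, so intermediate tensors can be inspected like ordinary Python objects:

import pdb
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, x):
        h = self.linear(x)
        pdb.set_trace()  # execution pauses here; h is a plain tensor you can print and inspect
        return torch.relu(h)

out = TinyModel()(torch.randn(3, 4))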


Also, if I pad my sequences to feed them into the DataLoader as a tensor with many zeros as padding tokens at the end [...], will it have any negative effect on my training [...]?


Yes, both in runtime and for the gradients. The RNN will iterate over the padding just like normal data which means that you have to deal with it in some way. PyTorch supplies you with tools for dealing with padded sequences and RNNs, namely pad_packed_sequence and pack_padded_sequence. These will let you ignore the padded elements during RNN execution, but beware: this does not work with RNNs that you implement yourself (or at least not if you don't add support for it manually).
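
Here is a minimal sketch of how these helpers are typically used with a built-in RNN such as nn.LSTM on a recent PyTorch version; the embedding/LSTM sizes, the enforce_sorted=False flag, and the example data are my own assumptions, not from the original answer:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Two padded sequences with true lengths 5 and 3 (0 is the padding id).
padded = torch.tensor([[1, 2, 3, 4, 5],
                       [6, 7, 8, 0, 0]])
lengths = torch.tensor([5, 3])

emb = nn.Embedding(10, 8, padding_idx=0)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

packed = pack_padded_sequence(emb(padded), lengths, batch_first=True,
                              enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)
# Restore a padded tensor of shape (batch, max_len, hidden); the padding steps
# were never fed to the LSTM, so they do not affect h_n or the gradients.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)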
