Examples or explanations of PyTorch dataloaders?


Question


I am fairly new to PyTorch (and have never done advanced coding). I am trying to learn the basics of deep learning using the d2l.ai textbook, but am having trouble understanding the logic behind the code for dataloaders. I read the torch.utils.data docs and am not sure what the DataLoader class is meant for, and when, for example, I am supposed to use the torch.utils.data.TensorDataset class in combination with it. For example, d2l defines a function:

from torch.utils import data  # the d2l code imports torch.utils.data as `data`

def load_array(data_arrays, batch_size, is_train=True):
    """Construct a PyTorch data iterator."""
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)
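For context, this is how I understand the function gets called; the synthetic features/labels tensors below are just my own toy example, not from the book:

import torch

# synthetic data, purely for illustration: 100 samples with 2 features each
features = torch.randn(100, 2)
labels = torch.randn(100, 1)

batch_size = 10
data_iter = load_array((features, labels), batch_size)

# each iteration yields a pair (X, y) with batch_size rows each
X, y = next(iter(data_iter))
print(X.shape, y.shape)  # torch.Size([10, 2]) torch.Size([10, 1])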


I assume this is supposed to return an iterable that iterates over different batches. However, I don't understand what the data.TensorDataset part does (it seems like there are a lot of options listed on the docs page). Also, the docs say that there are two types of datasets: iterable-style and map-style. When describing the former type, they say:


"This type of datasets is particularly suitable for cases where random reads are expensive or even improbable, and where the batch size depends on the fetched data."


What does it mean for "a random read to be expensive or improbable" and for the batch_size to depend on the fetched data? Can anyone give an example of this?


If there is any source where a CompSci noob like me can learn these basics, I'd really appreciate tips!

Many thanks!

Answer


I'll give you an example of how to use dataloaders and explain the steps:


Dataloaders are iterables over the dataset. So when you iterate over one, it returns B randomly selected samples from the dataset (each consisting of the data sample and the target/label), where B is the batch size. To create such a dataloader you first need a class which inherits from the PyTorch Dataset class. PyTorch ships a standard implementation of this class, TensorDataset, but the usual way is to write your own. Here is an example for image classification:

import os

import numpy as np
import torch
from PIL import Image


class YourImageDataset(torch.utils.data.Dataset):
    def __init__(self, image_folder):
        self.image_folder = image_folder
        self.images = os.listdir(image_folder)

    # get one sample
    def __getitem__(self, idx):
        image_file = self.images[idx]

        image = Image.open(os.path.join(self.image_folder, image_file))
        image = np.array(image)

        # normalize pixel values to [0, 1]
        image = image / 255

        # convert to a tensor and move the channel axis first:
        # (H, W, C) -> (C, H, W), here assumed to be RGB 512x512, i.e. (3, 512, 512)
        image = torch.Tensor(image).permute(2, 0, 1)

        # get the label; in this case the label was noted in the name of the
        # image file, e.g. 1_image_28457.png, where 1 is the label and the
        # number at the end is just an id or something
        target = int(image_file.split("_")[0])
        target = torch.tensor(target)

        return image, target

    def __len__(self):
        return len(self.images)


To get an example image you can index into the dataset, which calls the __getitem__ function. It then returns the tensor of the image matrix and the tensor of the label at that index. For example:

dataset = YourImageDataset("/path/to/image/folder")
image, target = dataset[0]  # get the sample at index 0


Alright, so now you have created the class which preprocesses and returns ONE sample and its label. Now we have to create the dataloader, which "wraps" around this class and can then return whole batches of samples from your dataset class. Let's create three dataloaders: one which iterates over the train set, one for the test set and one for the validation set:

dataset = YourImageDataset("/path/to/image/folder")

# let's split the dataset into three parts (train 70%, test 15%, validation 15%)
test_size = 0.15
val_size = 0.15

test_amount, val_amount = int(len(dataset) * test_size), int(len(dataset) * val_size)

# this function will automatically randomly split your dataset,
# but you could also implement the split yourself
train_set, val_set, test_set = torch.utils.data.random_split(dataset, [
    len(dataset) - (test_amount + val_amount),
    test_amount,
    val_amount,
])


B = 128  # B is your batch size, e.g. 128

train_dataloader = torch.utils.data.DataLoader(
    train_set,
    batch_size=B,
    shuffle=True,
)
val_dataloader = torch.utils.data.DataLoader(
    val_set,
    batch_size=B,
    shuffle=True,
)
test_dataloader = torch.utils.data.DataLoader(
    test_set,
    batch_size=B,
    shuffle=True,
)


Now you have created your dataloaders and are ready to train! For example, like this:


for epoch in range(epochs):

    for images, targets in train_dataloader:
        # now 'images' is a batch containing B samples,
        # and 'targets' is a batch containing the B targets
        # (of the images in 'images', with the same index)

        optimizer.zero_grad()
        images, targets = images.cuda(), targets.cuda()
        model.train()
        predictions = model(images)

        . . .
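If it helps, here is a minimal sketch of what the elided part of the loop might look like. The model, the CrossEntropyLoss criterion and the SGD optimizer are my own assumptions for illustration, not part of the original answer, and it assumes a CUDA GPU is available:

import torch
import torch.nn as nn

# assumed setup, purely for illustration
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 512 * 512, 10)).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
epochs = 10

for epoch in range(epochs):
    model.train()
    for images, targets in train_dataloader:
        images, targets = images.cuda(), targets.cuda()

        optimizer.zero_grad()
        predictions = model(images)
        loss = criterion(predictions, targets)
        loss.backward()   # compute gradients
        optimizer.step()  # update the weights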


Normally you would create a separate file for the "YourImageDataset" class and then import it into the file in which you want to create the dataloaders. I hope I could make clear what the roles of the dataloader and the Dataset class are and how to use them!
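For example, assuming you saved the class in a file called your_image_dataset.py (the file name here is just an illustration):

from your_image_dataset import YourImageDataset

dataset = YourImageDataset("/path/to/image/folder")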


Edit, to answer the question in the comments: I don't know much about iter-style datasets, but from what I understood: the method I showed you above is the map-style. You use that if your dataset is stored in a csv, json or whatever kind of file, so that you can index into any row or entry of the dataset. Iter-style takes your dataset, or a part of it, and converts it into an iterable. For example, if your dataset is a list, this is what an iterable of the list would look like:

dataset = [1, 2, 3, 4]
dataset = iter(dataset)

print(next(dataset))
print(next(dataset))
print(next(dataset))
print(next(dataset))

# output:
# >>> 1
# >>> 2
# >>> 3
# >>> 4


So next will give you the next item of the list. Using this together with a PyTorch dataloader is probably more efficient and faster. Normally the map-style dataloader is fast enough and the common choice, but the documentation suggests that when you are loading data batches from a database (which can be slower), an iter-style dataset would be more efficient. This explanation of iter-style is a bit vague, but I hope it makes clear what I understood. I would recommend you use the map-style first, as I explained in my original answer :)
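To make the iter-style a little less vague, here is a minimal sketch using torch.utils.data.IterableDataset; the "stream" is faked with a plain list and is purely my own illustration, standing in for something like a database cursor where random access is expensive or impossible:

import torch
from torch.utils.data import DataLoader, IterableDataset


class StreamDataset(IterableDataset):
    """Yields samples one by one, e.g. from a database cursor or a socket."""

    def __init__(self, source):
        self.source = source  # here just a list, standing in for a real stream

    def __iter__(self):
        # random access into the stream is not possible; we can only
        # hand out the next sample, which is why there is no __getitem__
        for sample in self.source:
            yield torch.tensor(sample)


stream = StreamDataset([1, 2, 3, 4, 5, 6])
loader = DataLoader(stream, batch_size=2)  # shuffle=True is not allowed here

for batch in loader:
    print(batch)
# >>> tensor([1, 2])
# >>> tensor([3, 4])
# >>> tensor([5, 6])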

