How to split and load a huge dataset that doesn't fit into memory into a PyTorch DataLoader?

Question

I'm training a deep learning model to do multi-label classification of diseases in NIH's Chest Xray-14 dataset using Google's Colab. With around 112k training examples and limited RAM, I can't load all the images into the DataLoader at once.

Is there a way to store only the paths of the images in PyTorch's DataLoader, read just the images needed for the current iteration during training, and unload them from memory once the iteration is complete, and so on until one epoch is finished?

Solution

Yes. The default behavior of ImageFolder is to create a list of image paths and load the actual images only when needed. It doesn't support multi-label targets, however; you can write your own Dataset that does, using the ImageFolder class as a reference.
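
For the single-label case, a minimal sketch of that lazy-loading behavior might look like this (the data/train directory layout and the ToTensor transform are just illustrative placeholders):

from torchvision import datasets, transforms

# Hypothetical layout: data/train/<class_name>/<image>.png
folder_ds = datasets.ImageFolder(
    root="data/train",
    transform=transforms.ToTensor(),  # applied only when a sample is requested
)
img, label = folder_ds[0]  # only this single image is read from disk and decoded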

During __init__ you construct a list of image paths and a corresponding list of labels. Images should be loaded only when __getitem__ is invoked. Below is a stub of such a dataset class; the details will depend on how your files are organized, the image types, and the label format.

import torch


class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, args):
        """ Construct an indexed list of image paths and labels """

    def __getitem__(self, n):
        """ Load image n in the list of image paths and return it along with its label.
            In the multi-label case the label will probably be a list of values """

    def __len__(self):
        """ Return the total number of images in this dataset """

Once you've created a valid dataset instance, create a DataLoader, passing your dataset as an argument. A DataLoader is responsible for sampling from its dataset, i.e. invoking the __getitem__ method you wrote, and collating individual samples into mini-batches. It also handles parallelized loading and defines how the indices are sampled. The DataLoader itself doesn't store more than it needs: the maximum number of samples it should hold in memory at any time is batch_size * num_workers (or batch_size if num_workers == 0).
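
As a rough usage sketch (the batch size, worker count, and the dataset variable are illustrative, not prescriptive):

from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,          # e.g. the CustomDataset / ChestXrayDataset instance from above
    batch_size=32,    # samples collated into each mini-batch
    shuffle=True,     # reshuffle the indices every epoch
    num_workers=2,    # worker processes that call __getitem__ in parallel
)

for images, labels in loader:  # one mini-batch of decoded images at a time
    ...                        # forward / backward pass goes here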
