How to customize pytorch data


Problem description

I am trying to make a customized DataLoader using PyTorch.

I've seen some code like this (class definition omitted, sorry):

def __init__(self, data_root, transform=None, training=True, return_id=False):
    super().__init__()
    self.mode = 'train' if training else 'test'

    self.data_root = Path(data_root)
    csv_fname = 'train.csv' if training else 'sample_submission.csv'
    self.csv_file = pd.read_csv(self.data_root / csv_fname)
    self.transform = transform
    self.return_id = return_id
def __getitem__(self, index):
    """ TODO
    """
def __len__(self):
    """ TODO
    """

The problem here is that the data I've dealt with before had all the training data in one csv file and all the testing data in another csv file, so there were 2 csv files in total for training and testing. (For example, as in MNIST, the last column is the label and all the previous columns are the features.)
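For that single-csv case, the whole file fits in memory, so the Dataset can simply load everything up front. A minimal sketch, assuming the MNIST-like layout described above (every column is a feature except the last one, which is the label); the class name and file path are only illustrative:

import pandas as pd
import torch
from torch.utils import data

class SingleCSVDataset(data.Dataset):
    """Loads one MNIST-style csv (features..., label) entirely into RAM."""
    def __init__(self, csv_path):
        super().__init__()
        df = pd.read_csv(csv_path)
        # all columns except the last are features, the last column is the label
        self.features = torch.tensor(df.iloc[:, :-1].values, dtype=torch.float32)
        self.labels = torch.tensor(df.iloc[:, -1].values, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        return self.features[index], self.labels[index]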

However, the problem I've been facing is that I have very many (about 200,000) csv files for training, each one smaller than the 60,000-sample MNIST but still quite big. All these csv files contain different numbers of rows.

To inherit from torch.utils.data, how can I write a customized class? The MNIST dataset is quite small, so it can be loaded into RAM at once. However, the data I'm dealing with is very big, so I need some help.

Any ideas? Thank you in advance.

Recommended answer

First, you want to customize (overload) data.Dataset and not data.DataLoader, which is perfectly fine for your use case.

What you can do, instead of loading all the data into RAM, is to read and store "meta data" in __init__ and read the one relevant csv file whenever you need to __getitem__ a specific entry.
A pseudo-code of your Dataset will look something like:

class ManyCSVsDataset(data.Dataset):
  def __init__(self, ...):
    super(ManyCSVsDataset, self).__init__()
    # store the paths for all csvs and the number of items in each one
    self.metadata = ... 
    self.num_items = total_number_of_items

  def __len__(self):
    return self.num_items

  def __getitem__(self, index):
    # based on the index, use self.metadata to determine what csv file to open
    with open(relevant_csv_file, 'r') as R:
      # read from R the specific line matching item index
      item = ...  # parse that line into a sample
    return item
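A more concrete sketch of the same idea, assuming every csv file has the same MNIST-like layout (features in all columns but the last, label in the last) and a single header row; the cumulative-row-count index with a bisect lookup is one possible way to map a global index to a (file, row) pair:

import bisect
from itertools import accumulate
from pathlib import Path

import pandas as pd
import torch
from torch.utils import data

class ManyCSVsDataset(data.Dataset):
    def __init__(self, data_root):
        super().__init__()
        self.files = sorted(Path(data_root).glob('*.csv'))
        # count the data rows in each file once, up front
        counts = []
        for f in self.files:
            with open(f) as fh:
                counts.append(sum(1 for _ in fh) - 1)  # -1 for the header row
        # cumulative counts map a global index to (file, local row)
        self.cumulative = list(accumulate(counts))
        self.num_items = self.cumulative[-1] if self.cumulative else 0

    def __len__(self):
        return self.num_items

    def __getitem__(self, index):
        file_idx = bisect.bisect_right(self.cumulative, index)
        row = index - (self.cumulative[file_idx - 1] if file_idx > 0 else 0)
        # read only the one row we need: keep the header, skip earlier data rows
        df = pd.read_csv(self.files[file_idx], skiprows=range(1, row + 1), nrows=1)
        features = torch.tensor(df.iloc[0, :-1].values, dtype=torch.float32)
        label = torch.tensor(int(df.iloc[0, -1]))
        return features, label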

This implementation is not efficient in the sense that it reads the same csv file over and over and does not cache anything. On the other hand, you can take advantage of data.DataLoader's multiprocessing support to have many parallel sub-processes doing all this file access in the background while you actually use the data for training.
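For example, wrapping the dataset above in a DataLoader with a few worker processes (the path, batch size, and worker count here are placeholders) keeps those per-item file reads off the main training loop:

from torch.utils.data import DataLoader

dataset = ManyCSVsDataset('path/to/csv_folder')  # placeholder path
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

for features, labels in loader:
    pass  # replace with the actual training step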
