How to create a custom PyTorch dataset when the order and the total number of training samples is not known in advance?


Problem description


I have a 42 GB jsonl file. Every element of this file is a json object. I create training samples from every json object, but the number of training samples I extract from each json object can vary between 0 and 5. What is the best way to create a custom PyTorch dataset without reading the entire jsonl file into memory?


This is the dataset I am talking about - Google Natural Questions.

Recommended answer

You have a few options.


  1. The simplest option, if having many small files is acceptable, is to preprocess each json object into one or more single-sample files. Then you can read exactly the file for the requested index. For example:



    import numpy as np
    from torch.utils.data import Dataset

    class SingleFileDataset(Dataset):
        def __init__(self, list_of_file_paths):
            self.list_of_file_paths = list_of_file_paths

        def __len__(self):
            return len(self.list_of_file_paths)

        def __getitem__(self, index):
            return np.load(self.list_of_file_paths[index])  # or equivalent single-file reader

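For the preprocessing step itself, here is a minimal sketch of streaming the jsonl line by line; the `extract_samples` function (assumed to return 0 to 5 numpy arrays per json object) and the output directory are hypothetical, not part of the original answer:

    import json
    import numpy as np

    def preprocess(jsonl_path, out_dir, extract_samples):
        # Stream the 42 GB file one line at a time; only the current json
        # object and the list of output paths are ever held in memory.
        file_paths = []
        with open(jsonl_path) as f:
            for line in f:
                obj = json.loads(line)
                for sample in extract_samples(obj):  # yields 0 to 5 arrays
                    path = f"{out_dir}/sample_{len(file_paths)}.npy"
                    np.save(path, sample)
                    file_paths.append(path)
        return file_paths  # pass this list to SingleFileDataset

Because the number of samples per json object varies, the total length of the dataset is simply the number of files written.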



  2. You can also split the data into a constant number of files, and then calculate, given the index, which file the sample resides in. Then you need to open that file and read the appropriate index. This gives a trade-off between disk access and memory usage. Assume you have n samples and we split them evenly into c files during preprocessing, so each file holds n/c samples. To read the sample at index i we would do the following (a usage sketch follows the code):



    import numpy as np
    from torch.utils.data import Dataset

    class SplitIntoFilesDataset(Dataset):
        def __init__(self, list_of_file_paths, samples_per_file):
            self.list_of_file_paths = list_of_file_paths
            self.samples_per_file = samples_per_file

        def __len__(self):
            return len(self.list_of_file_paths) * self.samples_per_file

        def __getitem__(self, index):
            # index // samples_per_file selects the file, and
            # index % samples_per_file is the position within that file
            file_to_load = self.list_of_file_paths[index // self.samples_per_file]
            data = np.load(file_to_load)
            return data[index % self.samples_per_file]
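As a usage sketch (the paths and counts below are illustrative assumptions, not from the original answer), assuming preprocessing wrote equally sized chunk files:

    from torch.utils.data import DataLoader

    # Hypothetical: preprocessing wrote 100 chunk files of 1000 samples each
    paths = [f"data/chunk_{i}.npy" for i in range(100)]
    dataset = SplitIntoFilesDataset(paths, samples_per_file=1000)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)

Note that with random shuffling each sample load opens a whole chunk file, which is exactly the disk-access side of the trade-off mentioned above.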





  3. Finally, you could use an HDF5 file, which allows access to rows on disk. This is possibly the best solution if you have a lot of data, since the data will be close together on disk. There's an implementation here, which I have copy-pasted below:

    import h5py
    import torch
    import torch.utils.data as data

    class H5Dataset(data.Dataset):
        def __init__(self, file_path):
            super(H5Dataset, self).__init__()
            h5_file = h5py.File(file_path, 'r')  # open read-only
            self.data = h5_file.get('data')
            self.target = h5_file.get('label')

        def __getitem__(self, index):
            return (torch.from_numpy(self.data[index, :, :, :]).float(),
                    torch.from_numpy(self.target[index, :, :, :]).float())

        def __len__(self):
            return self.data.shape[0]
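For completeness, a minimal sketch of how such an HDF5 file might be written during preprocessing. The dataset names 'data' and 'label' match the class above; the (3, 32, 32) sample shape and the `stream_samples` generator are hypothetical:

    import h5py

    # Resizable datasets, since the total number of samples
    # is not known in advance; shapes are illustrative.
    with h5py.File('train.h5', 'w') as f:
        x = f.create_dataset('data', shape=(0, 3, 32, 32),
                             maxshape=(None, 3, 32, 32), dtype='float32')
        y = f.create_dataset('label', shape=(0, 3, 32, 32),
                             maxshape=(None, 3, 32, 32), dtype='float32')
        for sample, label in stream_samples():  # hypothetical generator of numpy pairs
            x.resize(x.shape[0] + 1, axis=0)
            y.resize(y.shape[0] + 1, axis=0)
            x[-1] = sample
            y[-1] = label

One caveat worth knowing: an h5py file handle opened in `__init__` does not always survive being forked into DataLoader worker processes, so with `num_workers > 0` it is common to open the file lazily inside `__getitem__` instead.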

