Loading a huge dataset batch-wise to train pytorch


Problem description


I am training an LSTM in order to classify time-series data into 2 classes (0 and 1). I have a huge dataset on the drive, where the 0-class and the 1-class data are located in different folders. I am trying to train the LSTM batch-wise by creating a Dataset class and wrapping a DataLoader around it. I have to do pre-processing such as reshaping. Here's my code which does that:


class LoadingDataset(Dataset):
    def __init__(self, data_root1, data_root2, file_name):
        self.data_root1 = data_root1   # path for the class-1 data
        self.data_root2 = data_root2   # path for the class-0 data
        self.fileap1 = pd.DataFrame()  # stores class-1 data
        self.fileap0 = pd.DataFrame()  # stores class-0 data
        self.file_name = file_name     # list of all the files at data_root1 and data_root2
        self.labs1 = None              # will store the class-1 labels
        self.labs0 = None              # will store the class-0 labels

    def __len__(self):
        return len(self.fileap1)

    def __getitem__(self, index):
        self.fileap1 = pd.read_csv(self.data_root1 + self.file_name[index], header=None)  # read the csv file for class 1
        self.fileap1 = self.fileap1.iloc[1:, 1:].values.reshape(-1, WINDOW + 1, 1)        # reshape the file for the lstm
        self.fileap0 = pd.read_csv(self.data_root2 + self.file_name[index], header=None)  # read the csv file for class 0
        self.fileap0 = self.fileap0.iloc[1:, 1:].values.reshape(-1, WINDOW + 1, 1)        # reshape the file for the lstm
        self.labs1 = np.array([1] * len(self.fileap1)).reshape(-1, 1)  # create the 1-labels for the csv file
        self.labs0 = np.array([0] * len(self.fileap0)).reshape(-1, 1)  # create the 0-labels for the csv file
        self.fileap1 = np.append(self.fileap1, self.fileap0, axis=0)   # combine the class-0 and class-1 data
        self.fileap1 = torch.from_numpy(self.fileap1).float()
        self.labs1 = np.append(self.labs1, self.labs0, axis=0)         # combine the 0-labels and the 1-labels
        self.labs1 = torch.from_numpy(self.labs1).int()

        return self.fileap1, self.labs1

data_root1 = '/content/gdrive/My Drive/Data/Processed_Data/Folder1/One_'   # location of class-1 data
data_root2 = '/content/gdrive/My Drive/Data/Processed_Data/Folder0/Zero_'  # location of class-0 data
training_set = LoadingDataset(data_root1, data_root2, train_ind)  # train_ind is a list of file names to read from data_root1 and data_root2
training_generator = DataLoader(training_set, batch_size=2, num_workers=4)

for epoch in range(num_epochs):
    model.train()  # set the model back to train mode for the next epoch once that epoch's evaluation is finished
    for i, (inputs, targets) in enumerate(training_generator):
    .
    .
    .
    .


I get this error when I run this code:


RuntimeError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 68, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 68, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 43, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 96596 and 25060 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:711


My questions are: 1. Have I implemented this correctly? Is this how you pre-process and then train a dataset batch-wise?


2. The batch_size of the DataLoader and the batch_size of the LSTM are different, since the batch_size of the DataLoader refers to the number of files, whereas the batch_size of the LSTM model refers to the number of instances. Will I get another error because of this?


3. I have no idea how to scale this dataset, since the MinMaxScaler has to be applied to the dataset in its entirety.


Responses are appreciated. Please let me know if I have to create separate posts for each question.

Thank you.

Answer


Here's a summary of how pytorch does things:

  • You have a dataset, which is an object with a __len__ method and a __getitem__ method.
  • You create a dataloader from that dataset and a collate_fn.
  • You iterate through the dataloader and pass a batch of data to your model (a minimal sketch of this pattern follows below).
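
For reference, here is a minimal, self-contained sketch of that pattern; the dataset, shapes, and sizes below are made up purely for illustration:

import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical dataset: every item is one fixed-size window plus a label."""
    def __init__(self, n_items=8, window=5):
        self.data = torch.randn(n_items, window, 1)      # (items, seq_len, features)
        self.labels = torch.randint(0, 2, (n_items, 1))  # one binary label per item

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

loader = DataLoader(ToyDataset(), batch_size=4)  # default collate_fn stacks the items
for x, y in loader:
    print(x.shape, y.shape)  # torch.Size([4, 5, 1]) torch.Size([4, 1])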


So basically your training loop will look like:

for x, y in dataloader:
    output = model(x)
    ...

or

for x, y in dataloader:
    output = model(*x)
    ...


if your model's forward method takes multiple arguments.
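
As a hypothetical sketch of what that can look like for variable-length sequences: a collate_fn that pads the items to a common length and also returns their original lengths, bundled as a tuple so the loop above can unpack it with model(*x). The function name and signature here are illustrative, not from the original post:

import torch
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch):
    # Hypothetical collate_fn: pad variable-length items to the longest
    # sequence in the batch and keep the original lengths around.
    xs, ys = zip(*batch)
    lengths = torch.tensor([len(x) for x in xs])
    padded = pad_sequence(xs, batch_first=True)  # (batch, max_len, features)
    return (padded, lengths), torch.stack(ys)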


So how does this work? Basically you have a generator of batch indices, batch_sampler, and here's roughly what the loop inside your dataloader does:

for indices in batch_sampler:
    yield collate_fn([dataset[i] for i in indices])
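
The default collate_fn eventually calls torch.stack on the list of per-item tensors, and torch.stack requires all of them to have exactly the same shape. That is exactly what the traceback above is complaining about: each of your __getitem__ calls returns a tensor built from a whole CSV file, and two files produced 96596 and 25060 windows respectively. A tiny demonstration (shapes made up, assuming WINDOW + 1 = 6):

import torch

a = torch.zeros(96596, 6, 1)  # item built from one csv file
b = torch.zeros(25060, 6, 1)  # item built from a smaller csv file
torch.stack([a, b], 0)        # RuntimeError: sizes of tensors must match except in dimension 0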


So if you want everything to work well, you must look at the forward method of your model and see how many arguments it takes (in my experience, the forward method of an LSTM can take multiple arguments), and make sure that you use a collate_fn to pass those correctly.
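
In your particular case, though, the mismatch comes from items of different lengths rather than from multiple forward arguments, so one possible fix (a sketch, not the only option) is a custom collate_fn that concatenates the per-file tensors along the instance dimension instead of stacking them, since each file already contains many instances:

import torch
from torch.utils.data import DataLoader

def concat_collate(batch):
    # Sketch of a collate_fn for this dataset: every item is (features, labels)
    # built from one whole csv file, so concatenate along dimension 0
    # instead of stacking (which would require equal sizes).
    xs, ys = zip(*batch)
    return torch.cat(xs, dim=0), torch.cat(ys, dim=0)

# training_set is the LoadingDataset instance from the question
training_generator = DataLoader(training_set, batch_size=2, num_workers=4,
                                collate_fn=concat_collate)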

