Loading a huge dataset batch-wise to train PyTorch
Question
I am training an LSTM in order to classify time-series data into 2 classes (0 and 1). I have a huge dataset on the drive, where the 0-class and the 1-class data are located in different folders. I am trying to train the LSTM batch-wise by creating a Dataset class and wrapping a DataLoader around it. I have to do pre-processing such as reshaping. Here's my code which does that:
```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class LoadingDataset(Dataset):
    def __init__(self, data_root1, data_root2, file_name):
        self.data_root1 = data_root1  # Has the path for class1 data
        self.data_root2 = data_root2  # Has the path for class0 data
        self.fileap1 = pd.DataFrame()  # Stores class 1 data
        self.fileap0 = pd.DataFrame()  # Stores class 0 data
        self.file_name = file_name  # List of all the files at data_root1 and data_root2
        self.labs1 = None  # Will store the class 1 labels
        self.labs0 = None  # Will store the class 0 labels

    def __len__(self):
        return len(self.fileap1)

    def __getitem__(self, index):
        self.fileap1 = pd.read_csv(self.data_root1 + self.file_name[index], header=None)  # read the csv file for class 1
        self.fileap1 = self.fileap1.iloc[1:, 1:].values.reshape(-1, WINDOW + 1, 1)  # reshape the file for lstm
        self.fileap0 = pd.read_csv(self.data_root2 + self.file_name[index], header=None)  # read the csv file for class 0
        self.fileap0 = self.fileap0.iloc[1:, 1:].values.reshape(-1, WINDOW + 1, 1)  # reshape the file for lstm
        self.labs1 = np.array([1] * len(self.fileap1)).reshape(-1, 1)  # create the labels 1 for the csv file
        self.labs0 = np.array([0] * len(self.fileap0)).reshape(-1, 1)  # create the labels 0 for the csv file
        # print(self.fileap1.shape,' ',self.fileap0.shape)
        # print(self.labs1.shape,' ',self.labs0.shape)
        self.fileap1 = np.append(self.fileap1, self.fileap0, axis=0)  # combine the class 0 and class one data
        self.fileap1 = torch.from_numpy(self.fileap1).float()
        self.labs1 = np.append(self.labs1, self.labs0, axis=0)  # combine the label0 and label 1 data
        self.labs1 = torch.from_numpy(self.labs1).int()
        # print(self.fileap1.shape,' ',self.fileap0.shape)
        # print(self.labs1.shape,' ',self.labs0.shape)
        return self.fileap1, self.labs1


data_root1 = '/content/gdrive/My Drive/Data/Processed_Data/Folder1/One_'  # location of class 1 data
data_root2 = '/content/gdrive/My Drive/Data/Processed_Data/Folder0/Zero_'  # location of class 0 data
training_set = LoadingDataset(data_root1, data_root2, train_ind)  # train_ind is a list of file names that have to be read from data_root1 and data_root2
training_generator = DataLoader(training_set, batch_size=2, num_workers=4)

for epoch in range(num_epochs):
    model.train()  # Setting the model to train mode after eval mode to train for next epoch once the testing for that epoch is finished
    for i, (inputs, targets) in enumerate(train_loader):
        .
        .
        .
        .
```

I get this error when I run this code:
```
RuntimeError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 68, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 68, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 43, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 96596 and 25060 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:711
```
My questions are:
1. Have I implemented this correctly? Is this how you pre-process and then train on a dataset batch-wise?
2. The batch_size of the DataLoader and the batch_size of the LSTM are different: the DataLoader's batch_size refers to the number of files, whereas the LSTM model's batch_size refers to the number of instances. Will I get another error here?
3. I have no idea how to scale this dataset, since the MinMaxScaler has to be applied to the dataset in its entirety.
Responses are appreciated. Please let me know if I have to create separate posts for each question.
Thanks.
Recommended answer
Here's a summary of how PyTorch does things:
- You have a dataset, that is, an object with a __len__ method and a __getitem__ method.
- You create a dataloader from that dataset and a collate_fn.
- You iterate through the dataloader and pass a batch of data to your model.
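The three pieces above can be sketched as follows. This is a minimal illustration, not the asker's dataset: `ToyDataset`, its sizes, and this particular `collate_fn` are hypothetical names and values chosen only to show how the pieces connect.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical dataset: item i is a (features, label) pair."""

    def __init__(self, n_items=10, n_features=3):
        self.x = torch.arange(n_items * n_features, dtype=torch.float32).reshape(n_items, n_features)
        self.y = torch.arange(n_items) % 2  # alternating 0/1 labels

    def __len__(self):
        # Number of items the sampler is allowed to index.
        return self.x.shape[0]

    def __getitem__(self, index):
        # Return a single sample; batching is the dataloader's job.
        return self.x[index], self.y[index]


def collate_fn(samples):
    # samples is a list of (features, label) tuples produced by __getitem__.
    xs, ys = zip(*samples)
    return torch.stack(xs), torch.stack(ys)


dataset = ToyDataset()
dataloader = DataLoader(dataset, batch_size=4, collate_fn=collate_fn)

# Collect the shape of every batch the dataloader yields.
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in dataloader]
```

With 10 items and a batch size of 4, the last batch is smaller than the others, because `drop_last` defaults to `False`.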
So basically your training loop will look like

```python
for x, y in dataloader:
    output = model(x)
    ...
```

or

```python
for x, y in dataloader:
    output = model(*x)
    ...
```

if your model's forward method takes multiple arguments.
So how does this work?
Basically you have a generator of batch indices, batch_sampler, and here is what looping through your dataloader acts like:

```python
for indices in batch_sampler:
    yield collate_fn([dataset[i] for i in indices])
```
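To make that loop concrete, here is a small sketch that writes it out by hand and compares it with a real `DataLoader`. The tiny `TensorDataset` and the helper name `manual_loader` are assumptions made for illustration. It uses the default collate function and a sequential sampler, which I believe matches what `DataLoader` does when `shuffle=False` and no `collate_fn` is given.

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler, TensorDataset
from torch.utils.data.dataloader import default_collate

# Hypothetical toy data: 5 samples with 2 features each, plus one label per sample.
dataset = TensorDataset(torch.arange(10.0).reshape(5, 2), torch.tensor([0, 1, 0, 1, 0]))

# Yields lists of indices: [0, 1], [2, 3], [4]
batch_sampler = BatchSampler(SequentialSampler(dataset), batch_size=2, drop_last=False)


def manual_loader(dataset, batch_sampler, collate_fn):
    # Hand-written version of the loop: fetch each indexed item, then collate the list.
    for indices in batch_sampler:
        yield collate_fn([dataset[i] for i in indices])


manual = list(manual_loader(dataset, batch_sampler, default_collate))
builtin = list(DataLoader(dataset, batch_size=2, shuffle=False))

# Compare the two batch by batch.
same = len(manual) == len(builtin) and all(
    torch.equal(mx, bx) and torch.equal(my, by)
    for (mx, my), (bx, by) in zip(manual, builtin)
)
```

Real dataloaders add worker processes, pinned memory, and prefetching on top of this, but the data flow stays the same: indices, then items, then one collated batch.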
So if you want everything to work well, look at the forward method of your model and see how many arguments it takes (in my experience, the forward method of an LSTM can take multiple arguments), and make sure that you use a collate_fn to pass those correctly.
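For the error in the question, one possible `collate_fn` is sketched below. The default collate function stacks the items of a batch with `torch.stack`, which requires every item to have the same shape. In the question each item is a whole file with its own number of instances, which is why the traceback reports 96596 and 25060 in dimension 1. Concatenating along dimension 0 avoids that requirement. `FakeFileDataset`, `concat_collate`, the value of `WINDOW`, and the random data are all hypothetical stand-ins for the asker's CSV files.

```python
import torch
from torch.utils.data import Dataset, DataLoader

WINDOW = 4  # hypothetical window length; the question does not show its value


class FakeFileDataset(Dataset):
    """Stand-in for the question's dataset: each "file" holds a different number of windows."""

    def __init__(self, windows_per_file):
        self.windows_per_file = windows_per_file

    def __len__(self):
        # One item per file.
        return len(self.windows_per_file)

    def __getitem__(self, index):
        n = self.windows_per_file[index]
        x = torch.randn(n, WINDOW + 1, 1)  # (instances, sequence length, features)
        y = torch.randint(0, 2, (n, 1))    # one label per instance
        return x, y


def concat_collate(samples):
    # Join the files along dimension 0 instead of stacking them,
    # so files with different numbers of instances can share a batch.
    xs, ys = zip(*samples)
    return torch.cat(xs, dim=0), torch.cat(ys, dim=0)


loader = DataLoader(FakeFileDataset([7, 3, 5]), batch_size=2, collate_fn=concat_collate)
shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
```

With this approach the DataLoader's `batch_size` still counts files, while the first dimension of each collated tensor is the number of instances the model sees, so that number varies from batch to batch.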