Cross-validation for the MNIST dataset with PyTorch and sklearn
Question
I am new to PyTorch and am trying to implement a feed-forward neural network to classify the MNIST data set. I have some problems when trying to use cross-validation. My data has the following shapes:
x_train: torch.Size([45000, 784])
y_train: torch.Size([45000])
I tried to use KFold from sklearn.
from sklearn.model_selection import KFold
kfold = KFold(n_splits=10)
Here is the first part of my train method, where I divide the data into folds:
for train_index, test_index in kfold.split(x_train, y_train):
    x_train_fold = x_train[train_index]
    x_test_fold = x_test[test_index]
    y_train_fold = y_train[train_index]
    y_test_fold = y_test[test_index]
    print(x_train_fold.shape)
    for epoch in range(epochs):
        ...
The indices for the y_train_fold variable are right; they are simply [ 0 1 2 ... 4497 4498 4499]. But they are not right for x_train_fold, whose indices are [ 4500 4501 4502 ... 44997 44998 44999]. The same goes for the test folds.
For the first iteration I want the variable x_train_fold to be the first 4500 pictures, in other words to have the shape torch.Size([4500, 784]), but it has the shape torch.Size([40500, 784]).
Any tips on how to get this right?
Recommended answer
I think you're confused!
Ignore the second dimension for a moment. When you have 45000 points and use 10-fold cross-validation, what is the size of each fold? 45000/10, i.e. 4500.
This means that each fold contains 4500 data points. One fold is used for testing and the remaining nine for training, i.e.
For testing: one fold => 4500 data points => size: 4500
For training: remaining folds => 45000-4500 data points => size: 45000-4500 = 40500
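The arithmetic above can be checked directly against KFold. This is a minimal sketch, assuming only that there are 45000 samples as in the question; the dummy array and the names n_samples and fold_sizes are made up for illustration, since KFold only needs to know how many samples there are.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 45000            # number of training points, as in the question
kfold = KFold(n_splits=10)   # default: no shuffling

# KFold only looks at the number of rows, so a dummy array is enough here.
dummy = np.zeros(n_samples)

# Collect (train size, test size) for each of the 10 splits.
fold_sizes = [
    (len(train_idx), len(test_idx))
    for train_idx, test_idx in kfold.split(dummy)
]

for i, (n_train, n_test) in enumerate(fold_sizes):
    print(f"fold {i}: train={n_train}, test={n_test}")
```

Every split should report 40500 training points and 4500 test points, since 45000 divides evenly by 10.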
Thus, for the first iteration, the first 4500 data points (indices 0 to 4499) are used for testing and the rest for training. (Check the image below.)
Given that your data is x_train: torch.Size([45000, 784]) and y_train: torch.Size([45000]), this is what your code should look like:
for train_index, test_index in kfold.split(x_train, y_train):
    print(train_index, test_index)
    x_train_fold = x_train[train_index]
    y_train_fold = y_train[train_index]
    # The held-out fold is also sliced from x_train / y_train, not from x_test / y_test.
    x_test_fold = x_train[test_index]
    y_test_fold = y_train[test_index]
    print(x_train_fold.shape, y_train_fold.shape)
    print(x_test_fold.shape, y_test_fold.shape)
    break  # stop after the first fold; this run only inspects the shapes
[ 4500 4501 4502 ... 44997 44998 44999] [ 0 1 2 ... 4497 4498 4499]
torch.Size([40500, 784]) torch.Size([40500])
torch.Size([4500, 784]) torch.Size([4500])
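To see how this split fits into the rest of the training method, here is a rough, self-contained sketch of a full cross-validation loop. It is not the asker's actual code: the random stand-in data (1000 rows instead of 45000), the small network, the SGD learning rate, the value of epochs and the full-batch update are all assumptions chosen to keep the example short and fast.

```python
import torch
from torch import nn
from sklearn.model_selection import KFold

torch.manual_seed(0)

# Stand-in data (assumption): random tensors shaped like flattened MNIST rows.
x_train = torch.randn(1000, 784)
y_train = torch.randint(0, 10, (1000,))

kfold = KFold(n_splits=10)
epochs = 2  # hypothetical value
fold_accuracies = []

for fold, (train_index, test_index) in enumerate(kfold.split(x_train, y_train)):
    # Training part and held-out part both come from x_train / y_train.
    x_train_fold, y_train_fold = x_train[train_index], y_train[train_index]
    x_test_fold, y_test_fold = x_train[test_index], y_train[test_index]

    # Fresh model and optimizer per fold, so no weights carry over between folds.
    model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        optimizer.zero_grad()
        # One full-batch gradient step per epoch, for brevity.
        loss = loss_fn(model(x_train_fold), y_train_fold)
        loss.backward()
        optimizer.step()

    # Evaluate on the held-out fold without tracking gradients.
    with torch.no_grad():
        predictions = model(x_test_fold).argmax(dim=1)
        accuracy = (predictions == y_test_fold).float().mean().item()
    fold_accuracies.append(accuracy)

print("mean accuracy over folds:", sum(fold_accuracies) / len(fold_accuracies))
```

With random labels the accuracy is meaningless; the point is only the structure: split, slice both parts from the training tensors, train on the larger part, evaluate on the held-out fold, then average over folds. In practice one would probably iterate over mini-batches inside the epoch loop instead of taking a single full-batch step.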
So, when you say
"I want the variable x_train_fold to be the first 4500 pictures... shape torch.Size([4500, 784])."
you're wrong. This size corresponds to x_test_fold. In the first iteration, with 10 folds, x_train_fold will have 40500 points, so its size is supposed to be torch.Size([40500, 784]).
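As a sanity check on which indices land where, the sketch below lists the first split and confirms that, over all 10 splits, every sample is held out exactly once. It assumes 45000 samples and the default, unshuffled KFold; the names splits, first_train, first_test and all_test are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 45000
kfold = KFold(n_splits=10)  # default: no shuffling, so folds are contiguous blocks

# Materialise all (train indices, test indices) pairs.
splits = list(kfold.split(np.zeros(n_samples)))

# First iteration: the test fold is the first block of indices.
first_train, first_test = splits[0]
print("test :", first_test[0], "...", first_test[-1])
print("train:", first_train[0], "...", first_train[-1])

# Across all iterations, the test folds together cover each index exactly once.
all_test = np.concatenate([test_idx for _, test_idx in splits])
covers_everything = np.array_equal(np.sort(all_test), np.arange(n_samples))
print("every sample held out exactly once:", covers_everything)
```

So the block of 4500 indices starting at 0 that the question expected for x_train_fold is the held-out part of the first iteration, and each later iteration holds out the next block of 4500.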