Cross validation for MNIST dataset with pytorch and sklearn

Views: 236 | Posted: 2022/1/6 19:49:15 | Tags: scikit-learn pytorch cross-validation mnist k-fold


Question


I am new to pytorch and am trying to implement a feed-forward neural network to classify the MNIST data set. I have some problems when trying to use cross-validation. My data has the following shapes: x_train: torch.Size([45000, 784]) and y_train: torch.Size([45000])

I tried to use KFold from sklearn.

from sklearn.model_selection import KFold

kfold = KFold(n_splits=10)

Here is the first part of my train method where I'm dividing the data into folds:

for train_index, test_index in kfold.split(x_train, y_train):
    x_train_fold = x_train[train_index]
    x_test_fold = x_test[test_index]
    y_train_fold = y_train[train_index]
    y_test_fold = y_test[test_index]
    print(x_train_fold.shape)
    for epoch in range(epochs):
        ...

The indices for the y_train_fold variable are right; they are simply [ 0 1 2 ... 4497 4498 4499]. The indices for x_train_fold are not; they are [ 4500 4501 4502 ... 44997 44998 44999]. The same goes for the test folds.

For the first iteration I want the variable x_train_fold to be the first 4500 pictures, in other words to have the shape torch.Size([4500, 784]), but it has the shape torch.Size([40500, 784]).

Any tips on how to get this right?

Answer

I think you're confused!

Ignore the second dimension for a moment. When you have 45000 points and you use 10-fold cross-validation, what is the size of each fold? 45000/10, i.e. 4500.

This means that each of your folds will contain 4500 data points. One of those folds will be used for testing and the remaining nine for training, i.e.

For testing: one fold => 4500 data points => size: 4500
For training: remaining folds => 45000-4500 data points => size: 45000-4500=40500

Thus, for the first iteration, the first 4500 data points (indices 0 to 4499) will be used for testing and the rest for training. (Check the image below.)
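The fold arithmetic above can be checked without loading any data. The following is a minimal sketch in plain Python that mimics how an unshuffled K-fold split hands out contiguous index blocks; `kfold_indices` is a hypothetical helper written for this illustration, not part of sklearn.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) lists the way an unshuffled K-fold split does."""
    # If n_samples is not divisible by n_splits, the first
    # n_samples % n_splits folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        # The held-out block is contiguous; everything else is training data.
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


# Print the fold number, training size, test size, and test index range.
for fold, (train_idx, test_idx) in enumerate(kfold_indices(45000, 10)):
    print(fold, len(train_idx), len(test_idx), test_idx[0], test_idx[-1])
```

With 45000 samples and 10 splits, every iteration holds out 4500 indices and trains on the other 40500, which matches the sizes worked out above.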

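Given that your data is x_train: torch.Size([45000, 784]) and y_train: torch.Size([45000]), index both the training fold and the test fold from x_train and y_train (not from x_test and y_test, as in your snippet). This is how your code should look: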

for train_index, test_index in kfold.split(x_train, y_train):
    print(train_index, test_index)

    # Slice both the training fold and the held-out fold from x_train / y_train
    x_train_fold = x_train[train_index]
    y_train_fold = y_train[train_index]
    x_test_fold = x_train[test_index]
    y_test_fold = y_train[test_index]

    print(x_train_fold.shape, y_train_fold.shape)
    print(x_test_fold.shape, y_test_fold.shape)
    break  # inspect only the first fold

[ 4500  4501  4502 ... 44997 44998 44999] [   0    1    2 ... 4497 4498 4499]
torch.Size([40500, 784]) torch.Size([40500])
torch.Size([4500, 784]) torch.Size([4500])

So, when you say

"I want the variable x_train_fold to be the first 4500 pictures... shape torch.Size([4500, 784])."

you're wrong. This size corresponds to x_test_fold. In the first iteration, based on 10 folds, x_train_fold will have 40500 points, so its size is supposed to be torch.Size([40500, 784]).
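Once the fold shapes are right, the same loop can train and score a model on each fold. The following is only a sketch under stated assumptions: random tensors stand in for MNIST, `make_model` and `run_cv` are hypothetical helpers, and the layer sizes, learning rate, and epoch count are arbitrary choices that do not come from the question.

```python
import torch
from torch import nn
from sklearn.model_selection import KFold


def make_model(n_features, n_classes):
    # Hypothetical small feed-forward net; the layer sizes are arbitrary.
    return nn.Sequential(
        nn.Linear(n_features, 32),
        nn.ReLU(),
        nn.Linear(32, n_classes),
    )


def run_cv(x, y, n_splits=10, epochs=5, lr=0.1):
    """Train a fresh model per fold and return the held-out accuracy of each."""
    n_classes = int(y.max()) + 1
    accuracies = []
    for train_index, test_index in KFold(n_splits=n_splits).split(x, y):
        # Both folds are sliced from the same tensors.
        x_train_fold, y_train_fold = x[train_index], y[train_index]
        x_test_fold, y_test_fold = x[test_index], y[test_index]

        # Re-create the model so no weights leak from one fold to the next.
        model = make_model(x.shape[1], n_classes)
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()

        model.train()
        for epoch in range(epochs):
            # Full-batch gradient step on the training fold.
            optimizer.zero_grad()
            loss = loss_fn(model(x_train_fold), y_train_fold)
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            predictions = model(x_test_fold).argmax(dim=1)
        accuracies.append((predictions == y_test_fold).float().mean().item())
    return accuracies


# Synthetic stand-in for MNIST: 200 samples, 784 features, 10 classes.
torch.manual_seed(0)
x_demo = torch.randn(200, 784)
y_demo = torch.randint(0, 10, (200,))
fold_accuracies = run_cv(x_demo, y_demo)
print(len(fold_accuracies), sum(fold_accuracies) / len(fold_accuracies))
```

Averaging the per-fold accuracies gives the cross-validation estimate. On random labels like these it should sit near chance, so the numbers only become meaningful once real data is substituted.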
