How to split/partition a dataset into training and test datasets for, e.g., cross validation?

Question

What is a good way to split a NumPy array randomly into training and testing/validation dataset? Something similar to the cvpartition or crossvalind functions in Matlab.

Recommended answer

If you want to split the data set once in two parts, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices (remember to fix the random seed to make everything reproducible):

import numpy
numpy.random.seed(0)  # fix the seed so the split is reproducible
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)  # shuffles the rows of x in place
training, test = x[:80,:], x[80:,:]

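import numpy
numpy.random.seed(0)  # fix the seed so the split is reproducible
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])  # shuffled row indices; x keeps its order
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]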
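
The index-based variant can be wrapped into a small helper that takes the seed explicitly. This is only a minimal sketch and not part of the original answer: the function name split_train_test and its parameters test_fraction and seed are made up for illustration, and it assumes a NumPy version that provides numpy.random.default_rng (1.17 or later).

```python
import numpy


def split_train_test(x, test_fraction=0.2, seed=0):
    """Randomly split the rows of x into a training and a test array.

    Hypothetical helper for illustration. The row indices are returned too,
    so the same split can be applied to a matching label array.
    """
    rng = numpy.random.default_rng(seed)    # seeded generator, so the split is reproducible
    indices = rng.permutation(x.shape[0])   # shuffled row indices; x itself is not modified
    n_test = int(round(test_fraction * x.shape[0]))
    test_idx, training_idx = indices[:n_test], indices[n_test:]
    return x[training_idx], x[test_idx], training_idx, test_idx


# x is your dataset
x = numpy.random.rand(100, 5)
training, test, training_idx, test_idx = split_train_test(x, test_fraction=0.2, seed=42)
```

Passing the seed as an argument keeps the split reproducible without touching NumPy's global random state, which may matter if other code relies on that state.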

There are many other ways to repeatedly partition the same data set for cross validation. Many of those are available in the sklearn library (k-fold, leave-n-out, ...). sklearn also includes more advanced "stratified sampling" methods that create a partition of the data that is balanced with respect to some features, for example to make sure that the training and test sets contain the same proportion of positive and negative examples.
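
As a rough illustration of the sklearn route, the sketch below uses train_test_split with its stratify argument for a single balanced split and KFold for repeated partitioning, both from sklearn.model_selection. The toy data, the 30/70 class balance and the variable names are assumptions chosen for the example, not something taken from the answer.

```python
import numpy
from sklearn.model_selection import KFold, train_test_split

# Toy data (assumed for illustration): 100 samples, 5 features,
# binary labels with 30 positives and 70 negatives.
x = numpy.random.rand(100, 5)
y = numpy.array([1] * 30 + [0] * 70)

# Single stratified hold-out split: training and test parts keep
# the same proportion of positive and negative labels.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0
)

# 5-fold cross validation: each sample is used for testing in exactly one fold.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
folds = [(train_idx, test_idx) for train_idx, test_idx in kfold.split(x)]
```

If the folds themselves should be balanced by label, StratifiedKFold from the same module can be used instead of KFold; its split method takes the labels as a second argument.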
