Numpy: How to split/partition a dataset (array) into training and test datasets for, e.g., cross validation?
Question
What is a good way to split a numpy array randomly into training and testing/validation datasets? Something similar to the cvpartition or crossvalind functions in Matlab.
Recommended answer
If you want to split the data set once into two parts (an 80/20 split in the snippets that follow), you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices:
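import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
# shuffle reorders the rows of x in place (along the first axis)
numpy.random.shuffle(x)
# first 80 rows for training, remaining 20 for testing
training, test = x[:80,:], x[80:,:]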
or
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]
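The permutation approach can be wrapped in a small helper so that the split is reproducible and the original array is left untouched. This is only a sketch: the function name `train_test_split_rows` and its parameters `test_fraction` and `seed` are made up for illustration, and it assumes NumPy 1.17 or later for `numpy.random.default_rng`.

```python
import numpy as np


def train_test_split_rows(x, test_fraction=0.2, seed=None):
    """Split the rows of x into training and test parts without modifying x.

    Hypothetical helper for illustration; returns the index arrays as well
    so the split can be applied to a matching label array.
    """
    rng = np.random.default_rng(seed)      # seeded generator for repeatable splits
    n_samples = x.shape[0]
    n_test = int(round(n_samples * test_fraction))
    indices = rng.permutation(n_samples)   # shuffled row indices, each exactly once
    test_idx, training_idx = indices[:n_test], indices[n_test:]
    return x[training_idx], x[test_idx], training_idx, test_idx


# x is your dataset
x = np.random.default_rng(0).random((100, 5))
training, test, training_idx, test_idx = train_test_split_rows(
    x, test_fraction=0.2, seed=42
)
```

Passing the same `seed` gives the same split on every run, which helps when comparing models on identical data.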
There are many ways to repeatedly partition the same data set for cross validation. One strategy is to resample from the dataset with replacement, so the same row can be drawn more than once:
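import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
# randint samples row indices with replacement, so rows can repeat;
# the two draws are independent, so training and test may share rows
training_idx = numpy.random.randint(x.shape[0], size=80)
test_idx = numpy.random.randint(x.shape[0], size=20)
training, test = x[training_idx,:], x[test_idx,:]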
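Another of the many ways to partition the same data repeatedly is k-fold cross validation, where every row lands in the test set exactly once and training and test rows never overlap within a fold. A minimal NumPy-only sketch follows; the generator name `k_fold_indices` and its parameters are hypothetical, and it again assumes `numpy.random.default_rng` is available.

```python
import numpy as np


def k_fold_indices(n_samples, n_folds=5, seed=None):
    """Yield (training_idx, test_idx) pairs for k-fold cross validation.

    Hypothetical helper for illustration, not a library function.
    """
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2")
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)       # shuffle once up front
    folds = np.array_split(indices, n_folds)   # near-equal chunks, returned as a list
    for i in range(n_folds):
        test_idx = folds[i]
        # all remaining folds together form the training set
        training_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield training_idx, test_idx


# x is your dataset
x = np.random.default_rng(0).random((100, 5))
splits = list(k_fold_indices(x.shape[0], n_folds=5, seed=0))
for training_idx, test_idx in splits:
    training, test = x[training_idx, :], x[test_idx, :]
    # fit a model on `training` and evaluate it on `test` here
```

`numpy.array_split` is used instead of `numpy.split` because it also accepts sample counts that are not evenly divisible by the number of folds.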
Finally, scikit-learn (named scikits.learn when this answer was written) contains several cross validation methods (k-fold, leave-n-out, stratified-k-fold, ...). For the docs you might need to look at the examples or the latest git repository, but the code looks solid.
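For reference, here is roughly how this looks with a current scikit-learn release, where the package is imported as `sklearn` and the splitting tools live in `sklearn.model_selection`. The snippet assumes a recent scikit-learn is installed; the module layout in the old scikits.learn releases was different.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# x is your dataset
x = np.random.default_rng(0).random((100, 5))

# one-off 80/20 split; random_state makes it repeatable
training, test = train_test_split(x, test_size=0.2, random_state=0)

# 5-fold cross validation; split() yields index arrays for each fold
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for training_idx, test_idx in kf.split(x):
    fold_training, fold_test = x[training_idx], x[test_idx]
    fold_sizes.append((len(training_idx), len(test_idx)))
```

`StratifiedKFold` in the same module works the same way but also takes the label array, so that class proportions are preserved in each fold.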