Leave-one-out cross-validation


Question

I am trying to evaluate a multivariable dataset by leave-one-out cross-validation and then remove those samples not predictive of the original dataset (Benjamini-corrected, FDR > 10%).

Using the docs on cross-validation, I've found the leave-one-out iterator. However, when trying to get the score for the nth fold, an exception is raised saying that more than one sample is needed. Why does .predict() work while .score() doesn't? How can I get the score for a single sample? Do I need to use another approach?

Failing code:

from sklearn import ensemble, cross_validation, datasets

dataset = datasets.load_linnerud()
x, y = dataset.data, dataset.target
clf = ensemble.RandomForestRegressor(n_estimators=500)

loo = cross_validation.LeaveOneOut(x.shape[0])
for train_i, test_i in loo:
    score = clf.fit(x[train_i], y[train_i]).score(x[test_i], y[test_i])
    print('Sample %d score: %f' % (test_i[0], score))

The exception raised:

ValueError: r2_score can only be computed given more than one sample.


Edit:

I am not asking why this doesn't work, but for a different approach that does. After fitting/training my model, how do I test how well a single sample fits the trained model?

Answer

cross_validation.LeaveOneOut(x.shape[0]) creates as many folds as there are rows, so each validation run holds only a single test instance.

Now, to draw a "line" you need two points, whereas with your one instance you only have one point. That is what the error message says: it needs more than one instance (or sample) to draw the "line" used to compute the r² value.

Generally, in the ML world, people report 10-fold or 5-fold cross-validation results, so I would recommend setting n to 10 or 5 accordingly.
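As a sketch of that recommendation, assuming a recent scikit-learn release (where the old cross_validation module became model_selection, and KFold takes n_splits), 5-fold cross-validation on the same dataset might look like this; n_estimators is reduced here only to keep the sketch fast:

```python
from sklearn import datasets, ensemble
from sklearn.model_selection import KFold, cross_val_score

dataset = datasets.load_linnerud()
x, y = dataset.data, dataset.target  # 20 samples, 3 targets

clf = ensemble.RandomForestRegressor(n_estimators=50, random_state=0)

# Each of the 5 validation folds now holds 4 samples,
# so the r^2 score is well defined for every fold.
scores = cross_val_score(clf, x, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores)         # one r^2 score per fold
print(scores.mean())  # the usual single number people report
```

Averaging the per-fold scores gives the single cross-validated figure that is typically reported.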

After a quick discussion with @banana, we realized that the question was initially misunderstood. Since it is not possible to get an R² score for a single data point, an alternative is to calculate the distance between the actual and predicted points. This can be done using numpy.linalg.norm(clf.predict(x[test_i])[0] - y[test_i])
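A minimal sketch of that idea, a per-sample prediction error in place of a per-sample r², assuming the same data and model as in the question (modern scikit-learn API, and a smaller n_estimators only for speed):

```python
import numpy as np
from sklearn import datasets, ensemble
from sklearn.model_selection import LeaveOneOut

dataset = datasets.load_linnerud()
x, y = dataset.data, dataset.target
clf = ensemble.RandomForestRegressor(n_estimators=50, random_state=0)

distances = []
for train_i, test_i in LeaveOneOut().split(x):
    clf.fit(x[train_i], y[train_i])
    # Euclidean distance between the predicted and actual target vectors
    # for the single held-out sample; smaller means a better fit.
    distances.append(np.linalg.norm(clf.predict(x[test_i])[0] - y[test_i][0]))

for i, d in enumerate(distances):
    print('Sample %d distance: %f' % (i, d))
```

Samples with unusually large distances are the poorly predicted ones, which could then feed into the FDR-style filtering mentioned in the question.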

