Difference between cross_val_score and cross_val_predict


Question



I want to evaluate a regression model built with scikit-learn using cross-validation, and I am confused about which of the two functions, cross_val_score and cross_val_predict, I should use. One option would be:

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

cvs = DecisionTreeRegressor(max_depth = depth)
scores = cross_val_score(cvs, predictors, target, cv=cvfolds, scoring='r2')
print("R2-Score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Another option is to use the cv predictions with the standard r2_score:

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score

cvp = DecisionTreeRegressor(max_depth = depth)
predictions = cross_val_predict(cvp, predictors, target, cv=cvfolds)
print("CV R^2-Score: {}".format(r2_score(target, predictions)))

I would assume that both methods are valid and give similar results. But that is only the case for a small number of folds. While the r^2 is roughly the same for 10-fold CV, it gets increasingly lower for higher k values in the first version using cross_val_score. The second version is mostly unaffected by changing the number of folds.

Is this behavior to be expected, or am I missing something about CV in scikit-learn?

Solution

cross_val_score returns a score for each test fold, whereas cross_val_predict returns the predicted y values for each test fold.

With cross_val_score(), you are using the average of the per-fold scores, which will be affected by the number of folds, because some folds may have a high error (a poor fit).

Whereas cross_val_predict() returns, for each element in the input, the prediction that was obtained for that element when it was in the test set. (Note that only cross-validation strategies that assign all elements to a test set exactly once can be used.) So increasing the number of folds only increases the training data available for each test element, and hence the result may not be affected much.
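To make this concrete, here is a minimal pure-Python sketch (the numbers are invented for illustration) of why averaging per-fold R^2 can diverge from pooling the predictions first: a small fold has a tiny variance of its own, so its R^2 can be strongly negative even when the absolute errors are small.

```python
# Toy illustration: mean of per-fold R^2 vs. pooled R^2.
# All data below is made up for demonstration purposes.

def r2(y_true, y_pred):
    """Plain R^2 = 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((yt - mean) ** 2 for yt in y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot

y_true = [1.0, 1.1, 5.0, 5.1, 9.0, 9.1]
y_pred = [v + 0.1 for v in y_true]   # every prediction off by only 0.1

# Three folds of two samples each (what cross_val_score averages over):
# within a fold, the targets barely vary, so SS_tot is tiny and R^2 tanks.
folds = [(0, 2), (2, 4), (4, 6)]
fold_scores = [r2(y_true[a:b], y_pred[a:b]) for a, b in folds]
mean_fold_r2 = sum(fold_scores) / len(fold_scores)

# One pooled score over all out-of-fold predictions
# (what r2_score applied to cross_val_predict output computes).
pooled_r2 = r2(y_true, y_pred)

print(f"mean per-fold R^2: {mean_fold_r2:.2f}")  # strongly negative
print(f"pooled R^2:        {pooled_r2:.2f}")     # close to 1
```

The smaller the folds (i.e. the larger k), the more such low-variance folds appear, which is consistent with the mean fold score dropping as k grows while the pooled score stays stable.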

Edit (after comment)

Please have a look at the following answer on how cross_val_predict works:

How is scikit-learn cross_val_predict accuracy score calculated?

I think that cross_val_predict may overfit, because as the number of folds increases, more data is used for training and less for testing, so the resulting labels depend more on the training data. Also, as mentioned above, the prediction for each sample is made only once, so it may be more susceptible to how the data happens to be split. That is why most places and tutorials recommend using cross_val_score for analysis.
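As a rough way to reproduce the asker's observation, the two scoring routes can be run side by side on synthetic data (the dataset, tree depth, and fold counts here are assumptions for the sketch, not taken from the question):

```python
# Hedged sketch: compare mean per-fold R^2 (cross_val_score) with the
# pooled R^2 of out-of-fold predictions (cross_val_predict + r2_score).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

for k in (10, 50):
    model = DecisionTreeRegressor(max_depth=3, random_state=0)
    # Route 1: score each test fold separately, then average.
    fold_scores = cross_val_score(model, X, y, cv=k, scoring='r2')
    # Route 2: collect out-of-fold predictions, then score once.
    preds = cross_val_predict(model, X, y, cv=k)
    print(f"k={k}: mean fold R^2 = {fold_scores.mean():.3f}, "
          f"pooled R^2 = {r2_score(y, preds):.3f}")
```

With large k the folds become small, so individual fold scores get noisier and can drag the mean down, while the pooled score over all 200 predictions barely moves.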

