Unexpected cross-validation scores with scikit-learn LinearRegression


Problem description


I am trying to learn to use scikit-learn for some basic statistical learning tasks. I thought I had successfully created a LinearRegression model fit to my data:

from sklearn import cross_validation, linear_model

# single random 80/20 hold-out split
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y,
    test_size=0.2, random_state=0)

model = linear_model.LinearRegression()
model.fit(X_train, y_train)
print model.score(X_test, y_test)  # r^2 on the held-out test set

Which yields:

0.797144744766

Then I wanted to do multiple similar 4:1 splits via automatic cross-validation:

model = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(model, X, y, cv=5)  # cv=5 -> 5 folds
print scores

And I get output like this:

[ 0.04614495 -0.26160081 -3.11299397 -0.7326256  -1.04164369]

How can the cross-validation scores be so different from the score of the single random split? They are both supposed to be using r2 scoring, and the results are the same if I pass the scoring='r2' parameter to cross_val_score.

I've tried a number of different options for the random_state parameter to cross_validation.train_test_split, and they all give similar scores in the 0.7 to 0.9 range.
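A minimal sketch of that check, assuming X and y are already loaded as in the snippets above (the seed values here are just examples):

from sklearn import cross_validation, linear_model

# Repeat the single 80/20 hold-out split with a few different seeds and
# print the r^2 score for each one.
for seed in [0, 1, 2, 3, 4]:
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = linear_model.LinearRegression()
    model.fit(X_train, y_train)
    print seed, model.score(X_test, y_test)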

I am using sklearn version 0.16.1.

Solution

train_test_split seems to generate random splits of the dataset, while cross_val_score uses consecutive sets, i.e.

"When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default"

http://scikit-learn.org/stable/modules/cross_validation.html
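To see what that default behaviour means in practice, here is a small sketch (toy data, sklearn 0.16 cross_validation API) showing that an unshuffled KFold hands out consecutive blocks of indices as test folds:

from sklearn import cross_validation

# KFold without shuffling splits the index range into consecutive blocks.
kf = cross_validation.KFold(n=10, n_folds=5)  # shuffle=False is the default
for train_index, test_index in kf:
    print test_index
# prints [0 1], [2 3], [4 5], [6 7], [8 9]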

Depending on the nature of your data set (e.g. data that is highly correlated over the length of one segment), consecutive folds can give fits that differ vastly from those obtained on random samples drawn from the whole data set.
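If that is what is happening here, one way to make the cross-validation folds comparable to the random hold-out split is to pass an explicitly shuffled KFold object as cv. A sketch, assuming X and y are defined as in the question:

from sklearn import cross_validation, linear_model

# Shuffle the indices before forming the 5 folds, so each fold is a random
# 4:1 split instead of a consecutive block.
model = linear_model.LinearRegression()
kf = cross_validation.KFold(len(y), n_folds=5, shuffle=True, random_state=0)
scores = cross_validation.cross_val_score(model, X, y, cv=kf)
print scores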
