Unexpected cross-validation scores with scikit-learn LinearRegression
Problem description
I am trying to learn to use scikit-learn for some basic statistical learning tasks. I thought I had successfully created a LinearRegression model fit to my data:
from sklearn import cross_validation, linear_model  # sklearn 0.16-era imports

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=0.2, random_state=0)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
print model.score(X_test, y_test)
Which yields:
0.797144744766
Then I wanted to do multiple similar 4:1 splits via automatic cross-validation:
model = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(model, X, y, cv=5)
print scores
And I get output like this:
[ 0.04614495 -0.26160081 -3.11299397 -0.7326256 -1.04164369]
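(As an aside, the negative values here are not a bug: R² goes negative whenever the model's predictions on the test fold are worse than simply predicting that fold's mean. A minimal illustration with scikit-learn's r2_score, using made-up numbers:)

```python
from sklearn.metrics import r2_score

# Predicting a constant far from the true targets is worse than
# predicting their mean, so R-squared drops well below zero.
y_true = [1.0, 2.0, 3.0]
y_pred = [10.0, 10.0, 10.0]
print(r2_score(y_true, y_pred))  # -96.0
```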
How can the cross-validation scores be so different from the score of the single random split? Both are supposed to use R² scoring, and the results are the same if I pass scoring='r2' to cross_val_score.
I've tried a number of different options for the random_state parameter to cross_validation.train_test_split, and they all give similar scores in the 0.7 to 0.9 range.
I am using sklearn version 0.16.1.
train_test_split generates random splits of the dataset, while cross_val_score with an integer cv uses consecutive folds; as the docs put it:
"When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default"
http://scikit-learn.org/stable/modules/cross_validation.html
Depending on the nature of your data set (e.g. data that is highly correlated over the length of one segment), consecutive folds can give fits that differ vastly from those obtained on random samples drawn from the whole data set.
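To make cross_val_score behave like repeated random splits, pass a shuffled KFold object as cv instead of an integer. A sketch with synthetic, index-correlated data (note: this uses the current scikit-learn API, where the cross_validation module from the question was renamed model_selection in 0.18; the data here is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data whose target drifts with the row index, mimicking
# data that is correlated over the length of a segment.
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
trend = np.linspace(0, 10, 100)  # index-dependent component
y = X @ np.array([1.0, 2.0, 3.0]) + trend + 0.1 * rng.randn(100)

model = LinearRegression()

# Integer cv: consecutive, unshuffled folds (KFold default).
consecutive = cross_val_score(model, X, y, cv=5)

# Shuffled folds mimic repeated random train/test splits.
shuffled = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

print(consecutive.mean(), shuffled.mean())
```

On data like this, the shuffled folds score noticeably higher than the consecutive ones, because each consecutive test block sits in a region of the trend that the training data never covered.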