Unexpected average of GridSearchCV results

Question
I am trying to understand why I am getting the following situation - I am using the iris data and was doing cross-validation with a k-nearest neighbors classifier to choose the best k.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn import grid_search
from sklearn.cross_validation import train_test_split

# load the iris data
iris = load_iris()
X, Y = iris.data, iris.target

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=42)

parameters = {'n_neighbors': range(1, 21)}
knn = KNeighborsClassifier()  # the bare `sklearn` name is not imported, so use the imported class
clf = grid_search.GridSearchCV(knn, parameters, cv=10)
clf.fit(X_train, Y_train)
The clf object holds the results.
print clf.grid_scores_
[mean: 0.94000, std: 0.08483, params: {'n_neighbors': 1}, mean: 0.93000, std: 0.08251, params: {'n_neighbors': 2}, mean: 0.94000, std: 0.08456, params: {'n_neighbors': 3}, mean: 0.95000, std: 0.08101, params: {'n_neighbors': 4}, mean: 0.95000, std: 0.08562, params: {'n_neighbors': 5}, mean: 0.93000, std: 0.08284, params: {'n_neighbors': 6}, mean: 0.95000, std: 0.08512, params: {'n_neighbors': 7}, mean: 0.94000, std: 0.08414, params: {'n_neighbors': 8}, mean: 0.94000, std: 0.08414, params: {'n_neighbors': 9}, mean: 0.94000, std: 0.08414, params: {'n_neighbors': 10}, mean: 0.94000, std: 0.08483, params: {'n_neighbors': 11}, mean: 0.93000, std: 0.08284, params: {'n_neighbors': 12}, mean: 0.93000, std: 0.08284, params: {'n_neighbors': 13}, mean: 0.94000, std: 0.08414, params: {'n_neighbors': 14}, mean: 0.94000, std: 0.08483, params: {'n_neighbors': 15}, mean: 0.93000, std: 0.08284, params: {'n_neighbors': 16}, mean: 0.94000, std: 0.08483, params: {'n_neighbors': 17}, mean: 0.93000, std: 0.09458, params: {'n_neighbors': 18}, mean: 0.94000, std: 0.08483, params: {'n_neighbors': 19}, mean: 0.93000, std: 0.10887, params: {'n_neighbors': 20}]
However, when I look at the 10 CV scores for the first case, k=1,
print clf.grid_scores_[0].cv_validation_scores
we get
array([ 1. , 0.90909091, 1. , 0.72727273, 0.9 ,
1. , 1. , 1. , 1. , 0.88888889])
However, the mean of these 10 observations
print clf.grid_scores_[0].cv_validation_scores.mean()
is 0.942525252525, not the 0.940000 presented on the object.
So, I am very confused as to what the mean value is doing and why it is not the same. I read the documentation and I did not find anything that would help me. What am I missing?
One of the parameters of GridSearchCV is iid. It defaults to True, and its description reads:
If True, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds.
Essentially, the grid_scores_ attribute by default reports the mean score over all samples, rather than the mean of the per-fold scores. When the folds do not all contain the same number of samples (which happens whenever the training-set size is not divisible by 10 in 10-fold cross-validation), these two numbers won't match.
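The arithmetic can be checked by hand. The sketch below assumes hypothetical fold sizes inferred from the fractional scores in the question (0.90909091 = 10/11, 0.72727273 = 8/11, 0.9 = 9/10, 0.88888889 = 8/9), with the perfect-score folds sized so the 10 folds cover all 100 training samples:

```python
import numpy as np

# Per-fold scores from clf.grid_scores_[0].cv_validation_scores
scores = np.array([1., 0.90909091, 1., 0.72727273, 0.9,
                   1., 1., 1., 1., 0.88888889])

# Hypothetical fold sizes: the 11-, 10-, and 9-sample entries are
# forced by the fractional scores above; the rest are assumed so
# that the sizes sum to the 100 training samples.
sizes = np.array([10, 11, 10, 11, 10, 10, 10, 10, 9, 9])

print(scores.mean())                      # plain mean of the folds
print(np.average(scores, weights=sizes))  # mean over all samples
```

The unweighted mean is 0.94252525..., while the sample-weighted mean is exactly 0.94 (94 correct predictions out of 100), matching the value shown in grid_scores_.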