Unexpected average of GridSearchCV results


Problem description


I am trying to understand why I am getting the following situation - I am using the iris data and was doing cross-validation with a k-nearest neighbors classifier to choose the best k.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn import grid_search                         # legacy module, removed in scikit-learn 0.20
from sklearn.cross_validation import train_test_split   # legacy module, removed in scikit-learn 0.20

# load the iris data
iris = load_iris()
X, Y = iris.data, iris.target

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=42)

# grid-search over k = 1..20 with 10-fold cross-validation
parameters = {'n_neighbors': range(1, 21)}
knn = KNeighborsClassifier()
clf = grid_search.GridSearchCV(knn, parameters, cv=10)
clf.fit(X_train, Y_train)

The clf object has the results.

print clf.grid_scores_

[mean: 0.94000, std: 0.08483, params: {'n_neighbors': 1}, mean: 0.93000, std: 0.08251, params: {'n_neighbors': 2}, mean: 0.94000, std: 0.08456, params: {'n_neighbors': 3}, mean: 0.95000, std: 0.08101, params: {'n_neighbors': 4}, mean: 0.95000, std: 0.08562, params: {'n_neighbors': 5}, mean: 0.93000, std: 0.08284, params: {'n_neighbors': 6}, mean: 0.95000, std: 0.08512, params: {'n_neighbors': 7}, mean: 0.94000, std: 0.08414, params: {'n_neighbors': 8}, mean: 0.94000, std: 0.08414, params: {'n_neighbors': 9}, mean: 0.94000, std: 0.08414, params: {'n_neighbors': 10}, mean: 0.94000, std: 0.08483, params: {'n_neighbors': 11}, mean: 0.93000, std: 0.08284, params: {'n_neighbors': 12}, mean: 0.93000, std: 0.08284, params: {'n_neighbors': 13}, mean: 0.94000, std: 0.08414, params: {'n_neighbors': 14}, mean: 0.94000, std: 0.08483, params: {'n_neighbors': 15}, mean: 0.93000, std: 0.08284, params: {'n_neighbors': 16}, mean: 0.94000, std: 0.08483, params: {'n_neighbors': 17}, mean: 0.93000, std: 0.09458, params: {'n_neighbors': 18}, mean: 0.94000, std: 0.08483, params: {'n_neighbors': 19}, mean: 0.93000, std: 0.10887, params: {'n_neighbors': 20}]

However, when I get the 10 CV results for the first case (k=1)

print clf.grid_scores_[0].cv_validation_scores

we get

array([ 1.        ,  0.90909091,  1.        ,  0.72727273,  0.9       ,
        1.        ,  1.        ,  1.        ,  1.        ,  0.88888889])

However, the mean of these 10 observations

print clf.grid_scores_[0].cv_validation_scores.mean()

is 0.942525252525, not the 0.940000 presented on the object.

So, I am very confused as to what the mean value is doing and why it is not the same. I read the documentation and I did not find anything that would help me. What am I missing?

Solution

One of the parameters of GridSearchCV is "iid". It takes a default value of True, and the description reads:

If True, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds.

Essentially, the grid_scores_ attribute by default reports the mean loss over all of the held-out samples rather than the mean of the per-fold losses. If the folds do not all contain the same number of data points, these two numbers won't match. That is the case here even though the 100 training samples divide evenly by 10: GridSearchCV uses stratified folds for a classifier, and the class counts do not split evenly into 10, so some folds end up with 9, 10 or 11 points (which is also why you see denominators like 11 and 9 in the individual fold scores).
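To make the arithmetic concrete, here is a small sketch of the two averaging schemes. The fold sizes below are an assumption: they are inferred from the denominators of the reported validation scores (0.90909091 ≈ 10/11, 0.72727273 ≈ 8/11, 0.88888889 ≈ 8/9) and from the 100-sample training set; they are not taken from the post itself.

import numpy as np

# per-fold validation scores reported for n_neighbors=1
scores = np.array([1., 0.90909091, 1., 0.72727273, 0.9,
                   1., 1., 1., 1., 0.88888889])

# assumed fold sizes (they sum to the 100 training samples); stratified
# 10-fold CV does not have to produce equal folds, and these particular
# sizes are inferred from the score denominators above
fold_sizes = np.array([10, 11, 10, 11, 10, 10, 10, 10, 9, 9])

# unweighted mean across folds -- what the question computed by hand
print(scores.mean())                           # 0.942525...

# per-sample mean, each fold weighted by its size -- what iid=True reports
print(np.average(scores, weights=fold_sizes))  # 0.94

With these weights the per-sample average corresponds to 94 correct predictions out of 100 held-out samples, which is exactly the 0.94 shown in grid_scores_, while the plain mean of the 10 fold scores is 0.9425.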
