Regression model evaluation using scikit-learn


Problem description


I am doing regression with sklearn and using randomized grid search to evaluate different parameters. Here is a toy example:

from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, make_scorer
from scipy.stats import randint as sp_randint
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.cross_validation import LeaveOneOut  # moved to sklearn.model_selection in recent releases
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV  # likewise sklearn.model_selection now
X, y = make_regression(n_samples=10,
                       n_features=10,
                       n_informative=3,
                       random_state=0,
                       shuffle=False)

clf = ExtraTreesRegressor(random_state=12)
param_dist = {"n_estimators": [5, 10],
              "max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(1, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False]}
rmse = make_scorer(mean_squared_error, greater_is_better=False)  # note: defined but never passed to the search below
r = RandomizedSearchCV(clf, param_distributions=param_dist,
                       cv=10,
                       scoring='mean_squared_error',  # 'neg_mean_squared_error' in recent releases
                       n_iter=3,
                       n_jobs=2)
r.fit(X, y)

My questions are:

1) Does RandomizedSearchCV use R² as its scoring function? The default scoring function for regression is not documented.

2) Even though I used mean_squared_error as the scoring function in the code, why are the scores negative (shown below)? mean_squared_error should always be positive. And when I compute r.score(X,y), it seems to report R² again. The scores in all these contexts are very confusing to me.

In [677]: r.grid_scores_
Out[677]: 
[mean: -35.18642, std: 13.81538, params: {'bootstrap': True, 'min_samples_leaf': 9, 'n_estimators': 5, 'min_samples_split': 3, 'max_features': 3, 'max_depth': 3},
 mean: -15.07619, std: 6.77384, params: {'bootstrap': False, 'min_samples_leaf': 7, 'n_estimators': 10, 'min_samples_split': 10, 'max_features': 10, 'max_depth': None},
 mean: -17.91087, std: 8.97279, params: {'bootstrap': True, 'min_samples_leaf': 7, 'n_estimators': 10, 'min_samples_split': 7, 'max_features': 7, 'max_depth': None}]

In [678]: r.grid_scores_[0].cv_validation_scores
Out[678]: 
array([-37.74058826, -26.73444271, -36.15443525, -23.11874605,
       -33.60726519, -33.4821689 , -36.14897322, -43.80499446,
       -68.50480995, -12.97342433])

In [680]: r.score(X,y)
Out[680]: 0.87989839693054017

Solution

  1. Just like GridSearchCV, RandomizedSearchCV uses the score method on the estimator by default. ExtraTreesRegressor and other regression estimators return the R² score from this method (classifiers return accuracy).
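As a quick check on that default, R² can be computed by hand. This is a plain-Python sketch of the formula a regressor's score method implements, not scikit-learn's actual code:

```python
def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the score regressors report by default."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

print(r2([1, 2, 3], [1, 2, 3]))  # perfect fit -> 1.0
print(r2([1, 2, 3], [2, 2, 2]))  # predicting the mean -> 0.0
```

This is why a good regressor's score is near 1 while the search's MSE-based scores live on a completely different scale.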

  2. The convention is that a score is something to maximize. Mean squared error is a loss function to minimize, so it's negated inside the search.
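The sign flip can be sketched in a few lines. This is a toy imitation of make_scorer's greater_is_better handling (Identity is a stand-in estimator invented for the demo), not the real implementation:

```python
def mse(y_true, y_pred):
    """Plain mean squared error, always non-negative."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def make_scorer_sketch(metric, greater_is_better=True):
    # Losses are negated so that "bigger is always better" holds for every scorer.
    sign = 1 if greater_is_better else -1
    def scorer(estimator, X, y):
        return sign * metric(y, estimator.predict(X))
    return scorer

class Identity:
    """Trivial stand-in estimator: predicts the first feature."""
    def predict(self, X):
        return [row[0] for row in X]

scorer = make_scorer_sketch(mse, greater_is_better=False)
print(scorer(Identity(), [[1], [2]], [2, 2]))  # -0.5: the loss, negated
```

So the -35.18642 in grid_scores_ means an MSE of 35.18642; the search still picks the parameter set with the largest (least negative) score, which is the smallest loss.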

"And then when I calculate r.score(X,y), it seems to report R² again."

That's not pretty. It's arguably a bug.
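The surprise can be illustrated with toy stand-ins (ToySearch and MeanRegressor are illustrative names, not scikit-learn classes): in the scikit-learn of the time, the search object's score ignored the scoring= argument and delegated to the best estimator's own score method, so R² reappeared. Later releases changed this so that score applies the search's own scorer.

```python
class MeanRegressor:
    """Toy regressor that always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self
    def predict(self, X):
        return [self.mean_] * len(X)
    def score(self, X, y):
        # Regressors default to R^2 here, regardless of any scorer used elsewhere.
        y_pred = self.predict(X)
        mean = sum(y) / len(y)
        ss_res = sum((t - p) ** 2 for t, p in zip(y, y_pred))
        ss_tot = sum((t - mean) ** 2 for t in y)
        return 1 - ss_res / ss_tot

class ToySearch:
    """Old *SearchCV behavior: .score bypasses scoring= entirely."""
    def __init__(self, estimator, scoring):
        self.best_estimator_ = estimator
        self.scoring = scoring  # only consulted while searching
    def score(self, X, y):
        return self.best_estimator_.score(X, y)  # R^2 sneaks back in

X, y = [[0], [1], [2]], [0, 1, 2]
search = ToySearch(MeanRegressor().fit(X, y), scoring="neg_mean_squared_error")
print(search.score(X, y))  # 0.0 -- an R^2 value, not a negated MSE
```

That is exactly the pattern in the question: negative MSE-based scores during the search, then a positive R²-looking 0.8799 from r.score(X,y).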
