How to get comparable and reproducible results from LogisticRegressionCV and GridSearchCV


Problem description


I want to score different classifiers with different parameters.

To speed up LogisticRegression I use LogisticRegressionCV (which is at least 2x faster) and plan to use GridSearchCV for the others.

The problem is that they give me the same optimal C parameter, but not the same AUC ROC score.

I tried fixing many parameters, such as scorer, random_state, solver, max_iter, tol... Please look at the example (the real data doesn't matter):

Test data and common part:

from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
y[y <= y.mean()] = 0; y[y > 0] = 1

import numpy as np
from sklearn.cross_validation import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegressionCV

fold = KFold(len(y), n_folds=5, shuffle=True, random_state=777)
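
As an aside, the imports above come from scikit-learn 0.17-era modules: `sklearn.cross_validation` and `sklearn.grid_search` were removed in scikit-learn 0.20, and `load_boston` was removed in 1.2. A minimal equivalent under the current API might look like the following sketch; the synthetic dataset and reduced parameter range are my own stand-ins, not part of the original question:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, GridSearchCV  # replaces cross_validation / grid_search
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in with the same shape as the Boston data (506 samples, 13 features).
X, y = make_classification(n_samples=506, n_features=13, random_state=777)

# The modern KFold no longer takes the number of samples as an argument.
fold = KFold(n_splits=5, shuffle=True, random_state=777)
grid = {'C': np.power(10.0, np.arange(-3, 3)), 'solver': ['newton-cg']}
gs = GridSearchCV(LogisticRegression(penalty='l2', max_iter=10000),
                  grid, scoring='roc_auc', cv=fold)
gs.fit(X, y)
print('gs.best_score_:', gs.best_score_)
```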

GridSearchCV

grid = {
    'C': np.power(10.0, np.arange(-10, 10))
     , 'solver': ['newton-cg']
}
clf = LogisticRegression(penalty='l2', random_state=777, max_iter=10000, tol=10)
gs = GridSearchCV(clf, grid, scoring='roc_auc', cv=fold)
gs.fit(X, y)

print ('gs.best_score_:', gs.best_score_)

gs.best_score_: 0.939162082194

LogisticRegressionCV

searchCV = LogisticRegressionCV(
    Cs=list(np.power(10.0, np.arange(-10, 10)))
    ,penalty='l2'
    ,scoring='roc_auc'
    ,cv=fold
    ,random_state=777
    ,max_iter=10000
    ,fit_intercept=True
    ,solver='newton-cg'
    ,tol=10
)
searchCV.fit(X, y)

print ('Max auc_roc:', searchCV.scores_[1].max())

Max auc_roc: 0.970588235294

The newton-cg solver is used just to provide a fixed value; I tried the others too. What have I forgotten?

P.S. In both cases I also get the warning "/usr/lib64/python3.4/site-packages/sklearn/utils/optimize.py:193: UserWarning: Line Search failed warnings.warn('Line Search failed')", which I don't understand either. I'd be happy if someone could also describe what it means, but I hope it is not relevant to my main question.

EDIT UPDATES

Following @joeln's comment, I also added the max_iter=10000 and tol=10 parameters. The result does not change by a single digit, but the warning disappeared.

Solution

Here is a copy of the answer by Tom on the scikit-learn issue tracker:

LogisticRegressionCV.scores_ gives the score for all the folds. GridSearchCV.best_score_ gives the best mean score over all the folds.

To get the same result, you need to change your code:

print('Max auc_roc:', searchCV.scores_[1].max())  # is wrong
print('Max auc_roc:', searchCV.scores_[1].mean(axis=0).max())  # is correct
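
To see why the two expressions differ, here is a small sketch using a made-up array with the same shape as `searchCV.scores_[1]` (n_folds rows by n_Cs columns; the numbers themselves are arbitrary, not real scores):

```python
import numpy as np

# Hypothetical stand-in for searchCV.scores_[1]: 5 folds x 20 values of C.
rng = np.random.default_rng(0)
scores = rng.uniform(0.8, 1.0, size=(5, 20))

best_single_fold = scores.max()        # best score on any single fold (optimistic)
best_mean = scores.mean(axis=0).max()  # best mean over folds, comparable to GridSearchCV.best_score_
print(best_single_fold >= best_mean)   # a single fold's max can only be >= the best mean
```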

By also using the default tol=1e-4 instead of your tol=10, I get:

('gs.best_score_:', 0.939162082193857)
('Max auc_roc:', 0.93915947999923843)

The (small) remaining difference might come from warm starting in LogisticRegressionCV (which is actually what makes it faster than GridSearchCV).
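
Warm starting here means the solver for each value of C is initialized from the coefficients found for the previous C, rather than from scratch. A rough illustration of the idea using plain LogisticRegression with `warm_start=True` (my own analogy, not how LogisticRegressionCV is implemented internally):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# With warm_start=True, each fit() resumes from the previous coefficients,
# so stepping through a path of C values typically needs fewer iterations per step.
clf = LogisticRegression(warm_start=True, solver='lbfgs', max_iter=10000)
for C in [0.001, 0.01, 0.1, 1.0, 10.0]:
    clf.set_params(C=C)
    clf.fit(X, y)
print(clf.coef_.shape)
```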

