Why can't I get the same results as GridSearchCV?


Problem description

GridSearchCV only returns a score for each parametrization, and I would also like to see an ROC curve to better understand the results. In order to do this, I would like to take the best performing model from GridSearchCV and reproduce these same results, but cache the probabilities. Here is my code:

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from tqdm import tqdm

import warnings
warnings.simplefilter("ignore")

data = make_classification(n_samples=100, n_features=20, n_classes=2, 
                           random_state=1, class_sep=0.1)
X, y = data


small_pipe = Pipeline([
    ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=100))), 
    ('clf', LogisticRegression())
])

params = {
    'clf__class_weight': ['balanced'],
    'clf__penalty'     : ['l1', 'l2'],
    'clf__C'           : [0.1, 0.5, 1.0],
    'rfs__max_features': [3, 5, 10]
}
key_feats = ['mean_train_score', 'mean_test_score', 'param_clf__C', 
             'param_clf__penalty', 'param_rfs__max_features']

skf = StratifiedKFold(n_splits=5, random_state=0)

all_results = list()
for _ in tqdm(range(25)):
    gs = GridSearchCV(small_pipe, param_grid=params, scoring='roc_auc', cv=skf, n_jobs=-1)
    gs.fit(X, y)
    results = pd.DataFrame(gs.cv_results_)[key_feats]
    all_results.append(results)


param_group = ['param_clf__C', 'param_clf__penalty', 'param_rfs__max_features']
all_results_df = pd.concat(all_results)
all_results_df.groupby(param_group).agg(['mean', 'std']
                    ).sort_values(('mean_test_score', 'mean'), ascending=False).head(20)

Here is my attempt at reproducing the results:

small_pipe_w_params = Pipeline([
    ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=100), max_features=3)), 
    ('clf', LogisticRegression(class_weight='balanced', penalty='l2', C=0.1))
])
skf = StratifiedKFold(n_splits=5, random_state=0)
all_scores = list()
for _ in range(25):
    scores = list()
    for train, test in skf.split(X, y):
        small_pipe_w_params.fit(X[train, :], y[train])
        probas = small_pipe_w_params.predict_proba(X[test, :])[:, 1]
        # cache probas here to build an Roc w/ conf interval later
        scores.append(roc_auc_score(y[test], probas))
    all_scores.extend(scores)

print('mean: {:<1.3f}, std: {:<1.3f}'.format(np.mean(all_scores), np.std(all_scores)))
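
As an aside, here is a minimal sketch of how the probabilities cached at the comment above could later be turned into an ROC curve with a rough confidence band; the cached_folds list of (y[test], probas) pairs is an assumption for illustration, it does not appear in the original code.

import numpy as np
from sklearn.metrics import roc_curve

# cached_folds is assumed: a list of (y[test], probas) pairs collected inside the CV loop above
mean_fpr = np.linspace(0, 1, 101)
tprs = []
for y_true, probas in cached_folds:
    fpr, tpr, _ = roc_curve(y_true, probas)
    tprs.append(np.interp(mean_fpr, fpr, tpr))  # interpolate each fold onto a common FPR grid
tprs = np.array(tprs)
mean_tpr, std_tpr = tprs.mean(axis=0), tprs.std(axis=0)
# plot mean_tpr against mean_fpr and shade mean_tpr +/- std_tpr for the confidence band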

I'm running the above multiple times, as the results seem unstable. I have created a challenging dataset because my own dataset is equally hard to learn. The groupby is meant to take all iterations of GridSearchCV and average & std the train and test scores to stabilize the results. I then pick out the best performing model (C=0.1, penalty=l2 and max_features=3 in my most recent run) and try to reproduce these same results by putting those params in deliberately.

The GridSearchCV model yields a 0.63 mean and 0.042 std ROC AUC score, whereas my own implementation gets a 0.59 mean and a 0.131 std. The grid search scores are considerably better. If I run this experiment out to 100 iterations for both GSCV and my own implementation, the results are similar.

Why are these results not the same? They both use StratifiedKFold() internally when an integer is supplied for cv... and maybe GridSearchCV weights the scores by the size of each fold? I'm not sure about that, though it would make sense. Is my implementation flawed?

edit: random_state added to SKFold

Answer

If you set the random_state of the RandomForestClassifier, the variation between different GridSearchCV runs is eliminated.
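
For reference, here is a sketch of the search set-up that change implies (assumed, the answer does not show it verbatim): only the pinned random_state values and the smaller n_estimators differ from the question's code, and params and skf are the same objects defined there. Keeping the full cv_results_ also makes the per-split scores used further down available.

small_pipe = Pipeline([
    ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=10, random_state=0))),
    ('clf', LogisticRegression(random_state=0))
])

gs = GridSearchCV(small_pipe, param_grid=params, scoring='roc_auc', cv=skf, n_jobs=-1)
gs.fit(X, y)
all_results_df = pd.DataFrame(gs.cv_results_)  # full cv_results_, including split*_test_score columns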

For simplicity, I have set n_estimators=10 and got the following result:

                                                        mean_train_score           mean_test_score
                                                                    mean       std            mean  std
param_clf__C  param_clf__penalty  param_rfs__max_features
1.0           l2                  5                             0.766701  0.000000        0.580727  0.0
                                  10                            0.768849  0.000000        0.577737  0.0

Now, if we use

all_results_df.sort_values(('mean_test_score'), ascending=False).head(1).T

we get:

    16
mean_fit_time   0.228381
mean_score_time 0.113187
mean_test_score 0.580727
mean_train_score    0.766701
param_clf__C    1
param_clf__class_weight balanced
param_clf__penalty  l2
param_rfs__max_features 5
params  {'clf__class_weight': 'balanced', 'clf__penalt...
rank_test_score 1
split0_test_score   0.427273
split0_train_score  0.807051
split1_test_score   0.47
split1_train_score  0.791745
split2_test_score   0.54
split2_train_score  0.789243
split3_test_score   0.78
split3_train_score  0.769856
split4_test_score   0.7
split4_train_score  0.67561
std_fit_time    0.00586908
std_score_time  0.00152781
std_test_score  0.13555
std_train_score 0.0470554

Let's reproduce it!

skf = StratifiedKFold(n_splits=5, random_state=0)
all_scores = list()

scores = []
weights = []


for train, test in skf.split(X, y):
    small_pipe_w_params = Pipeline([
                ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=10, 
                                                               random_state=0),max_features=5)), 
                ('clf', LogisticRegression(class_weight='balanced', penalty='l2', C=1.0,random_state=0))
            ])
    small_pipe_w_params.fit(X[train, :], y[train])
    probas = small_pipe_w_params.predict_proba(X[test, :])
    # cache probas here to build an Roc w/ conf interval later
    scores.append(roc_auc_score(y[test], probas[:,1]))
    weights.append(len(test))

print(scores)
print('mean: {:<1.6f}, std: {:<1.3f}'.format(np.average(scores, axis=0, weights=weights), np.std(scores)))

[0.42727272727272736, 0.47, 0.54, 0.78, 0.7]
mean: 0.580727, std: 0.135

Note: mean_test_score is not just a simple average; it is a weighted mean, because of the iid parameter.

From the docs:

iid : boolean, default=’warn’ If True, return the average score across folds, weighted by the number of samples in each test set. In this case, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds. If False, return the average score across folds. Default is True, but will change to False in version 0.21, to correspond to the standard definition of cross-validation.

Changed in version 0.20: Parameter iid will change from True to False by default in version 0.22, and will be removed in 0.24.
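
To make that weighting concrete, here is a tiny sketch with made-up fold sizes (in this particular question the five stratified test sets each contain 20 of the 100 samples, so the weighted and unweighted means happen to coincide):

import numpy as np

fold_scores = [0.6, 0.7, 0.8]   # hypothetical per-fold test scores
fold_sizes  = [50, 30, 20]      # hypothetical test-set sizes

simple_mean   = np.mean(fold_scores)                         # plain mean across folds: 0.70
weighted_mean = np.average(fold_scores, weights=fold_sizes)  # what iid=True reports: 0.67
print(simple_mean, weighted_mean)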
