Retrieving specific classifiers and data from GridSearchCV

Question

I am running a Python 3 classification script on a server using the following code:

from sklearn import neighbors
from sklearn.model_selection import GridSearchCV

# define knn classifier for transformed data
knn_classifier = neighbors.KNeighborsClassifier()

# define KNN parameters
knn_parameters = [{
    'n_neighbors': [1,3,5,7, 9, 11],
    'leaf_size': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'n_jobs': [-1],
    'weights': ['uniform', 'distance']}]

# Stratified k-fold (default for classifier)
# n = 5 folds is default
knn_models = GridSearchCV(estimator = knn_classifier, param_grid = knn_parameters, scoring = 'accuracy')

# fit grid search models to transformed training data
knn_models.fit(X_train_transformed, y_train)

I then save the GridSearchCV object using pickle:

import pickle

# save model
with open('knn_models.pickle', 'wb') as f:
    pickle.dump(knn_models, f)

So I can test the classifiers on smaller datasets on my local machine by running:

# load the pickled grid search; the with-block closes the file afterwards
with open("knn_models.pickle", "rb") as f:
    knn_models = pickle.load(f)
validation_knn_model = knn_models.best_estimator_

That works well if I only want to evaluate the best estimator on a validation set. But what I'd actually like to do is:

  • pull the original data out of the GridSearchCV object (I'm assuming it's stored somewhere in the object because to classify the new validation set, this is required)
  • try a few specific classifiers with almost all of the best parameters as determined by the grid search but changing a specific input parameter i.e. k = 3, 5, 7
  • retrieve y_pred i.e. the predictions for each validation set for all of the new classifiers that I tested above

Recommended answer

GridSearchCV does not store the original data (and it would arguably be absurd if it did). The only data it keeps is its own bookkeeping, i.e. the detailed scores and parameters tried for each CV fold. The returned best_estimator_ is all you need to apply the model to any new data, but if, as you say, you would like to dig deeper into the details, the full results are available in its cv_results_ attribute.
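As a minimal sketch of what this means in practice: a pickled search object can be restored and used for prediction, but the data to predict on has to be supplied separately. The iris data, the split, and the small grid below are assumptions made for illustration, not the asker's actual setup, and the pickle round trip is done in memory rather than through a file.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris stands in for the real data; the split is arbitrary.
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 3, 5]},
                      scoring='accuracy', cv=3)
search.fit(X_train, y_train)

# Round trip through pickle, as in the question (in memory here).
restored = pickle.loads(pickle.dumps(search))

# The validation data is passed in explicitly; it is not read from the object.
y_pred = restored.best_estimator_.predict(X_val)

# With the default refit=True, calling predict on the search object
# delegates to best_estimator_, so both give the same predictions.
same = (restored.predict(X_val) == y_pred).all()
```

So if the training or validation sets are needed later on another machine, they would have to be saved alongside the pickled search object.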

Adapting the example from the documentation to the knn classifier with your own knn_parameters grid (but removing n_jobs, which only affects fitting speed and is not a real hyperparameter of the algorithm), and keeping cv=3 for simplicity, we have:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
import pandas as pd

iris = load_iris()
knn_parameters = [{
    'n_neighbors': [1,3,5,7, 9, 11],
    'leaf_size': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'weights': ['uniform', 'distance']}]

knn_classifier = KNeighborsClassifier()
clf = GridSearchCV(estimator = knn_classifier, param_grid = knn_parameters, scoring = 'accuracy', n_jobs=-1, cv=3)
clf.fit(iris.data, iris.target)

clf.best_estimator_
# result:
KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

So, as said, this last result tells you all you need to know to apply the algorithm to any new data (validation, test, deployment, etc.). Also, you may find that removing the n_jobs entry from the knn_parameters grid and instead asking for n_jobs=-1 in the GridSearchCV object results in a much faster CV procedure. Nevertheless, if you want to use n_jobs=-1 for your final model, you can easily modify the best_estimator_ to do so:

clf.best_estimator_.n_jobs = -1
clf.best_estimator_
# result
KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
                     metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
                     weights='uniform')

This actually answers your second question, since you can similarly manipulate the best_estimator_ to change other hyperparameters, too.
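One possible way to do this for k = 3, 5, 7 is sketched below. It assumes you still have the training and validation data at hand (iris and an arbitrary split stand in for them here), and it uses a reduced grid to keep the run short. Each variant is a clone of best_estimator_ with only n_neighbors changed, refitted on the training data, since in general changing a parameter on an already fitted estimator does not refit it.

```python
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris and this split stand in for the real train/validation data.
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Reduced grid, for illustration only.
param_grid = {'n_neighbors': [1, 3, 5, 7, 9, 11],
              'weights': ['uniform', 'distance']}
clf = GridSearchCV(KNeighborsClassifier(), param_grid,
                   scoring='accuracy', cv=3)
clf.fit(X_train, y_train)

predictions = {}
for k in (3, 5, 7):
    # clone() copies the hyperparameters but not the fitted state,
    # so best_estimator_ itself is left untouched.
    model = clone(clf.best_estimator_).set_params(n_neighbors=k)
    model.fit(X_train, y_train)
    predictions[k] = model.predict(X_val)  # y_pred for this variant

for k, y_pred in predictions.items():
    print(k, accuracy_score(y_val, y_pred))
```

The predictions dictionary then holds one y_pred array per value of k, which covers the third point of the question as well.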

Finding the best model is where most people would stop. But if, for any reason, you want to dig further into the details of the whole grid search process, the detailed results are available in the cv_results_ attribute, which you can even load into a pandas DataFrame for easier inspection:

cv_results = pd.DataFrame.from_dict(clf.cv_results_)

For example, the cv_results dataframe includes a column rank_test_score which, as its name clearly implies, contains the rank of each parameter combination:

cv_results['rank_test_score']
# result:
0      481
1      481
2      145
3      145
4        1
      ... 
571      1
572    145
573    145
574    433
575      1
Name: rank_test_score, Length: 576, dtype: int32

Here 1 means best, and you can readily see that more than one combination is ranked 1, so in fact we have more than one "best" model (i.e. parameter combination)! Although here this is most probably due to the relative simplicity of the iris dataset, there is no reason in principle why it cannot happen in a real case, too. In such cases, the returned best_estimator_ is just the first of these occurrences, here combination number 4:

cv_results.iloc[4]
# result:
mean_fit_time                                              0.000669559
std_fit_time                                               1.55811e-05
mean_score_time                                             0.00474652
std_score_time                                             0.000488042
param_algorithm                                                   auto
param_leaf_size                                                      5
param_n_neighbors                                                    5
param_weights                                                  uniform
params               {'algorithm': 'auto', 'leaf_size': 5, 'n_neigh...
split0_test_score                                                 0.98
split1_test_score                                                 0.98
split2_test_score                                                 0.98
mean_test_score                                                   0.98
std_test_score                                                       0
rank_test_score                                                      1
Name: 4, dtype: object

As you can easily see, this has the same parameters as our best_estimator_ above. But now you can inspect all the "best" models, simply by:

cv_results.loc[cv_results['rank_test_score']==1]

which, in my case, yields no fewer than 144 models (out of the 6*12*4*2 = 576 tried)! So you can in fact choose among more candidates, or even apply additional criteria, say the standard deviation of the returned score (the lower the better, although here it is at its minimum value of 0), instead of relying simply on the maximum mean score, which is what the automatic procedure returns.
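A rough sketch of such a manual selection follows. The tie-break rule (lowest score spread first, then lowest mean scoring time) is a hypothetical choice for illustration, not something GridSearchCV does itself, and the grid is reduced to keep the run short.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Reduced grid, for illustration only.
param_grid = {'n_neighbors': [1, 3, 5, 7, 9, 11],
              'weights': ['uniform', 'distance']}
clf = GridSearchCV(KNeighborsClassifier(), param_grid,
                   scoring='accuracy', cv=3)
clf.fit(iris.data, iris.target)

cv_results = pd.DataFrame(clf.cv_results_)

# All parameter combinations tied for the best mean test score.
top = cv_results.loc[cv_results['rank_test_score'] == 1]

# Hypothetical tie-break: smallest std of the fold scores,
# then smallest mean scoring time.
chosen = top.sort_values(['std_test_score', 'mean_score_time']).iloc[0]

# Build and fit a fresh classifier from the chosen parameter dict.
model = KNeighborsClassifier(**chosen['params'])
model.fit(iris.data, iris.target)
print(chosen['params'])
```

Because the timings vary from run to run, the combination picked by this particular rule may differ between runs whenever several candidates also share the same standard deviation.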

Hopefully these will be enough to get you started...
