Retrieving specific classifiers and data from GridSearchCV

Question

I am running a Python 3 classification script on a server using the following code:

from sklearn import neighbors
from sklearn.model_selection import GridSearchCV

# define knn classifier for transformed data
knn_classifier = neighbors.KNeighborsClassifier()

# define KNN parameters
knn_parameters = [{
    'n_neighbors': [1, 3, 5, 7, 9, 11],
    'leaf_size': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'n_jobs': [-1],
    'weights': ['uniform', 'distance']}]

# Stratified k-fold (default for classifier)
# n = 5 folds is default
knn_models = GridSearchCV(estimator=knn_classifier, param_grid=knn_parameters, scoring='accuracy')

# fit grid search models to transformed training data
knn_models.fit(X_train_transformed, y_train)

The GridSearchCV object is then saved using pickle:

import pickle

# save model
with open('knn_models.pickle', 'wb') as f:
    pickle.dump(knn_models, f)

So I can test the classifiers on smaller datasets on my local machine by running:

import pickle

knn_models = pickle.load(open("knn_models.pickle", "rb"))
validation_knn_model = knn_models.best_estimator_

Which is great if I only want to evaluate the best estimator on a validation set. But what I'd actually like to do is:


  • pull the original data out of the GridSearchCV object (I'm assuming it's stored somewhere in the object, since it would be required to classify a new validation set)
  • try a few specific classifiers with almost all of the best parameters as determined by the grid search, but changing a specific input parameter, i.e. k = 3, 5, 7
  • retrieve y_pred, i.e. the predictions on each validation set for all of the new classifiers I tested above

Answer

GridSearchCV does not include the original data (and it would be arguably absurd if it did). The only data it includes is its own bookkeeping, i.e. the detailed scores & parameters tried for each CV fold. The best_estimator_ returned is the only thing needed to apply the model to any new data encountered; but if, as you say, you would like to dig deeper into the details, the full results are returned in its cv_results_ attribute.
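
For instance (a minimal sketch against the knn_models object loaded from your pickle above), the bookkeeping attributes are all you will find on it - the training data itself is not retained, so keep your own copy of X_train_transformed and y_train around if you intend to refit anything:

# the fitted GridSearchCV exposes only bookkeeping, not the data it was fit on
sorted(knn_models.cv_results_.keys())  # fit/score times, per-fold scores, params, ranks ...
knn_models.best_params_                # the winning parameter combination
knn_models.best_score_                 # its mean cross-validated accuracy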

Adapting the example from the documentation to the knn classifier with your own knn_parameters grid (but removing n_jobs, which only affects the fitting speed and is not a real hyperparameter of the algorithm), and keeping cv=3 for simplicity, we have:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
import pandas as pd

iris = load_iris()
knn_parameters = [{
    'n_neighbors': [1, 3, 5, 7, 9, 11],
    'leaf_size': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'weights': ['uniform', 'distance']}]

knn_classifier = KNeighborsClassifier()
clf = GridSearchCV(estimator=knn_classifier, param_grid=knn_parameters, scoring='accuracy', n_jobs=-1, cv=3)
clf.fit(iris.data, iris.target)

clf.best_estimator_
# result:
KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

So, as said, this last result tells you all you need to know to apply the algorithm to any new data (validation, test, data from deployment, etc.). Also, you may find that actually removing the n_jobs entry from the knn_parameters grid and asking instead for n_jobs=-1 in the GridSearchCV object results in a much faster CV procedure. Nevertheless, if you want to use n_jobs=-1 with your final model, you can easily manipulate the best_estimator_ to do so:

clf.best_estimator_.n_jobs = -1
clf.best_estimator_
# result
KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
                     metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
                     weights='uniform')

This actually answers your second question, since you can similarly manipulate the best_estimator_ to change other hyperparameters, too.
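
For instance (a minimal sketch covering your second and third bullet points: X_train_transformed, y_train, and a validation set X_val are assumed to exist, as in your own script), you could clone the best estimator, override only n_neighbors, and collect the predictions of each variant - refitting on your own copy of the training data, since, as said, the GridSearchCV object does not store it:

from sklearn.base import clone

y_preds = {}
for k in [3, 5, 7]:
    # keep all of the best parameters, overriding only n_neighbors
    model = clone(clf.best_estimator_).set_params(n_neighbors=k)
    model.fit(X_train_transformed, y_train)  # refit on your own stored training data
    y_preds[k] = model.predict(X_val)        # predictions of each new classifier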

So, having found the best model is where most people would stop. But if, for any reason, you want to dig further into the details of the whole grid search process, the detailed results are returned in the cv_results_ attribute, which you can even load into a pandas dataframe for easier inspection:

cv_results = pd.DataFrame.from_dict(clf.cv_results_)

For example, the cv_results dataframe includes a column rank_test_score which, as its name clearly implies, contains the rank of each parameter combination:

cv_results['rank_test_score']
# result:
0      481
1      481
2      145
3      145
4        1
      ... 
571      1
572    145
573    145
574    433
575      1
Name: rank_test_score, Length: 576, dtype: int32

Here 1 means best, and you can readily see that more than one combination is ranked 1 - so in fact here we have more than one "best" model (i.e. parameter combination)! Although here this is most probably due to the relative simplicity of the iris dataset used, there is no reason in principle why it could not happen in a real case, too. In such cases, the returned best_estimator_ is just the first of these occurrences - here, combination number 4:

cv_results.iloc[4]
# result:
mean_fit_time                                              0.000669559
std_fit_time                                               1.55811e-05
mean_score_time                                             0.00474652
std_score_time                                             0.000488042
param_algorithm                                                   auto
param_leaf_size                                                      5
param_n_neighbors                                                    5
param_weights                                                  uniform
params               {'algorithm': 'auto', 'leaf_size': 5, 'n_neigh...
split0_test_score                                                 0.98
split1_test_score                                                 0.98
split2_test_score                                                 0.98
mean_test_score                                                   0.98
std_test_score                                                       0
rank_test_score                                                      1
Name: 4, dtype: object

which, as you can easily see, has the same parameters as our best_estimator_ above. But now you can inspect all the "best" models, simply by:

cv_results.loc[cv_results['rank_test_score']==1]

which, in my case, results in no less than 144 models (out of the 6*12*4*2 = 576 models tried in total)! So you can in fact select among more choices, or even use additional criteria, say the standard deviation of the returned score (the smaller the better, although here it is at its minimum value of 0), instead of relying simply on the maximum mean score, which is what the automatic procedure returns.
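
For instance (a minimal sketch over the cv_results dataframe built above), you could break the ties among the top-ranked models by exactly such a criterion:

# among the tied rank-1 models, prefer those whose scores vary least across folds
best_models = cv_results.loc[cv_results['rank_test_score'] == 1]
best_models.sort_values('std_test_score')[['params', 'mean_test_score', 'std_test_score']].head()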

Hopefully these will be enough to get you started...
