Retrieving specific classifiers and data from GridSearchCV
Question
I am running a Python 3 classification script on a server using the following code:
# define knn classifier for transformed data
knn_classifier = neighbors.KNeighborsClassifier()
# define KNN parameters
knn_parameters = [{
    'n_neighbors': [1, 3, 5, 7, 9, 11],
    'leaf_size': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'n_jobs': [-1],
    'weights': ['uniform', 'distance']}]
# Stratified k-fold (default for classifier)
# n = 5 folds is default
knn_models = GridSearchCV(estimator = knn_classifier, param_grid = knn_parameters, scoring = 'accuracy')
# fit grid search models to transformed training data
knn_models.fit(X_train_transformed, y_train)
I then save the GridSearchCV object using pickle:
# save model
with open('knn_models.pickle', 'wb') as f:
    pickle.dump(knn_models, f)
So I can test the classifiers on smaller datasets on my local machine by running:
knn_models = pickle.load(open("knn_models.pickle", "rb"))
validation_knn_model = knn_models.best_estimator_
Which is great if I only want to evaluate the best estimator on a validation set. But what I'd actually like to do is:
- pull the original data out of the GridSearchCV object (I'm assuming it's stored somewhere in the object because to classify the new validation set, this is required)
- try a few specific classifiers with almost all of the best parameters as determined by the grid search, but changing a specific input parameter, i.e. k = 3, 5, 7
- retrieve y_pred, i.e. the predictions for each validation set for all of the new classifiers that I tested above
Recommended answer
GridSearchCV does not include the original data (and it would arguably be absurd if it did). The only data it includes is its own bookkeeping, i.e. the detailed scores and parameters tried for each CV fold. The returned best_estimator_ is the only thing needed to apply the model to any new data encountered, but if, as you say, you would like to dig deeper into the details, the full results are available in its cv_results_ attribute.
Adapting the example from the documentation to the knn classifier with your own knn_parameters grid (but removing n_jobs, which only affects the fitting speed and is not a real hyperparameter of the algorithm), and keeping cv=3 for simplicity, we have:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
import pandas as pd
iris = load_iris()
knn_parameters = [{
    'n_neighbors': [1, 3, 5, 7, 9, 11],
    'leaf_size': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'weights': ['uniform', 'distance']}]
knn_classifier = KNeighborsClassifier()
clf = GridSearchCV(estimator = knn_classifier, param_grid = knn_parameters, scoring = 'accuracy', n_jobs=-1, cv=3)
clf.fit(iris.data, iris.target)
clf.best_estimator_
# result:
KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
So, as said, this last result tells you all you need to know to apply the algorithm to any new data (validation, test, deployment, etc.). Also, you may find that removing the n_jobs entry from the knn_parameters grid and asking instead for n_jobs=-1 in the GridSearchCV object results in a much faster CV procedure. Nevertheless, if you want to use n_jobs=-1 in your final model, you can easily manipulate the best_estimator_ to do so:
clf.best_estimator_.n_jobs = -1
clf.best_estimator_
# result
KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
weights='uniform')
This actually answers your second question, since you can similarly manipulate the best_estimator_ to change other hyperparameters, too.
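A minimal sketch of that idea for k = 3, 5, 7 follows. It is an illustration under stated assumptions, not the only way to do it: the iris data and the train/validation split stand in for your own X_train_transformed, y_train and validation set, the grid is shortened to keep the run fast, and the names search and predictions are made up for this example. Each variant is cloned from best_estimator_ and refitted, which needs the training data to be loaded again, since the search object does not carry it.

```python
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): the search object does not store the training
# set, so it has to be loaded again from wherever it originally came from.
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Shortened grid (assumption) so the example runs quickly.
search = GridSearchCV(KNeighborsClassifier(),
                      {'n_neighbors': [1, 3, 5, 7, 9, 11],
                       'weights': ['uniform', 'distance']},
                      scoring='accuracy', cv=3)
search.fit(X_train, y_train)

predictions = {}
for k in (3, 5, 7):
    # clone() copies the hyperparameters of the best model, not its fitted state
    model = clone(search.best_estimator_).set_params(n_neighbors=k)
    model.fit(X_train, y_train)
    predictions[k] = model.predict(X_val)  # y_pred for this value of k
```

Working on clones leaves the original best_estimator_ untouched, so the unpickled search object can still be used as a reference afterwards.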
Finding the best model is where most people would stop. But if, for any reason, you want to dig further into the details of the whole grid search process, the detailed results are available in the cv_results_ attribute, which you can even import into a pandas dataframe for easier inspection:
cv_results = pd.DataFrame.from_dict(clf.cv_results_)
For example, the cv_results dataframe includes a column rank_test_score which, as its name clearly implies, contains the rank of each parameter combination:
cv_results['rank_test_score']
# result:
0 481
1 481
2 145
3 145
4 1
...
571 1
572 145
573 145
574 433
575 1
Name: rank_test_score, Length: 576, dtype: int32
Here 1 means best, and you can readily see that more than one combination is ranked 1, so in fact we have more than one "best" model (i.e. parameter combination) here! Although this is most probably due to the relative simplicity of the iris dataset, there is no reason in principle why it cannot happen in a real case, too. In such cases, the returned best_estimator_ is just the first of these occurrences, here combination number 4:
cv_results.iloc[4]
# result:
mean_fit_time 0.000669559
std_fit_time 1.55811e-05
mean_score_time 0.00474652
std_score_time 0.000488042
param_algorithm auto
param_leaf_size 5
param_n_neighbors 5
param_weights uniform
params {'algorithm': 'auto', 'leaf_size': 5, 'n_neigh...
split0_test_score 0.98
split1_test_score 0.98
split2_test_score 0.98
mean_test_score 0.98
std_test_score 0
rank_test_score 1
Name: 4, dtype: object
which, as you can easily see, has the same parameters as our best_estimator_ above. But now you can inspect all the "best" models, simply by:
cv_results.loc[cv_results['rank_test_score']==1]
which, in my case, results in no fewer than 144 models (out of the total 6*12*4*2 = 576 models tried)! So you can in fact select among more choices, or even use additional criteria, say the standard deviation of the returned score (the lower the better, although here it is at its minimum value of 0), instead of relying simply on the maximum mean score, which is what the automatic procedure returns.
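As a rough sketch of such a manual selection, the snippet below picks one model among the tied top-ranked candidates. The tie-break rules (lowest std_test_score first, then lowest mean_score_time) are only an illustrative choice, the grid is shortened to keep the run fast, and the names search, top, chosen and final_model are made up for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Shortened grid (assumption) so the example runs quickly.
search = GridSearchCV(KNeighborsClassifier(),
                      {'n_neighbors': [1, 3, 5, 7, 9, 11],
                       'weights': ['uniform', 'distance']},
                      scoring='accuracy', cv=3)
search.fit(iris.data, iris.target)

cv_results = pd.DataFrame(search.cv_results_)
top = cv_results.loc[cv_results['rank_test_score'] == 1]

# Among the tied candidates, prefer the smallest score spread across folds,
# then the fastest scoring time (both rules are illustrative assumptions).
chosen = top.sort_values(['std_test_score', 'mean_score_time']).iloc[0]
chosen_params = chosen['params']

# Refit a fresh classifier with the chosen parameter combination.
final_model = KNeighborsClassifier(**chosen_params).fit(iris.data, iris.target)
```

The params column stores each combination as a plain dict, which is why it can be unpacked straight into the KNeighborsClassifier constructor.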
Hopefully these will be enough to get you started...