"Parallel" pipeline to get best model using gridsearch
Question
In sklearn, a serial pipeline can be defined to find the best combination of hyperparameters across all consecutive steps of the pipeline. A serial pipeline can be implemented as follows:
from sklearn import datasets, decomposition
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

digits = datasets.load_digits()
X_train = digits.data
y_train = digits.target

# Use Principal Component Analysis to reduce dimensionality
# and improve generalization
pca = decomposition.PCA()

# Support vector classifier; the kernel is chosen by the grid below
svm = SVC()

# Combine PCA and SVC into a pipeline
pipe = Pipeline(steps=[('pca', pca), ('svm', svm)])

# Hyperparameters to search, keyed as <step name>__<parameter name>
n_components = [20, 40, 64]
params_grid = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'pca__n_components': n_components,
}
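The question defines the grid but stops before running the search. A minimal sketch of that last step is below; the reduced grid, `cv=3`, and the names `small_grid` and `search` are assumptions chosen so that it runs quickly, and are not part of the original post.

```python
from sklearn import datasets, decomposition
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = datasets.load_digits(return_X_y=True)

pipe = Pipeline(steps=[("pca", decomposition.PCA()), ("svm", SVC())])

# Deliberately small grid (assumption) so the sketch finishes quickly.
small_grid = {
    "pca__n_components": [20, 40],
    "svm__C": [1, 10],
    "svm__kernel": ["rbf"],
    "svm__gamma": [0.001],
}

# Every combination in the grid is cross-validated on the whole pipeline.
search = GridSearchCV(pipe, param_grid=small_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `best_params_` holds one value per key of the grid, using the same `<step>__<parameter>` names.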
But what if I want to try different algorithms for each step of the pipeline? How can I, for example, grid-search over
(Principal Component Analysis OR Singular Value Decomposition) AND (Support Vector Machines OR Random Forest)?
This would require some kind of second-level search or "meta-gridsearch", since the type of model would itself be one of the hyperparameters. Is that possible in sklearn?
Recommended answer
Pipeline supports None in its steps (the list of estimators), which lets you toggle certain parts of the pipeline off.
To skip an estimator, set the corresponding entry of the pipeline's named_steps to None in the params passed to GridSearchCV.
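As a rough illustration of toggling a step off outside of a grid search, the sketch below disables one step with `set_params`. It uses the string `'passthrough'`, which recent scikit-learn versions accept as an alternative to None for skipping a step; the fixed `n_components=20` is an assumption made for the example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

pipe = Pipeline(steps=[
    ("pca", PCA(n_components=20)),
    ("svd", TruncatedSVD(n_components=20)),
    ("svm", SVC()),
])

# Switch the 'svd' step off: data now flows pca -> svm.
pipe.set_params(svd="passthrough")
pipe.fit(X, y)

# Apply every step except the final classifier; only PCA changes the data.
reduced = pipe[:-1].transform(X)
print(reduced.shape)  # (1797, 20)
```

GridSearchCV applies the same mechanism: each candidate in the grid is passed to `set_params` on a clone of the pipeline.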
Let's assume you want to choose between PCA and TruncatedSVD.
pca = decomposition.PCA()
svd = decomposition.TruncatedSVD()
svm = SVC()
n_components = [20, 40, 64]
Add svd to the pipeline:
pipe = Pipeline(steps=[('pca', pca), ('svd', svd), ('svm', svm)])

# Change params_grid: instead of a single dict, make it a list of dicts.
# The first dict sets `svd` to None, the second sets `pca` to None.
params_grid = [
    {
        'svm__C': [1, 10, 100, 1000],
        'svm__kernel': ['linear', 'rbf'],
        'svm__gamma': [0.001, 0.0001],
        'pca__n_components': n_components,
        'svd': [None],
    },
    {
        'svm__C': [1, 10, 100, 1000],
        'svm__kernel': ['linear', 'rbf'],
        'svm__gamma': [0.001, 0.0001],
        'pca': [None],
        'svd__n_components': n_components,
        'svd__algorithm': ['randomized'],
    },
]
Now just pass the pipeline object to GridSearchCV:
grd = GridSearchCV(pipe, param_grid=params_grid)
Calling grd.fit() will search over both elements of the params_grid list, one dictionary at a time, trying every combination of the values within each dictionary.
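The size of that search can be checked without fitting anything. The sketch below uses `ParameterGrid` from `sklearn.model_selection` to expand the same list of dictionaries; the names `candidates` and `pca_only` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

n_components = [20, 40, 64]
params_grid = [
    {
        "svm__C": [1, 10, 100, 1000],
        "svm__kernel": ["linear", "rbf"],
        "svm__gamma": [0.001, 0.0001],
        "pca__n_components": n_components,
        "svd": [None],
    },
    {
        "svm__C": [1, 10, 100, 1000],
        "svm__kernel": ["linear", "rbf"],
        "svm__gamma": [0.001, 0.0001],
        "pca": [None],
        "svd__n_components": n_components,
        "svd__algorithm": ["randomized"],
    },
]

# Each dict is expanded on its own; the two expansions are concatenated.
candidates = list(ParameterGrid(params_grid))

# Candidates from the first dict keep PCA and disable SVD.
pca_only = [c for c in candidates if "svd" in c and c["svd"] is None]

print(len(candidates), len(pca_only))  # 96 48
```

No candidate ever combines keys from both dictionaries, so PCA and TruncatedSVD are never active in the same fit.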
If both estimators in your "OR" share the same parameter names, as in this case, where PCA and TruncatedSVD both have n_components (or if you only want to search over this parameter), this can be simplified as:
# Here I have changed the step name to `preprocessor`
pipe = Pipeline(steps=[('preprocessor', pca), ('svm', svm)])

# Now assign both estimators to `preprocessor` as below:
params_grid = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'preprocessor': [pca, svd],
    'preprocessor__n_components': n_components,
}
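Once such a search has been fitted, the chosen algorithm can be read off the best pipeline. The following is a sketch under assumptions: the grid is cut down and `cv=3` is used so that it runs quickly, and `small_grid`, `search`, and `best_preprocessor` are illustrative names.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

pipe = Pipeline(steps=[("preprocessor", PCA()), ("svm", SVC())])

# Reduced grid (assumption): 2 preprocessors x 2 sizes x 2 values of C.
small_grid = {
    "preprocessor": [PCA(), TruncatedSVD()],
    "preprocessor__n_components": [20, 40],
    "svm__C": [1, 10],
}

search = GridSearchCV(pipe, param_grid=small_grid, cv=3)
search.fit(X, y)

# The winning step is a fitted copy of one of the candidate estimators.
best_preprocessor = search.best_estimator_.named_steps["preprocessor"]
print(type(best_preprocessor).__name__, best_preprocessor.n_components)
```

Here the estimator itself behaves like any other hyperparameter: it appears under the `preprocessor` key of `best_params_` and in `cv_results_`.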
Generalization of this scheme
We can write a function that automatically populates the param_grid to be supplied to GridSearchCV with the appropriate values:
import itertools

def make_param_grids(steps, param_grids):
    final_params = []

    # itertools.product builds every combination, so that
    # (pca OR svd) AND (svm OR rf) becomes ->
    # (pca, svm), (pca, rf), (svd, svm), (svd, rf)
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}

        # step_name and estimator_name must correspond,
        # e.g. `preprocessor` must be either pca or select.
        for step_name, estimator_name in zip(steps.keys(), estimator_names):
            # dict.items() replaces the Python 2-only iteritems()
            for param, value in param_grids[estimator_name].items():
                if param == 'object':
                    # Set the actual estimator in the pipeline
                    current_grid[step_name] = [value]
                else:
                    # Set parameters corresponding to the above estimator
                    current_grid[step_name + '__' + param] = value

        # Append this dictionary to the final params
        final_params.append(current_grid)

    return final_params
And use this function on any number of transformers and estimators:
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.svm import SVC

# Put all the estimators you want to "OR" under a single key:
# OR between `pca` and `select`,
# OR between `svm` and `rf`.
# Different keys are evaluated as serial steps in the pipeline.
pipeline_steps = {
    'preprocessor': ['pca', 'select'],
    'classifier': ['svm', 'rf'],
}

# Fill in the parameters to be searched for each estimator
all_param_grids = {
    'svm': {'object': SVC(), 'C': [0.1, 0.2]},
    'rf': {'object': RandomForestClassifier(), 'n_estimators': [10, 20]},
    'pca': {'object': PCA(), 'n_components': [10, 20]},
    'select': {'object': SelectKBest(), 'k': [5, 10]},
}

# Call the function on the variables declared above
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
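To see what the helper produces, the sketch below restates it compactly for Python 3 and expands the result with `ParameterGrid`. The restated body and the names `grids` and `n_candidates` are assumptions for illustration; the inputs are the ones from the answer.

```python
import itertools

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVC


def make_param_grids(steps, param_grids):
    """Build one grid dict per combination of estimators across the steps."""
    final_params = []
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}
        for step_name, estimator_name in zip(steps.keys(), estimator_names):
            for param, value in param_grids[estimator_name].items():
                if param == "object":
                    # The estimator itself becomes a one-element choice.
                    current_grid[step_name] = [value]
                else:
                    current_grid[step_name + "__" + param] = value
        final_params.append(current_grid)
    return final_params


pipeline_steps = {"preprocessor": ["pca", "select"], "classifier": ["svm", "rf"]}
all_param_grids = {
    "svm": {"object": SVC(), "C": [0.1, 0.2]},
    "rf": {"object": RandomForestClassifier(), "n_estimators": [10, 20]},
    "pca": {"object": PCA(), "n_components": [10, 20]},
    "select": {"object": SelectKBest(), "k": [5, 10]},
}

grids = make_param_grids(pipeline_steps, all_param_grids)
for grid in grids:
    print(sorted(grid))

# 2 preprocessors x 2 classifiers = 4 dicts, each with 2 x 2 settings.
n_candidates = len(ParameterGrid(grids))
print(len(grids), n_candidates)  # 4 16
```

Each dictionary pairs exactly one preprocessor with one classifier, so parameters of an estimator are only ever searched while that estimator is in the pipeline.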
Now initialize a pipeline object with the step names from pipeline_steps above:
# The PCA() and SVC() used here are just to initialize the pipeline,
# actual estimators will be used from our `param_grids_list`
pipe = Pipeline(steps=[('preprocessor',PCA()), ('classifier', SVC())])
Finally, set up the GridSearchCV object and fit the data:
grd = GridSearchCV(pipe, param_grid=param_grids_list)
grd.fit(X_train, y_train)