"Parallel" pipeline to get best model using gridsearch
Question
In sklearn, a serial pipeline can be defined to find the best combination of hyperparameters for all consecutive parts of the pipeline. A serial pipeline can be implemented as follows:
from sklearn.svm import SVC
from sklearn import decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
digits = datasets.load_digits()
X_train = digits.data
y_train = digits.target
# Use Principal Component Analysis to reduce dimensionality
# and improve generalization
pca = decomposition.PCA()
# Use a support vector classifier (the kernel is chosen by the grid below)
svm = SVC()
# Combine PCA and SVC into a pipeline
pipe = Pipeline(steps=[('pca', pca), ('svm', svm)])
# Hyperparameters to search, addressed as <step name>__<parameter name>
n_components = [20, 40, 64]
params_grid = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'pca__n_components': n_components,
}
But what if I want to try different algorithms for each step of the pipeline? How can I, for example, grid search over
(Principal Component Analysis OR Singular Value Decomposition) AND (Support Vector Machines OR Random Forest)?
This would require some kind of second-level or "meta" grid search, since the type of model would be one of the hyperparameters. Is that possible in sklearn?
Recommended answer
Pipeline supports None in its steps (the list of estimators), which lets you toggle certain parts of the pipeline off.
To skip an estimator, set its step name to None in the params passed to GridSearchCV.
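As a minimal sketch of what toggling a step off looks like outside of a grid search, the snippet below disables one step with set_params before fitting. It uses the string 'passthrough', which current scikit-learn documents as equivalent to None for this purpose; the step names and n_components=20 are illustrative assumptions, not part of the original answer.

```python
from sklearn import datasets, decomposition
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = datasets.load_digits(return_X_y=True)

pipe = Pipeline(steps=[
    ('pca', decomposition.PCA(n_components=20)),
    ('svd', decomposition.TruncatedSVD(n_components=10)),
    ('svm', SVC()),
])

# Switch the 'svd' step off: the output of 'pca' now goes straight into 'svm'.
pipe.set_params(svd='passthrough')
pipe.fit(X, y)

# The disabled step stays in the pipeline as a placeholder.
print(pipe.named_steps['svd'])
```

GridSearchCV applies its candidate parameters through the same set_params mechanism, which is why a key such as 'svd': [None] in the grid is enough to skip that step for a candidate.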
Let's assume you want to use PCA and TruncatedSVD.
pca = decomposition.PCA()
svd = decomposition.TruncatedSVD()
svm = SVC()
n_components = [20, 40, 64]
# Add svd to the pipeline
pipe = Pipeline(steps=[('pca', pca), ('svd', svd), ('svm', svm)])
# Change params_grid: instead of a dict, make it a list of dicts.
# In the first element, pass `svd = None`, and in the second `pca = None`
params_grid = [{
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'pca__n_components': n_components,
    'svd': [None]
},
{
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'pca': [None],
    'svd__n_components': n_components,
    'svd__algorithm': ['randomized']
}]
Now just pass the pipeline object to GridSearchCV:
grd = GridSearchCV(pipe, param_grid = params_grid)
Calling grd.fit() will search over both elements of the params_grid list, trying every combination of values from one dict at a time.
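To make "one dict at a time" concrete, the sketch below expands the same list of dicts with ParameterGrid, the helper GridSearchCV uses to enumerate candidates, and counts them without fitting anything. The svm_grid variable is only a shorthand introduced here to avoid repeating the shared SVM settings.

```python
from sklearn.model_selection import ParameterGrid

n_components = [20, 40, 64]
# Shared SVM settings (shorthand introduced for this sketch).
svm_grid = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
}
params_grid = [
    {**svm_grid, 'pca__n_components': n_components, 'svd': [None]},
    {**svm_grid, 'pca': [None], 'svd__n_components': n_components,
     'svd__algorithm': ['randomized']},
]

# Each dict is expanded on its own; the two expansions are concatenated.
candidates = list(ParameterGrid(params_grid))
n_pca = sum(1 for c in candidates if 'pca__n_components' in c)
n_svd = sum(1 for c in candidates if 'svd__n_components' in c)
print(len(candidates), n_pca, n_svd)  # 96 48 48
```

Each dict contributes 4 x 2 x 2 x 3 = 48 candidates, and no candidate mixes PCA settings with SVD settings.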
If both estimators in your "OR" share the same parameter names, as in this case where PCA and TruncatedSVD both have n_components (or if you only want to search over this parameter), this can be simplified to:
# Here I have changed the step name to `preprocessor`
pipe = Pipeline(steps=[('preprocessor', pca), ('svm', svm)])
# Now assign both estimators to `preprocessor` as below:
params_grid = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'preprocessor': [pca, svd],
    'preprocessor__n_components': n_components,
}
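After fitting, the estimator chosen for the shared step can be read back from best_params_. The following is a rough, self-contained sketch of that; the grid is deliberately smaller than the one above and cv=3 is an arbitrary choice, both made only to keep the run short.

```python
from sklearn import datasets, decomposition
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = datasets.load_digits(return_X_y=True)

pipe = Pipeline(steps=[('preprocessor', decomposition.PCA()), ('svm', SVC())])
# Reduced grid (an assumption for speed): 2 estimators x 2 sizes x 2 C values.
params_grid = {
    'preprocessor': [decomposition.PCA(), decomposition.TruncatedSVD()],
    'preprocessor__n_components': [20, 40],
    'svm__C': [1, 10],
}
grd = GridSearchCV(pipe, param_grid=params_grid, cv=3)
grd.fit(X, y)

# best_params_ holds the winning estimator object alongside its settings.
best = grd.best_params_
print(type(best['preprocessor']).__name__,
      best['preprocessor__n_components'],
      best['svm__C'])
```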
Generalization of this scheme
We can write a function that automatically populates the param_grid supplied to GridSearchCV with the appropriate values:
import itertools

def make_param_grids(steps, param_grids):
    final_params = []
    # itertools.product enumerates every combination, so that
    # (pca OR svd) AND (svm OR rf) will become ->
    # (pca, svm), (pca, rf), (svd, svm), (svd, rf)
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}
        # step_name and estimator_name should correspond,
        # i.e. preprocessor must be one of pca and select.
        for step_name, estimator_name in zip(steps.keys(), estimator_names):
            # .items() replaces the Python 2-only .iteritems()
            for param, value in param_grids.get(estimator_name).items():
                if param == 'object':
                    # Set actual estimator in pipeline
                    current_grid[step_name] = [value]
                else:
                    # Set parameters corresponding to above estimator
                    current_grid[step_name + '__' + param] = value
        # Append this dictionary to final params
        final_params.append(current_grid)
    return final_params
This function can then be used with any number of transformers and estimators:
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.svm import SVC

# Put all the estimators you want to "OR" under a single key:
# use OR between `pca` and `select`,
# use OR between `svm` and `rf`.
# Different keys are evaluated as serial steps in the pipeline.
pipeline_steps = {'preprocessor': ['pca', 'select'],
                  'classifier': ['svm', 'rf']}
# Fill in the parameters to be searched in this dict
all_param_grids = {'svm': {'object': SVC(),
                           'C': [0.1, 0.2]
                           },
                   'rf': {'object': RandomForestClassifier(),
                          'n_estimators': [10, 20]
                          },
                   'pca': {'object': PCA(),
                           'n_components': [10, 20]
                           },
                   'select': {'object': SelectKBest(),
                              'k': [5, 10]
                              }
                   }
# Call the function on the variables declared above
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
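To see what the helper produces, here is a self-contained Python 3 sketch that rebuilds the function (using .items()) and prints one line per generated dict. The variable name grids and the printing loop are additions for illustration; the step names and grids are the ones used above.

```python
import itertools

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.svm import SVC


def make_param_grids(steps, param_grids):
    """Build one grid dict per combination of estimators across steps."""
    final_params = []
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}
        for step_name, estimator_name in zip(steps.keys(), estimator_names):
            for param, value in param_grids[estimator_name].items():
                if param == 'object':
                    # The estimator itself becomes the value of the step.
                    current_grid[step_name] = [value]
                else:
                    # Its hyperparameters are prefixed with the step name.
                    current_grid[step_name + '__' + param] = value
        final_params.append(current_grid)
    return final_params


pipeline_steps = {'preprocessor': ['pca', 'select'],
                  'classifier': ['svm', 'rf']}
all_param_grids = {
    'svm': {'object': SVC(), 'C': [0.1, 0.2]},
    'rf': {'object': RandomForestClassifier(), 'n_estimators': [10, 20]},
    'pca': {'object': PCA(), 'n_components': [10, 20]},
    'select': {'object': SelectKBest(), 'k': [5, 10]},
}

grids = make_param_grids(pipeline_steps, all_param_grids)
for grid in grids:
    print(type(grid['preprocessor'][0]).__name__,
          type(grid['classifier'][0]).__name__,
          sorted(key for key in grid if '__' in key))
```

Two choices for each of the two steps give four dicts, one per (preprocessor, classifier) pair, each carrying only the hyperparameters of the estimators it names.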
Now initialize a pipeline object with the step names used in pipeline_steps above:
# The PCA() and SVC() used here are just to initialize the pipeline,
# actual estimators will be used from our `param_grids_list`
pipe = Pipeline(steps=[('preprocessor',PCA()), ('classifier', SVC())])
Finally, set up the GridSearchCV object and fit the data:
grd = GridSearchCV(pipe, param_grid=param_grids_list)
grd.fit(X_train, y_train)
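Once the search has run, cv_results_ can be grouped by estimator combination to compare the alternatives, not just read off the overall winner. The sketch below is one way to do that under stated assumptions: it uses a small synthetic dataset from make_classification, a hand-written two-entry grid list, and cv=3, all chosen here only to keep the example short and self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

pipe = Pipeline(steps=[('preprocessor', PCA()), ('classifier', SVC())])
param_grids_list = [
    {'preprocessor': [PCA()], 'preprocessor__n_components': [5, 10],
     'classifier': [SVC()], 'classifier__C': [0.1, 1.0]},
    {'preprocessor': [SelectKBest()], 'preprocessor__k': [5, 10],
     'classifier': [RandomForestClassifier(random_state=0)],
     'classifier__n_estimators': [10, 20]},
]
grd = GridSearchCV(pipe, param_grid=param_grids_list, cv=3)
grd.fit(X, y)

# Best mean cross-validation score per (preprocessor, classifier) pair.
best_by_combo = {}
results = grd.cv_results_
for params, score in zip(results['params'], results['mean_test_score']):
    combo = (type(params['preprocessor']).__name__,
             type(params['classifier']).__name__)
    best_by_combo[combo] = max(score, best_by_combo.get(combo, float('-inf')))

for combo, score in best_by_combo.items():
    print(combo, round(score, 3))
print(grd.best_params_)
```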