"Parallel" pipeline to get best model using gridsearch


Question

In sklearn, a serial pipeline can be defined to find the best combination of hyperparameters across all consecutive steps of the pipeline. A serial pipeline can be implemented as follows:

from sklearn.svm import SVC
from sklearn import decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
X_train = digits.data
y_train = digits.target

# Use Principal Component Analysis to reduce dimensionality
# and improve generalization
pca = decomposition.PCA()
# Use an SVC; the kernel itself is searched over in the grid below
svm = SVC()
# Combine PCA and SVC into a pipeline
pipe = Pipeline(steps=[('pca', pca), ('svm', svm)])
# Parameter grid: keys have the form <step name>__<parameter name>
n_components = [20, 40, 64]
params_grid = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'pca__n_components': n_components,
}
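The grid above is only defined, never searched. As a minimal sketch of how it could be run, the snippet below fits a GridSearchCV on the same digits data. The reduced grid (`small_grid`) and `cv=3` are assumptions chosen to keep the run short; they are not part of the original question.

```python
from sklearn import datasets, decomposition
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

digits = datasets.load_digits()
X_train, y_train = digits.data, digits.target

pipe = Pipeline(steps=[('pca', decomposition.PCA()), ('svm', SVC())])

# Reduced grid (assumption: smaller than the one above, for speed)
small_grid = {
    'svm__C': [1, 10],
    'svm__kernel': ['linear', 'rbf'],
    'pca__n_components': [20, 40],
}

search = GridSearchCV(pipe, param_grid=small_grid, cv=3)
search.fit(X_train, y_train)

# Best combination of step parameters and its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)
```

Every candidate here uses the same two algorithms (PCA followed by SVC); only their hyperparameters vary.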

But what if I want to try different algorithms for each step of the pipeline? How can I, for example, grid search over

Principal Component Analysis OR Singular Value Decomposition, AND Support Vector Machines OR Random Forest?

This would require some kind of second-level or "meta" grid search, since the type of model would itself be one of the hyperparameters. Is that possible in sklearn?

Recommended answer

Pipeline supports None in its steps (the list of estimators), which lets you toggle certain parts of the pipeline off.

To skip an estimator, set the parameter named after its step (as it appears in the pipeline's named_steps) to None in the params passed to GridSearchCV.
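To make the toggling concrete, here is a small sketch outside of any grid search. It assumes a three-step pipeline with illustrative component counts (20 for PCA, 10 for TruncatedSVD) and switches the `svd` step off with `set_params`. Recent scikit-learn versions also accept the string 'passthrough' for the same purpose.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Component counts are illustrative assumptions
pipe = Pipeline(steps=[('pca', PCA(n_components=20)),
                       ('svd', TruncatedSVD(n_components=10)),
                       ('svm', SVC())])

# Switch the 'svd' step off; only PCA -> SVC remains active
pipe.set_params(svd=None)
pipe.fit(X, y)

# The SVC now receives the 20 PCA components directly
# (n_features_in_ is available in scikit-learn 0.24 and later)
n_features_seen_by_svm = pipe.named_steps['svm'].n_features_in_
print(n_features_seen_by_svm)
```

GridSearchCV applies each candidate through the same `set_params` mechanism, which is why a grid entry such as `'svd': [None]` disables that step.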

Let's assume you want to use either PCA or TruncatedSVD.

pca = decomposition.PCA()
svd = decomposition.TruncatedSVD()
svm = SVC()
n_components = [20, 40, 64]

Add svd to the pipeline:

pipe = Pipeline(steps=[('pca', pca), ('svd', svd), ('svm', svm)])

# Change params_grid: instead of a dict, make it a list of dicts.
# In the first element, pass `svd = None`, and in the second `pca = None`
params_grid = [{
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'pca__n_components': n_components,
    'svd': [None]
},
{
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'pca': [None],
    'svd__n_components': n_components,
    'svd__algorithm': ['randomized']
}]

Now just pass the pipeline object to GridSearchCV:

grd = GridSearchCV(pipe, param_grid = params_grid)

Calling grd.fit() searches over both elements of the params_grid list, one dict at a time, trying every combination of values within each dict.
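One way to check what a list of dicts expands to, without fitting anything, is `ParameterGrid`, the class GridSearchCV uses to enumerate candidates. The sketch below rebuilds the two-dict grid (the shared `svm_params` dict is only a convenience introduced here) and counts the candidates.

```python
from sklearn.model_selection import ParameterGrid

n_components = [20, 40, 64]

# SVC parameters shared by both grids (helper dict, not in the original)
svm_params = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
}

params_grid = [
    # PCA active, SVD switched off
    {**svm_params, 'pca__n_components': n_components, 'svd': [None]},
    # SVD active, PCA switched off
    {**svm_params, 'pca': [None], 'svd__n_components': n_components,
     'svd__algorithm': ['randomized']},
]

candidates = list(ParameterGrid(params_grid))
print(len(candidates))  # 4 * 2 * 2 * 3 = 48 per dict, 96 in total
```

No candidate mixes keys from the two dicts, so PCA parameters are never combined with SVD parameters.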

If both estimators in your "OR" share the same parameter names, as here where PCA and TruncatedSVD both have n_components (or if that is the only parameter you want to search over), this can be simplified to:

# Here I have changed the step name to `preprocessor`
pipe = Pipeline(steps=[('preprocessor', pca), ('svm', svm)])

# Now assign both estimators to `preprocessor` as below:
params_grid = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'preprocessor': [pca, svd],
    'preprocessor__n_components': n_components,
}
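As a hedged end-to-end sketch of this simplified form, the snippet below runs a search where the `preprocessor` step itself is a grid value and then checks which estimator was selected. The smaller grid and `cv=3` are assumptions made to keep the example fast.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

pipe = Pipeline(steps=[('preprocessor', PCA()), ('svm', SVC())])

# Reduced grid (assumption): the step itself is one of the searched values
grid = {
    'preprocessor': [PCA(), TruncatedSVD()],
    'preprocessor__n_components': [20, 40],
    'svm__C': [1, 10],
}

search = GridSearchCV(pipe, param_grid=grid, cv=3).fit(X, y)

# The refitted best pipeline holds whichever preprocessor scored best
chosen = search.best_estimator_.named_steps['preprocessor']
print(type(chosen).__name__, chosen.n_components)
```

This works because Pipeline replaces the step first and then applies the `preprocessor__...` parameters to the new estimator.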

Generalization of this scheme

We can write a function that automatically builds the param_grid to pass to GridSearchCV from the appropriate values:

import itertools

def make_param_grids(steps, param_grids):

    final_params = []

    # itertools.product builds the Cartesian product, so that
    # (pca OR select) AND (svm OR rf) becomes ->
    # (pca, svm), (pca, rf), (select, svm), (select, rf)
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}

        # step_name and estimator_name should correspond,
        # i.e. preprocessor must be one of pca and select.
        for step_name, estimator_name in zip(steps.keys(), estimator_names):
            # dict.items() replaces the Python 2 only iteritems()
            for param, value in param_grids.get(estimator_name).items():
                if param == 'object':
                    # Set the actual estimator in the pipeline
                    current_grid[step_name] = [value]
                else:
                    # Set parameters corresponding to the above estimator
                    current_grid[step_name + '__' + param] = value
        # Append this dictionary to the final params
        final_params.append(current_grid)

    return final_params
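The core of the function is the Cartesian product over the per-step choices. A minimal standalone illustration, using the same step and estimator names as the usage example in this answer:

```python
import itertools

# One list of candidate estimator names per pipeline step
steps = {'preprocessor': ['pca', 'select'],
         'classifier': ['svm', 'rf']}

# One tuple per combination, in the order of the dict's keys
combos = list(itertools.product(*steps.values()))
print(combos)
# [('pca', 'svm'), ('pca', 'rf'), ('select', 'svm'), ('select', 'rf')]

# Pair each chosen estimator name with the step it belongs to
pairs = [dict(zip(steps.keys(), combo)) for combo in combos]
print(pairs[0])
# {'preprocessor': 'pca', 'classifier': 'svm'}
```

Each combination becomes one dict in the list passed to GridSearchCV, so parameters of estimators that are not part of a given combination never appear in its grid.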

Then use this function with any number of transformers and estimators:

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.svm import SVC

# Add all the estimators you want to "OR" under a single key:
# use OR between `pca` and `select`,
# use OR between `svm` and `rf`.
# Different keys are evaluated as serial steps in the pipeline.
pipeline_steps = {'preprocessor': ['pca', 'select'],
                  'classifier': ['svm', 'rf']}

# Fill in the parameters to be searched in this dict
all_param_grids = {'svm': {'object': SVC(),
                           'C': [0.1, 0.2]
                          },

                   'rf': {'object': RandomForestClassifier(),
                          'n_estimators': [10, 20]
                         },

                   'pca': {'object': PCA(),
                           'n_components': [10, 20]
                          },

                   'select': {'object': SelectKBest(),
                              'k': [5, 10]
                             }
                  }


# Call the function on the variables declared above
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)

Now create the pipeline using the step names from pipeline_steps above:

# The PCA() and SVC() used here are just to initialize the pipeline,
# actual estimators will be used from our `param_grids_list`
pipe = Pipeline(steps=[('preprocessor',PCA()), ('classifier', SVC())])  

Finally, set up the GridSearchCV object and fit the data:

grd = GridSearchCV(pipe, param_grid=param_grids_list)
# X_train and y_train are the digits data loaded in the question
grd.fit(X_train, y_train)
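To see the whole scheme run end to end, here is a self-contained sketch. It restates the helper in compact form and makes several assumptions that are not in the original answer: the iris dataset (small enough to run quickly), `cv=3`, a fixed `random_state`, and parameter values scaled down to fit iris's four features.

```python
import itertools

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def make_param_grids(steps, param_grids):
    """Compact restatement of the helper: one grid per estimator combination."""
    final_params = []
    for names in itertools.product(*steps.values()):
        grid = {}
        for step_name, name in zip(steps, names):
            for param, value in param_grids[name].items():
                if param == 'object':
                    grid[step_name] = [value]  # the estimator itself
                else:
                    grid[step_name + '__' + param] = value
        final_params.append(grid)
    return final_params


pipeline_steps = {'preprocessor': ['pca', 'select'],
                  'classifier': ['svm', 'rf']}

# Parameter values are assumptions sized for iris (4 features)
all_param_grids = {
    'svm': {'object': SVC(), 'C': [0.1, 1.0]},
    'rf': {'object': RandomForestClassifier(random_state=0),
           'n_estimators': [10, 20]},
    'pca': {'object': PCA(), 'n_components': [2, 3]},
    'select': {'object': SelectKBest(), 'k': [2, 3]},
}

X, y = load_iris(return_X_y=True)
pipe = Pipeline(steps=[('preprocessor', PCA()), ('classifier', SVC())])
grids = make_param_grids(pipeline_steps, all_param_grids)

search = GridSearchCV(pipe, param_grid=grids, cv=3).fit(X, y)

# 4 estimator combinations x (2 x 2) parameter settings each = 16 candidates
n_candidates = len(search.cv_results_['params'])
print(n_candidates)

best = search.best_estimator_
print(type(best.named_steps['preprocessor']).__name__,
      type(best.named_steps['classifier']).__name__)
```

The winning combination of algorithms can be read from `best_estimator_.named_steps`, and per-candidate scores are available in `cv_results_`.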
