"Parallel" pipeline to get the best model using grid search

Question

In sklearn, a serial pipeline can be defined so that a grid search finds the best combination of hyperparameters for all consecutive steps of the pipeline. A serial pipeline can be implemented as follows:

from sklearn.svm import SVC
from sklearn import decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
X_train = digits.data
y_train = digits.target

# Use Principal Component Analysis to reduce dimensionality
# and improve generalization
pca = decomposition.PCA()
# Use a support vector classifier; the kernel is chosen by the grid below
svm = SVC()
# Combine PCA and SVC into a pipeline
pipe = Pipeline(steps=[('pca', pca), ('svm', svm)])
# Hyperparameters to search, addressed as <step name>__<parameter>
n_components = [20, 40, 64]
params_grid = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'pca__n_components': n_components,
}
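
To get a feel for how large this search is, the grid can be expanded without fitting anything. The following is only a sketch: it uses sklearn's ParameterGrid helper, which is not part of the original question, to enumerate the combinations GridSearchCV would try.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the question.
params_grid = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'pca__n_components': [20, 40, 64],
}

# ParameterGrid expands the dict into every parameter combination
# (GridSearchCV then fits each one once per cross-validation fold).
candidates = list(ParameterGrid(params_grid))
n_candidates = len(candidates)

print(n_candidates)   # 4 * 2 * 2 * 3 combinations
print(candidates[0])  # one concrete setting, keyed by <step>__<param>
```

Every added value multiplies the number of candidates, which is worth keeping in mind before adding whole estimators as a further dimension.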

But what if I want to try different algorithms for each step of the pipeline? How can I, for example, grid search over

(Principal Component Analysis OR Singular Value Decomposition) AND (Support Vector Machine OR Random Forest)?

This would require some kind of second-level search or "meta-gridsearch", since the type of model would itself be one of the hyperparameters. Is that possible in sklearn?

Recommended answer

Pipeline supports None as a step in its steps (the list of estimators), which lets you switch off certain parts of the pipeline.

You can set a named step of the pipeline to None through the parameters passed to GridSearchCV, so that the estimator in that step is not used.
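
As a rough illustration of the toggle on its own, outside any grid search, the sketch below switches a step off with set_params. The 300-sample subset and the specific n_components values are assumptions chosen only to keep the example fast and to make the effect visible.

```python
from sklearn import datasets, decomposition
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = datasets.load_digits(return_X_y=True)
X, y = X[:300], y[:300]  # small subset, only to keep the sketch quick

pipe = Pipeline(steps=[
    ('pca', decomposition.PCA(n_components=20)),
    ('svd', decomposition.TruncatedSVD(n_components=10)),
    ('svm', SVC(kernel='linear')),
])

# Switch the 'svd' step off: the data now flows pca -> svm directly.
pipe.set_params(svd=None)
pipe.fit(X, y)

svd_step = pipe.named_steps['svd']
# The SVM was trained on the 20 PCA components, not on 10 SVD components.
n_features_seen_by_svm = pipe.named_steps['svm'].support_vectors_.shape[1]
print(svd_step, n_features_seen_by_svm)
```

GridSearchCV applies its candidate parameters through the same set_params mechanism, which is why a grid entry such as 'svd': [None] has this effect.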

Let's assume you want to use either PCA or TruncatedSVD.

pca = decomposition.PCA()
svd = decomposition.TruncatedSVD()
svm = SVC()
n_components = [20, 40, 64]

Add svd to the pipeline:

pipe = Pipeline(steps=[('pca', pca), ('svd', svd), ('svm', svm)])

# Change params_grid: instead of a dict, make it a list of dicts.
# In the first element set `svd` to None, and in the second set `pca` to None.
params_grid = [{
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'pca__n_components': n_components,
    'svd': [None],
},
{
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'pca': [None],
    'svd__n_components': n_components,
    'svd__algorithm': ['randomized'],
}]

Now just pass the pipeline object to GridSearchCV:

grd = GridSearchCV(pipe, param_grid = params_grid)

Calling grd.fit() will search over both elements of the params_grid list, using all values from one dictionary at a time.
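
To see what "one dictionary at a time" means in practice, the list can be expanded with ParameterGrid (a helper not used in the original answer). This sketch only enumerates the candidates and fits nothing; the svm_params helper dict is introduced here purely to avoid repetition.

```python
from sklearn.model_selection import ParameterGrid

n_components = [20, 40, 64]
svm_params = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
}
params_grid = [
    # First sub-grid: PCA active, SVD switched off.
    {**svm_params, 'pca__n_components': n_components, 'svd': [None]},
    # Second sub-grid: SVD active, PCA switched off.
    {**svm_params, 'pca': [None], 'svd__n_components': n_components,
     'svd__algorithm': ['randomized']},
]

# Each dict is expanded separately; values are never mixed across dicts.
candidates = list(ParameterGrid(params_grid))
n_total = len(candidates)
n_pca_only = sum('svd' in c and c['svd'] is None for c in candidates)
n_svd_only = sum('pca' in c and c['pca'] is None for c in candidates)
print(n_total, n_pca_only, n_svd_only)
```

Because the sub-grids are expanded independently, no candidate ever runs PCA and TruncatedSVD back to back.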

If both estimators in your "OR" share parameter names, as here where PCA and TruncatedSVD both have n_components (or if you only want to search over this parameter), this can be simplified as:

# Here I have changed the step name to `preprocessor`
pipe = Pipeline(steps=[('preprocessor', pca), ('svm', svm)])

# Now assign both estimators to `preprocessor` as below:
params_grid = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'preprocessor': [pca, svd],
    'preprocessor__n_components': n_components,
}
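
A minimal end-to-end sketch of this simplified form follows. The reduced grid, the 300-sample subset, the fixed linear kernel and cv=3 are assumptions made only so that it runs in a few seconds; they are not part of the original answer.

```python
from sklearn import datasets, decomposition
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = datasets.load_digits(return_X_y=True)
X, y = X[:300], y[:300]  # small subset, only to keep the sketch quick

pipe = Pipeline(steps=[('preprocessor', decomposition.PCA()),
                       ('svm', SVC(kernel='linear'))])

# The estimator in the 'preprocessor' slot is itself a searched value.
params_grid = {
    'preprocessor': [decomposition.PCA(), decomposition.TruncatedSVD()],
    'preprocessor__n_components': [10, 20],
    'svm__C': [1, 10],
}

grd = GridSearchCV(pipe, param_grid=params_grid, cv=3)
grd.fit(X, y)

# Which reducer won, and with how many components?
best_reducer = type(grd.best_params_['preprocessor']).__name__
n_candidates = len(grd.cv_results_['params'])
print(best_reducer, grd.best_params_['preprocessor__n_components'])
```

After fitting, grd.best_params_ holds the winning estimator object alongside its hyperparameters, so the chosen algorithm can be read off like any other parameter.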

Generalizing this approach

We can write a function that automatically populates the param_grid to be supplied to GridSearchCV with the appropriate values:

import itertools

def make_param_grids(steps, param_grids):

    final_params = []

    # itertools.product builds the Cartesian product, such that
    # (pca OR svd) AND (svm OR rf) will become ->
    # (pca, svm), (pca, rf), (svd, svm), (svd, rf)
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}

        # step_name and estimator_name must correspond,
        # i.e. the preprocessor must be one of pca and select.
        for step_name, estimator_name in zip(steps.keys(), estimator_names):
            for param, value in param_grids.get(estimator_name).items():
                if param == 'object':
                    # Set the actual estimator in the pipeline
                    current_grid[step_name] = [value]
                else:
                    # Set the parameters corresponding to the above estimator
                    current_grid[step_name + '__' + param] = value
        # Append this dictionary to the final params
        final_params.append(current_grid)

    return final_params

Then use this function on any number of transformers and estimators:

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest

# Put all the estimators you want to "OR" under a single key:
# use OR between `pca` and `select`,
# use OR between `svm` and `rf`.
# Different keys are evaluated as serial steps in the pipeline.
pipeline_steps = {'preprocessor': ['pca', 'select'],
                  'classifier': ['svm', 'rf']}

# Fill in the parameters to be searched in this dict
all_param_grids = {'svm': {'object': SVC(),
                           'C': [0.1, 0.2]
                          },

                   'rf': {'object': RandomForestClassifier(),
                          'n_estimators': [10, 20]
                         },

                   'pca': {'object': PCA(),
                           'n_components': [10, 20]
                          },

                   'select': {'object': SelectKBest(),
                              'k': [5, 10]
                             }
                  }


# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)

Now initialize a pipeline object with the same step names as used in pipeline_steps above:

# The PCA() and SVC() used here are just to initialize the pipeline,
# actual estimators will be used from our `param_grids_list`
pipe = Pipeline(steps=[('preprocessor',PCA()), ('classifier', SVC())])  

Finally, set up the GridSearchCV object and fit the data:

grd = GridSearchCV(pipe, param_grid = param_grids_list)
grd.fit(X_train, y_train)
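
Once such a search has been fitted, the per-combination results can be read from cv_results_. The sketch below is one possible way to do that, not part of the original answer: it builds the four sub-grids by hand instead of calling make_param_grids, and the 300-sample subset, cv=3 and random_state=0 are assumptions for speed and repeatability. SelectKBest may emit warnings about constant features on the digits data.

```python
import itertools

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = datasets.load_digits(return_X_y=True)
X, y = X[:300], y[:300]  # small subset, only to keep the sketch quick

# (estimator, its own parameter grid) for each alternative of each step.
preprocessors = [(PCA(), {'preprocessor__n_components': [10, 20]}),
                 (SelectKBest(), {'preprocessor__k': [5, 10]})]
classifiers = [(SVC(), {'classifier__C': [0.1, 0.2]}),
               (RandomForestClassifier(random_state=0),
                {'classifier__n_estimators': [10, 20]})]

# One sub-grid per (preprocessor, classifier) pair.
param_grids_list = [
    {'preprocessor': [pre], 'classifier': [clf], **pre_params, **clf_params}
    for (pre, pre_params), (clf, clf_params)
    in itertools.product(preprocessors, classifiers)
]

pipe = Pipeline(steps=[('preprocessor', PCA()), ('classifier', SVC())])
grd = GridSearchCV(pipe, param_grid=param_grids_list, cv=3)
grd.fit(X, y)

# Best mean cross-validation score for each algorithm combination.
best_by_combo = {}
for params, score in zip(grd.cv_results_['params'],
                         grd.cv_results_['mean_test_score']):
    combo = (type(params['preprocessor']).__name__,
             type(params['classifier']).__name__)
    best_by_combo[combo] = max(best_by_combo.get(combo, 0.0), score)

for combo, score in sorted(best_by_combo.items()):
    print(combo, round(score, 3))
```

This gives one score per (preprocessor, classifier) pair, which is the comparison the "meta-gridsearch" in the question was after.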
