Put customized functions in Sklearn pipeline


Problem Description


In my classification scheme, there are several steps including:

  1. SMOTE (Synthetic Minority Over-sampling Technique)
  2. Fisher criteria for feature selection
  3. Standardization (Z-score normalisation)
  4. SVC (Support Vector Classifier)

The main parameters to be tuned in the scheme above are percentile (2.) and hyperparameters for SVC (4.) and I want to go through grid search for tuning.
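
For reference, the candidate values referred to below as percentiles and parameter_comb might be defined like this (a sketch; the specific values are illustrative assumptions, not from the original post):

    percentiles = [10, 20, 30, 40, 50]      # percent of features kept by the Fisher criterion
    parameter_comb = [{'C': C, 'gamma': g}  # SVC hyperparameter combinations
                      for C in [0.1, 1, 10]
                      for g in [0.01, 0.1, 1]]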

The current solution builds a "partial" pipeline covering steps 3 and 4 of the scheme, clf = Pipeline([('normal',preprocessing.StandardScaler()),('svc',svm.SVC(class_weight='auto'))]), and breaks the scheme into two parts:

  1. Tune the percentile of features to keep through the first grid search

    skf = StratifiedKFold(y)
    for train_ind, test_ind in skf:
        X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
        # SMOTE synthesizes the training data (we want to keep test data intact)
        X_train, y_train = SMOTE(X_train, y_train)
        for percentile in percentiles:
            # Fisher returns the indices of the selected features specified by the parameter 'percentile'
            selected_ind = Fisher(X_train, y_train, percentile) 
            X_train_selected, X_test_selected = X_train[selected_ind,:], X_test[selected_ind, :]
            model = clf.fit(X_train_selected, y_train)
            y_predict = model.predict(X_test_selected)
            f1 = f1_score(y_predict, y_test)
    

    The f1 scores will be stored and then averaged across all fold partitions for each percentile, and the percentile with the best CV score is returned (a minimal sketch of this selection step appears after the two loops below). The purpose of putting the 'percentile for loop' as the inner loop is to allow fair competition, since we then have the same training data (including synthesized data) across all fold partitions for all percentiles.

  2. After determining the percentile, tune the hyperparameters by second grid search

    skf = StratifiedKFold(y)
    for train_ind, test_ind in skf:
        X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
        # SMOTE synthesizes the training data (we want to keep test data intact)
        X_train, y_train = SMOTE(X_train, y_train)
        for parameters in parameter_comb:
            # Select the features based on the tuned percentile
            selected_ind = Fisher(X_train, y_train, best_percentile) 
            X_train_selected, X_test_selected = X_train[selected_ind,:], X_test[selected_ind, :]
            clf.set_params(svc__C=parameters['C'], svc__gamma=parameters['gamma'])
            model = clf.fit(X_train_selected, y_train)
            y_predict = model.predict(X_test_selected)
            f1 = f1_score(y_predict, y_test)
    

It is done in a very similar way, except that we tune the hyperparameters for the SVC rather than the percentile of features to select.
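
For concreteness, the score aggregation described in part 1 could look like the following sketch (it assumes each fold's f1 is appended to a per-percentile list inside the first nested loop above):

    import numpy as np

    # scores[p] collects one f1 score per fold for percentile p;
    # inside the first nested loop: scores[percentile].append(f1)
    scores = {p: [] for p in percentiles}

    # Average over folds and keep the percentile with the best CV score
    mean_scores = {p: np.mean(s) for p, s in scores.items()}
    best_percentile = max(mean_scores, key=mean_scores.get)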

My questions are:

  1. In the current solution, I only involve 3. and 4. in the clf and do 1. and 2. kinda "manually" in two nested loops as described above. Is there any way to include all four steps in a pipeline and do the whole process at once?

  2. If it is okay to keep the first nested loop, then is it possible (and how) to simplify the next nested loop using a single pipeline

    clf_all = Pipeline([('smote', SMOTE()),
                        ('fisher', Fisher(percentile=best_percentile)),
                        ('normal',preprocessing.StandardScaler()),
                        ('svc',svm.SVC(class_weight='auto'))]) 
    

    and simply use GridSearchCV(clf_all, parameter_comb) for tuning?

    Please note that both SMOTE and Fisher (ranking criteria) have to be done only for the training data in each fold partition.

Any comments would be much appreciated.

SMOTE and Fisher are shown below:

from numpy import shape, argsort, ceil

def Fscore(X, y, percentile=None):
    X_pos, X_neg = X[y==1], X[y==0]
    X_mean = X.mean(axis=0)
    X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # Pooled per-class variance (denominator of the Fisher score)
    deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
    num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
    F = num/deno
    # Rank features by descending Fisher score and keep the top 'percentile' percent
    sort_F = argsort(F)[::-1]
    n_feature = (float(percentile)/100)*shape(X)[1]
    ind_feature = sort_F[:int(ceil(n_feature))]
    return ind_feature

SMOTE is from https://github.com/blacklab/nyan/blob/master/shared_modules/smote.py; it returns the synthesized data. I modified it to return the original input data stacked with the synthesized data, along with the original labels and the labels of the synthesized samples.

def smote(X, y):
    n_pos, n_neg = sum(y==1), sum(y==0)
    n_syn = (n_neg-n_pos)/float(n_pos) 
    X_pos = X[y==1]
    X_syn = SMOTE(X_pos, int(round(n_syn))*100, 5)
    y_syn = np.ones(shape(X_syn)[0])
    X, y = np.vstack([X, X_syn]), np.concatenate([y, y_syn])
    return(X, y)

Solution

I don't know where your SMOTE() and Fisher() functions are coming from, but the answer is yes, you can definitely do this. In order to do so you will need to write a wrapper class around those functions. The easiest way to do this is to inherit from sklearn's BaseEstimator and TransformerMixin classes; see this example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html

If this isn't making sense to you, post the details of at least one of your functions (the library it comes from or your code if you wrote it yourself) and we can go from there.
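
For illustration, a minimal skeleton of such a wrapper might look like the sketch below (CustomStep, param, and state_ are placeholder names, not from any library; a full Fisher example follows in the edit):

    from sklearn.base import BaseEstimator, TransformerMixin

    class CustomStep(BaseEstimator, TransformerMixin):
        """Hypothetical wrapper that exposes a plain function as a pipeline step."""
        def __init__(self, param=None):
            # Store hyperparameters under their constructor names so that
            # GridSearchCV can read and set them via get_params()/set_params().
            self.param = param

        def fit(self, X, y=None):
            # Learn whatever transform() will need and store it on self.
            self.state_ = None
            return self

        def transform(self, X):
            # Apply the learned state to X and return the transformed matrix.
            return X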

EDIT:

I apologize, I didn't look at your functions closely enough to realize that they transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations to your target, so you will have to do them beforehand, as you originally were. For your reference, here is what it would look like to write a custom class for your Fisher process, which would work if the function itself did not need to affect your target variable.

>>> from sklearn.base import BaseEstimator, TransformerMixin
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.datasets import load_iris
>>> 
>>> class Fisher(BaseEstimator, TransformerMixin):
...     def __init__(self,percentile=0.95):
...             self.percentile = percentile
...     def fit(self, X, y):
...             from numpy import shape, argsort, ceil
...             X_pos, X_neg = X[y==1], X[y==0]
...             X_mean = X.mean(axis=0)
...             X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
...             deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
...             num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
...             F = num/deno
...             sort_F = argsort(F)[::-1]
...             n_feature = (float(self.percentile)/100)*shape(X)[1]
...             self.ind_feature = sort_F[:ceil(n_feature)]
...             return self
...     def transform(self, x):
...             return x[self.ind_feature,:]
... 
>>> 
>>> data = load_iris()
>>> 
>>> pipeline = Pipeline([
...     ('fisher', Fisher()),
...     ('normal',StandardScaler()),
...     ('svm',SVC(class_weight='auto'))
... ])
>>> 
>>> grid = {
...     'fisher__percentile':[0.75,0.50],
...     'svm__C':[1,2]
... }
>>> 
>>> model = GridSearchCV(estimator = pipeline, param_grid=grid, cv=2)
>>> model.fit(data.data,data.target)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 596, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 378, in _fit
    for parameters in parameter_iterable
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 653, in __call__
    self.dispatch(function, args, kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 400, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 138, in __init__
    self.results = func(*args, **kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1239, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
    self.steps[-1][-1].fit(Xt, y, **fit_params)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/svm/base.py", line 149, in fit
    (X.shape[0], y.shape[0]))
ValueError: X and y have incompatible shapes.
X has 1 samples, but y has 75.
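
As a side note, the ValueError above is triggered by transform() indexing rows instead of columns: self.ind_feature holds feature indices, so x[self.ind_feature,:] returns one row per selected feature (a single row here) while y still has 75 samples. A possible fix, assuming column-wise feature selection is what is intended, is to rewrite the transform method of the Fisher class above as:

    def transform(self, x):
        # Select the feature columns found by fit(), keeping all samples
        return x[:, self.ind_feature]

(Note also that the grid values 0.75 and 0.50, combined with the division by 100 inside fit, keep only ceil(0.03) = 1 of iris's 4 features; if percentile is meant as a fraction, that division should be dropped.) With the column indexing fixed and SMOTE applied to the training folds beforehand, the Fisher/StandardScaler/SVC part of the scheme fits in a single Pipeline and can be tuned with GridSearchCV.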
