Put customized functions in Sklearn pipeline

Question

In my classification scheme, there are several steps including:

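  1. SMOTE (Synthetic Minority Over-sampling Technique)
  2. Fisher criterion for feature selection
  3. Standardization (Z-score normalisation)
  4. SVC (Support Vector Classifier)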

The main parameters to be tuned in the scheme above are percentile (2.) and hyperparameters for SVC (4.) and I want to go through grid search for tuning.

The current solution builds a "partial" pipeline including step 3 and 4 in the scheme clf = Pipeline([('normal',preprocessing.StandardScaler()),('svc',svm.SVC(class_weight='auto'))]) and breaks the scheme into two parts:

1) Tune the percentile of features to keep through the first grid search

skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
    X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
    # SMOTE synthesizes the training data (we want to keep test data intact)
    X_train, y_train = SMOTE(X_train, y_train)
    for percentile in percentiles:
        # Fisher returns the indices of the selected features specified by the parameter 'percentile'
        selected_ind = Fisher(X_train, y_train, percentile) 
        # Select columns (features), not rows (samples)
        X_train_selected, X_test_selected = X_train[:, selected_ind], X_test[:, selected_ind]
        model = clf.fit(X_train_selected, y_train)
        y_predict = model.predict(X_test_selected)
        f1 = f1_score(y_test, y_predict)  # argument order is (y_true, y_pred)

The f1 scores will be stored and then be averaged through all fold partitions for all percentiles, and the percentile with the best CV score is returned. The purpose of putting 'percentile for loop' as the inner loop is to allow fair competition as we have the same training data (including synthesized data) across all fold partitions for all percentiles.
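A minimal runnable sketch of the bookkeeping described above may help. It is written against the current `sklearn.model_selection` API rather than the older interface used in the question, and it makes several assumptions that are not in the original: `oversample_minority` is a hypothetical stand-in for SMOTE (it only duplicates minority rows), `SelectPercentile(f_classif)` stands in for the Fisher ranking, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def oversample_minority(X, y, rng):
    """Hypothetical stand-in for SMOTE: duplicate random minority rows until balanced."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_extra = len(neg) - len(pos)
    if n_extra <= 0:
        return X, y
    extra = rng.choice(pos, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])


# Synthetic imbalanced data, for illustration only.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)
rng = np.random.default_rng(0)
percentiles = [25, 50, 75, 100]
scores = {p: [] for p in percentiles}  # one list of fold scores per percentile

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for train_ind, test_ind in skf.split(X, y):
    # Resample the training fold once, so every percentile sees the same data.
    X_train, y_train = oversample_minority(X[train_ind], y[train_ind], rng)
    X_test, y_test = X[test_ind], y[test_ind]  # test fold is left untouched
    for p in percentiles:
        model = Pipeline([
            ("select", SelectPercentile(f_classif, percentile=p)),
            ("normal", StandardScaler()),
            ("svc", SVC()),
        ])
        model.fit(X_train, y_train)
        scores[p].append(f1_score(y_test, model.predict(X_test)))

# Average over folds and keep the percentile with the best mean F1.
mean_scores = {p: float(np.mean(s)) for p, s in scores.items()}
best_percentile = max(mean_scores, key=mean_scores.get)
```

Keeping the percentile loop inside the fold loop is what gives every candidate the same resampled training data within a fold.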

2) After determining the percentile, tune the hyperparameters by second grid search

skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
    X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
    # SMOTE synthesizes the training data (we want to keep test data intact)
    X_train, y_train = SMOTE(X_train, y_train)
    for parameters in parameter_comb:
        # Select the features based on the tuned percentile
        selected_ind = Fisher(X_train, y_train, best_percentile) 
        # Select columns (features), not rows (samples)
        X_train_selected, X_test_selected = X_train[:, selected_ind], X_test[:, selected_ind]
        clf.set_params(svc__C=parameters['C'], svc__gamma=parameters['gamma'])
        model = clf.fit(X_train_selected, y_train)
        y_predict = model.predict(X_test_selected)
        f1 = f1_score(y_test, y_predict)  # argument order is (y_true, y_pred)

This is done in a very similar way, except that we tune the hyperparameters for SVC rather than the percentile of features to select.

My questions are:

I) In the current solution, I only involve 3. and 4. in the clf and do 1. and 2. kinda "manually" in two nested loops as described above. Is there any way to include all four steps in a pipeline and do the whole process at once?

II) If it is okay to keep the first nested loop, then is it possible (and how) to simplify the next nested loop using a single pipeline

clf_all = Pipeline([('smote', SMOTE()),
                    ('fisher', Fisher(percentile=best_percentile)),
                    ('normal',preprocessing.StandardScaler()),
                    ('svc',svm.SVC(class_weight='auto'))])

and simply use GridSearchCV(clf_all, parameter_comb) for tuning?

Please note that both SMOTE and Fisher (ranking criteria) have to be done only for the training data in each fold partition.

Any comments would be appreciated.

EDIT: SMOTE and Fisher are shown below:

from numpy import shape, argsort, ceil

def Fscore(X, y, percentile=None):
    # Split samples by class label (1 = positive, 0 = negative)
    X_pos, X_neg = X[y==1], X[y==0]
    X_mean = X.mean(axis=0)
    X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
    deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
    num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
    F = num/deno
    # Rank features from highest to lowest score
    sort_F = argsort(F)[::-1]
    n_feature = (float(percentile)/100)*shape(X)[1]
    # Slice bounds must be integers
    ind_feature = sort_F[:int(ceil(n_feature))]
    return ind_feature

SMOTE is from https://github.com/blacklab/nyan/blob/master/shared_modules/smote.py; it returns the synthesized data. I modified it to return the original input data stacked with the synthesized data, along with the original labels and the synthesized ones.

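import numpy as np

def smote(X, y):
    n_pos, n_neg = sum(y==1), sum(y==0)
    # Number of synthetic samples wanted per minority (positive) sample
    n_syn = (n_neg-n_pos)/float(n_pos)
    X_pos = X[y==1]
    # SMOTE here is the function from the linked smote.py
    X_syn = SMOTE(X_pos, int(round(n_syn))*100, 5)
    y_syn = np.ones(np.shape(X_syn)[0])
    # Stack original and synthetic samples, and their labels
    X, y = np.vstack([X, X_syn]), np.concatenate([y, y_syn])
    return X, y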
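As a rough worked example of the arithmetic in this wrapper, the sketch below computes the percentage it passes to SMOTE for illustrative class counts. The helper name `smote_percentage` is hypothetical, and the final count assumes that SMOTE emits N/100 synthetic samples per minority row, which is how the original algorithm is usually described but is not confirmed here for the linked implementation.

```python
def smote_percentage(n_pos, n_neg):
    """Return the percentage argument the wrapper above passes to SMOTE."""
    n_syn = (n_neg - n_pos) / float(n_pos)  # synthetic samples wanted per minority sample
    return int(round(n_syn)) * 100


n_pos, n_neg = 20, 100                      # illustrative class counts
N = smote_percentage(n_pos, n_neg)          # (100 - 20) / 20 = 4.0, so N = 400
n_synthetic = (N // 100) * n_pos            # assumption: N/100 new samples per minority row
print(N, n_synthetic, n_pos + n_synthetic)  # 400 80 100
```

Because of the rounding, the classes come out exactly balanced only when the negative count is a whole multiple of the positive count; with 30 positives and 100 negatives, for instance, the percentage is 200 and the positive class would end at 90.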

Recommended answer

I don't know where your SMOTE() and Fisher() functions are coming from, but the answer is yes, you can definitely do this. To do so, you will need to write a wrapper class around those functions. The easiest way to do this is to inherit from sklearn's BaseEstimator and TransformerMixin classes; see this for an example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
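For orientation, here is a minimal sketch of the wrapper pattern. `TopVarianceSelector` is a hypothetical transformer invented for illustration, not something from the question or from sklearn itself. It stores its constructor argument unchanged, learns state in `fit`, and applies it in `transform`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class TopVarianceSelector(BaseEstimator, TransformerMixin):
    """Hypothetical example: keep the n_keep columns with the largest variance."""

    def __init__(self, n_keep=2):
        # Stored unchanged so that get_params/set_params (and grid search) work.
        self.n_keep = n_keep

    def fit(self, X, y=None):
        X = np.asarray(X)
        order = np.argsort(X.var(axis=0))[::-1]  # columns by descending variance
        self.columns_ = order[:self.n_keep]      # learned state, trailing underscore
        return self

    def transform(self, X):
        # Select columns (features); the number of rows is unchanged.
        return np.asarray(X)[:, self.columns_]


X = np.array([[0.0, 1.0, 10.0],
              [0.0, 2.0, 20.0],
              [0.0, 3.0, 30.0]])
selected = TopVarianceSelector(n_keep=2).fit_transform(X)
print(selected.shape)  # (3, 2)
```

Inheriting from `BaseEstimator` supplies `get_params` and `set_params`, and `TransformerMixin` supplies `fit_transform`, which is what lets such a class sit inside a `Pipeline` and have its parameters tuned by name.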

If this isn't making sense to you, post the details of at least one of your functions (the library it comes from or your code if you wrote it yourself) and we can go from there.

I apologize, I didn't look at your functions closely enough to realize that they transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations to your target, so you will have to do them beforehand, as you originally were. For your reference, here is what it would look like to write your custom class for your Fisher process, which would work if the function itself did not need to affect your target variable.

>>> from sklearn.base import BaseEstimator, TransformerMixin
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.datasets import load_iris
>>> 
>>> class Fisher(BaseEstimator, TransformerMixin):
...     def __init__(self,percentile=0.95):
...             self.percentile = percentile
...     def fit(self, X, y):
...             from numpy import shape, argsort, ceil
...             X_pos, X_neg = X[y==1], X[y==0]
...             X_mean = X.mean(axis=0)
...             X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
...             deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
...             num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
...             F = num/deno
...             sort_F = argsort(F)[::-1]
...             n_feature = (float(self.percentile)/100)*shape(X)[1]
...             self.ind_feature = sort_F[:ceil(n_feature)]
...             return self
...     def transform(self, x):
...             return x[self.ind_feature,:]
... 
>>> 
>>> data = load_iris()
>>> 
>>> pipeline = Pipeline([
...     ('fisher', Fisher()),
...     ('normal',StandardScaler()),
...     ('svm',SVC(class_weight='auto'))
... ])
>>> 
>>> grid = {
...     'fisher__percentile':[0.75,0.50],
...     'svm__C':[1,2]
... }
>>> 
>>> model = GridSearchCV(estimator = pipeline, param_grid=grid, cv=2)
>>> model.fit(data.data,data.target)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 596, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 378, in _fit
    for parameters in parameter_iterable
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 653, in __call__
    self.dispatch(function, args, kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 400, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 138, in __init__
    self.results = func(*args, **kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1239, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
    self.steps[-1][-1].fit(Xt, y, **fit_params)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/svm/base.py", line 149, in fit
    (X.shape[0], y.shape[0]))
ValueError: X and y have incompatible shapes.
X has 1 samples, but y has 75.
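
The error message (1 sample in X against 75 in y) is consistent with `transform` indexing rows (`x[self.ind_feature,:]`) rather than columns, combined with `percentile=0.75` being divided by 100 so that only a single index is kept. The iris target also has three classes, while `fit` only looks at labels 1 and 0. The following is a sketch with those points adjusted, not the answerer's own code: the class name `FisherSelector` is hypothetical, the data is a synthetic binary problem, and the imports assume the current `sklearn.model_selection` module instead of `sklearn.grid_search`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


class FisherSelector(BaseEstimator, TransformerMixin):
    """Keep the top `percentile` percent of features, ranked as in the Fisher class above."""

    def __init__(self, percentile=50):
        self.percentile = percentile  # a value in (0, 100], e.g. 50 keeps half

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        X_pos, X_neg = X[y == 1], X[y == 0]  # assumes binary labels 0/1
        X_mean = X.mean(axis=0)
        num = (X_pos.mean(axis=0) - X_mean) ** 2 + (X_neg.mean(axis=0) - X_mean) ** 2
        deno = (X_pos.var(axis=0) / (X_pos.shape[0] - 1)
                + X_neg.var(axis=0) / (X_neg.shape[0] - 1))
        order = np.argsort(num / deno)[::-1]  # best-scoring features first
        n_feature = int(np.ceil(self.percentile / 100.0 * X.shape[1]))
        self.ind_feature_ = order[:n_feature]
        return self

    def transform(self, X):
        # Select columns (features), so the number of samples still matches y.
        return np.asarray(X)[:, self.ind_feature_]


# Synthetic binary data, for illustration only.
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

pipeline = Pipeline([
    ("fisher", FisherSelector()),
    ("normal", StandardScaler()),
    ("svm", SVC()),
])
grid = {"fisher__percentile": [50, 75], "svm__C": [1, 2]}

model = GridSearchCV(estimator=pipeline, param_grid=grid, cv=2, scoring="f1")
model.fit(X, y)
print(model.best_params_)
```

Because the selector is a pipeline step, `GridSearchCV` refits it on the training portion of each fold, so the feature ranking never sees the held-out data. Any resampling of X and y together would still have to happen outside this pipeline, as noted in the answer.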
