sklearn: Have an estimator that filters samples

Problem description

I'm trying to implement my own Imputer. Under certain conditions, I would like to filter out some of the training samples (those I deem low quality).

However, the transform method returns only X and not y, and y itself is a numpy array (which, to the best of my knowledge, I can't filter in place). Moreover, when I use GridSearchCV, the y that my transform method receives is None. I can't seem to find a way to do it.

Just to clarify: I'm perfectly clear on how to filter arrays. What I can't find is a way to fit sample filtering of the y vector into the current API.

I really want to do this from a BaseEstimator implementation so that I can use it with GridSearchCV (it has a few parameters). Am I missing a different way to achieve sample filtering (not through BaseEstimator, but still GridSearchCV compliant)? Is there some way around the current API?

Recommended answer

I have found a solution, which has three parts:

  1. Have the if idx == id(self.X): line in transform. This makes sure samples are filtered only on the training set.
  2. Override fit_transform to make sure the transform method gets y and not None.
  3. Override the Pipeline to allow transform to return said y.
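
The identity check from step 1 can be looked at in isolation. The following is a minimal sketch, not part of the original answer: the class name IdentityCheckDemo and its method is_training_data are hypothetical, and only numpy is assumed. It shows that the check recognizes only the exact array object passed to fit, not another array and not even an equal copy.

```python
import numpy as np


class IdentityCheckDemo:
    """Hypothetical helper that remembers the array passed to fit()."""

    def fit(self, X):
        self.X = X  # keep a reference to the training array, not a copy
        return self

    def is_training_data(self, X):
        # True only when X is the very same object that fit() received
        return id(X) == id(self.X)


X_train = np.arange(12.0).reshape(4, 3)
X_test = np.arange(6.0).reshape(2, 3)

demo = IdentityCheckDemo().fit(X_train)
same_object = demo.is_training_data(X_train)        # the array given to fit()
other_array = demo.is_training_data(X_test)         # a different array
equal_copy = demo.is_training_data(X_train.copy())  # same values, new object
```

Because the check compares object identity rather than contents, it relies on the caller passing the same array object to fit and transform. If the data is copied or re-sliced in between, the sample filter is silently skipped.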

Here's sample code demonstrating it. It might not cover all the tiny details, but I think it solves the major issue, which is with the API.

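# Note: this targets an old scikit-learn release. sklearn.cross_validation,
# sklearn.grid_search and sklearn.externals.six were removed in later versions.
from sklearn.base import BaseEstimator, TransformerMixin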
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn import cross_validation
from sklearn.grid_search import GridSearchCV
from sklearn.externals import six

class SampleAndFeatureFilter(BaseEstimator, TransformerMixin):
    def __init__(self, perc = None):
        self.perc = perc

    def fit(self, X, y=None):
        self.X = X
        sum_per_feature = X.sum(0)
        sum_per_sample = X.sum(1)
        self.featurefilter = sum_per_feature >= np.percentile(sum_per_feature, self.perc)
        self.samplefilter  = sum_per_sample >= np.percentile(sum_per_sample, self.perc)
        return self

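    def transform(self, X, y=None, copy=None):
        # Record the identity of the input before slicing creates a new array.
        idx = id(X)
        X = X[:, self.featurefilter]
        # Filter samples only when X is the very object that fit() received,
        # i.e. the training set; any other data keeps all of its rows.
        if idx == id(self.X):
            X = X[self.samplefilter, :]
            if y is not None:
                y = y[self.samplefilter]
            return X, y
        return X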

    def fit_transform(self, X, y=None, **fit_params):
        if y is None:
            return self.fit(X, **fit_params).transform(X)
        else:
            return self.fit(X, y, **fit_params).transform(X,y)

class PipelineWithSampleFiltering(Pipeline):
    def fit_transform(self, X, y=None, **fit_params):
        Xt, yt, fit_params = self._pre_transform(X, y, **fit_params)
        if hasattr(self.steps[-1][-1], 'fit_transform'):
            return self.steps[-1][-1].fit_transform(Xt, yt, **fit_params)
        else:
            return self.steps[-1][-1].fit(Xt, yt, **fit_params).transform(Xt)

    def fit(self, X, y=None, **fit_params):
        Xt, yt, fit_params = self._pre_transform(X, y, **fit_params)
        self.steps[-1][-1].fit(Xt, yt, **fit_params)
        return self

    def _pre_transform(self, X, y=None, **fit_params):
        fit_params_steps = dict((step, {}) for step, _ in self.steps)
        for pname, pval in six.iteritems(fit_params):
            step, param = pname.split('__', 1)
            fit_params_steps[step][param] = pval
        Xt = X
        yt = y
        for name, transform in self.steps[:-1]:
            if hasattr(transform, "fit_transform"):
                res = transform.fit_transform(Xt, yt, **fit_params_steps[name])
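                # A tuple means the step returned (X, y); checking len(res) == 2
                # would also match a plain array that happens to have two rows.
                if isinstance(res, tuple):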
                    Xt, yt = res
                else:
                    Xt = res
            else:
                # Use yt (not y) so the labels match any earlier sample filtering.
                Xt = transform.fit(Xt, yt, **fit_params_steps[name]) \
                              .transform(Xt)
        return Xt, yt, fit_params_steps[self.steps[-1][0]]

if __name__ == '__main__':
    X = np.random.random((100,30))
    y = np.random.random_integers(0, 1, 100)
    pipe = PipelineWithSampleFiltering([('flt', SampleAndFeatureFilter()), ('cls', GaussianNB())])
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size = 0.3, random_state = 42)
    kfold = cross_validation.KFold(len(y_train), 10)
    clf = GridSearchCV(pipe, cv = kfold, param_grid = {'flt__perc':[10,20,30,40,50,60,70,80]}, n_jobs = 1)
    clf.fit(X_train, y_train)
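
A possible alternative, which is not part of the original answer, is to avoid overriding Pipeline internals altogether and wrap the final classifier in a meta-estimator that filters samples inside its own fit. The sketch below assumes a recent scikit-learn (sklearn.model_selection); the class name FilteredClassifier and its estimator parameter are hypothetical, and the filtering rule simply mirrors the percentile thresholds used above.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB


class FilteredClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: drops low-sum samples and features before fitting."""

    def __init__(self, estimator=None, perc=50):
        self.estimator = estimator
        self.perc = perc

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        sum_per_feature = X.sum(axis=0)
        sum_per_sample = X.sum(axis=1)
        # The feature mask is stored because predict() must apply it as well.
        self.feature_mask_ = sum_per_feature >= np.percentile(sum_per_feature, self.perc)
        # The sample mask is used only here, so X and y are filtered together.
        sample_mask = sum_per_sample >= np.percentile(sum_per_sample, self.perc)
        self.classes_ = np.unique(y)
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X[sample_mask][:, self.feature_mask_], y[sample_mask])
        return self

    def predict(self, X):
        # No samples are dropped at prediction time, only features.
        return self.estimator_.predict(np.asarray(X)[:, self.feature_mask_])


rng = np.random.RandomState(0)
X = rng.random_sample((100, 30))
y = rng.randint(0, 2, size=100)

search = GridSearchCV(
    FilteredClassifier(estimator=GaussianNB()),
    param_grid={"perc": [10, 30, 50]},
    cv=3,
)
search.fit(X, y)
predictions = search.predict(X)
```

Since the sample mask is computed and applied inside fit, X and y always stay aligned, and during cross-validation only the training folds are filtered. The trade-off is that the filter is tied to one wrapped estimator instead of being a reusable pipeline step.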
