Custom transformer for sklearn Pipeline that alters both X and y

Question

I want to create my own transformer for use with the sklearn Pipeline, so I am creating a class that implements both the fit and transform methods. The purpose of the transformer is to remove rows from the matrix that have more than a specified number of NaNs. The issue I am facing is how to change both the X and y matrices that are passed to the transformer. I believe this has to be done in the fit method, since it has access to both X and y. Because Python passes arguments by assignment, once I reassign X to a new matrix with fewer rows, the reference to the original X is lost (and of course the same is true for y). Is it possible to maintain this reference?

I'm using a pandas DataFrame to easily drop the rows that have too many NaNs; this may not be the right way to do it for my use case. The current code looks like this:

import pandas as pd

class Dropna():

    # thresh is the maximum number of NaNs allowed in a row
    def __init__(self, thresh=0):
        self.thresh = thresh

    def fit(self, X, y):
        total = X.shape[1]
        # +1 to account for 'y' being added to the DataFrame
        new_thresh = total + 1 - self.thresh
        df = pd.DataFrame(X)
        df['y'] = y
        df.dropna(thresh=new_thresh, inplace=True)
        # These assignments only rebind the local names X and y;
        # the caller's arrays are left unchanged.
        X = df.drop('y', axis=1).values
        y = df['y'].values
        return self

    def transform(self, X):
        return X

Recommended answer

Modifying the sample axis, e.g. removing samples, does not (yet?) comply with the scikit-learn transformer API. So if you need to do this, you should do it outside any calls to scikit-learn, as preprocessing.
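As a rough illustration of doing this as preprocessing, the sketch below filters X and y together with a boolean mask before anything is handed to a pipeline. The helper name drop_rows_with_many_nans and its max_nans parameter are hypothetical, chosen only to mirror the thresh idea from the question; it assumes X is a numeric array.

```python
import numpy as np

def drop_rows_with_many_nans(X, y, max_nans=0):
    """Hypothetical helper: keep only rows of X with at most max_nans NaNs,
    and keep the matching entries of y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Count NaNs per row and build a mask of the rows to keep.
    keep = np.isnan(X).sum(axis=1) <= max_nans
    # Indexing with the same mask keeps X and y aligned.
    return X[keep], y[keep]

X = np.array([[1.0, 2.0, 3.0],
              [np.nan, np.nan, 6.0],
              [7.0, np.nan, 9.0]])
y = np.array([0, 1, 0])

# Allow at most one NaN per row; the second row has two and is dropped.
X_clean, y_clean = drop_rows_with_many_nans(X, y, max_nans=1)
```

The filtered X_clean and y_clean can then be passed to a Pipeline's fit method as usual, since the row removal has already happened outside scikit-learn.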

As it is now, the transformer API is used to transform the features of a given sample into something new. This can implicitly contain information from other samples, but samples are never deleted.

Another option is to attempt to impute the missing values. But again, if you need to delete samples, treat it as preprocessing before using scikit-learn.
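For the imputation route, a minimal sketch could look like the following. It assumes a scikit-learn version that provides sklearn.impute.SimpleImputer; the mean strategy, the toy data, and the LogisticRegression step are illustrative choices, not something the answer prescribes.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data with missing values (illustrative only).
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [5.0, np.nan],
              [7.0, 8.0]])
y = np.array([0, 0, 1, 1])

# The imputer fills NaNs column by column; no rows are removed,
# so X and y stay aligned throughout the pipeline.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)

# Inspect what the fitted imputer produces for the training data.
X_imputed = pipe.named_steps["impute"].transform(X)
```

Because the imputer only changes feature values, it fits within the transformer API described above, unlike a step that drops samples.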