Custom transformer for sklearn Pipeline that alters both X and y

Question

I want to create my own transformer for use with the sklearn Pipeline.

I am creating a class that implements both fit and transform methods. The purpose of the transformer will be to remove rows from the matrix that have more than a specified number of NaNs.

The problem I'm facing is how to alter both the X and y matrices that are passed to the transformer.

I believe this has to be done in the fit method, since it has access to both X and y. Because Python passes arguments by assignment, once I reassign X to a new matrix with fewer rows, the reference to the original X is lost (and the same is true for y). Is it possible to maintain this reference?
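The rebinding behaviour described above can be illustrated with a minimal sketch. The function name `rebind` and the sample array are hypothetical and used only for illustration; NumPy is assumed to be available.

```python
import numpy as np

def rebind(X):
    # Assigning to the parameter name creates a new local binding.
    # The array object held by the caller is not modified.
    X = X[~np.isnan(X).any(axis=1)]
    return X.shape[0]

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, 5.0]])

rows_inside = rebind(X)    # rows left in the function's local copy
rows_outside = X.shape[0]  # rows still present in the caller's array
print(rows_inside, rows_outside)
```

Inside the function one row is dropped, while the caller's array keeps all of its rows, which is why filtering inside fit has no visible effect on the data that later pipeline steps receive.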

I'm using a pandas DataFrame to easily drop the rows that have too many NaNs, though this may not be the right approach for my use case. The current code looks like this:

import pandas as pd

class Dropna():

    # thresh is the maximum number of NaNs allowed in a row
    def __init__(self, thresh=0):
        self.thresh = thresh

    def fit(self, X, y):
        total = X.shape[1]
        # +1 to account for 'y' being added to the DataFrame
        new_thresh = total + 1 - self.thresh
        df = pd.DataFrame(X)
        df['y'] = y
        df.dropna(thresh=new_thresh, inplace=True)
        # These assignments only rebind the local names X and y;
        # the caller's arrays are left unchanged.
        X = df.drop('y', axis=1).values
        y = df['y'].values
        return self

    def transform(self, X):
        return X

Recommended answer

Modifying the sample axis, e.g. removing samples, does not (yet?) comply with the scikit-learn transformer API. So if you need to do this, you should do it outside any calls to scikit-learn, as preprocessing.
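One way to do this preprocessing step is sketched below. The helper `drop_rows_with_nans` and its `max_nans` parameter are hypothetical names, not part of scikit-learn, and unlike the code in the question this sketch counts NaNs in X only, not in y.

```python
import numpy as np

def drop_rows_with_nans(X, y, max_nans=0):
    """Return copies of X and y without the rows of X that
    contain more than max_nans NaN values."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Boolean mask: True for rows that are allowed to stay.
    keep = np.isnan(X).sum(axis=1) <= max_nans
    return X[keep], y[keep]

X = np.array([[1.0, np.nan, 3.0],
              [np.nan, np.nan, 6.0],
              [7.0, 8.0, 9.0]])
y = np.array([0, 1, 0])

# Filter X and y together, before anything is passed to scikit-learn.
X_clean, y_clean = drop_rows_with_nans(X, y, max_nans=1)
```

The filtered `X_clean` and `y_clean` can then be handed to `Pipeline.fit` as usual, so X and y stay aligned without any transformer having to change the number of samples.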

As it is now, the transformer API is used to transform the features of a given sample into something new. This can implicitly contain information from other samples, but samples are never deleted.
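As a rough illustration of that contract, here is a small custom transformer that uses information from all samples (the column means) but returns exactly as many rows as it receives. The class name `ColumnCenterer` is hypothetical; `BaseEstimator` and `TransformerMixin` are the standard scikit-learn base classes.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnCenterer(BaseEstimator, TransformerMixin):
    """Subtract the per-column mean learned during fit."""

    def fit(self, X, y=None):
        # Statistics are computed across all samples seen in fit.
        self.mean_ = np.nanmean(np.asarray(X, dtype=float), axis=0)
        return self

    def transform(self, X):
        # Only the feature values change; the row count is preserved.
        return np.asarray(X, dtype=float) - self.mean_

X = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [5.0, 50.0]])
Xt = ColumnCenterer().fit_transform(X)
```

Because transform receives only X, a step like this has no way to return a matching, shortened y, which is the practical reason row removal does not fit into a Pipeline.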

Another option is to attempt to impute the missing values. But again, if you need to delete samples, treat it as preprocessing before using scikit-learn.
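A minimal sketch of the imputation route, assuming a scikit-learn version that provides `sklearn.impute.SimpleImputer`. The toy data, the mean strategy, and the choice of `LogisticRegression` as the final estimator are assumptions made for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 5.0],
              [4.0, 7.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([
    # Replace each NaN with the mean of its column; no rows are removed.
    ("impute", SimpleImputer(strategy="mean")),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)

# The imputer keeps every sample, so X and y stay aligned.
X_imputed = pipe.named_steps["impute"].transform(X)
```

Since imputation changes feature values rather than the sample axis, it fits inside the Pipeline, unlike dropping rows.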
