scikit-learn: impute the mean of a feature within groups of a nominal value in another feature

Question

I want to impute the mean of a feature, but calculate that mean only from other examples that have the same category/nominal value in another column. Is this possible using scikit-learn's Imputer class? That would make it easier to add to a pipeline.

For example:

Using the Titanic dataset from Kaggle: source

How would I go about imputing the mean fare per pclass? The thinking behind it is that ticket prices differ greatly between passenger classes, so a class-specific mean should be a better fill value than the overall mean.
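
Outside a pipeline, the per-class mean can be sketched with pandas. This is only an illustration and not the Imputer-based solution asked about; it assumes pandas is available, and the toy values and the `fare_imputed` column name are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic columns; the values are invented.
df = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "fare": [80.0, np.nan, 100.0, 8.0, 10.0, np.nan],
})

# Mean fare of each pclass, aligned row by row with the original frame.
# NaN fares are skipped when the mean is computed.
class_mean = df.groupby("pclass")["fare"].transform("mean")

# Fill only the missing fares, each with the mean of its own class.
df["fare_imputed"] = df["fare"].fillna(class_mean)
print(df)
```

Note that this computes the means from whatever frame it is given, so it does not by itself carry training-set statistics over to a test set.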

Update: After discussing this with some people, the phrase I should have used was "imputing the mean within class".

I've looked into Vivek's comment below and will construct a generic pipeline function to do what I want when I get time :) I have a good idea of how to do it and will post it as an answer when it's finished.

Recommended answer

Below is a pretty simple approach to my question that is only meant to handle means. A more robust implementation would probably build on the Imputer class from scikit-learn, which would mean it could also do mode, median, etc. and would be better at dealing with sparse/dense matrices.
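
As a rough idea of what a more general version might look like, here is a minimal sketch that learns a per-group statistic in `fit` and applies it in `transform`. It is not the author's code and does not build on scikit-learn's own imputer; the class name `GroupStatImputer` and its parameters are hypothetical, it assumes a dense, fully numeric array with NaN marking missing values, and it covers only mean and median (mode and sparse input are left out).

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class GroupStatImputer(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: fill NaNs in one column with a per-group statistic."""

    def __init__(self, target_col, group_col, strategy="mean"):
        self.target_col = target_col
        self.group_col = group_col
        self.strategy = strategy

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        stat = {"mean": np.nanmean, "median": np.nanmedian}[self.strategy]
        target, groups = X[:, self.target_col], X[:, self.group_col]
        # Fallback for groups that are unseen or entirely missing at fit time.
        self.fallback_ = stat(target)
        self.statistics_ = {}
        for g in np.unique(groups[~np.isnan(groups)]):
            values = target[(groups == g) & ~np.isnan(target)]
            if values.size:
                self.statistics_[g] = stat(values)
        return self

    def transform(self, X):
        X = np.array(X, dtype=float)  # float copy; row order is preserved
        missing = np.isnan(X[:, self.target_col])
        for i in np.flatnonzero(missing):
            g = X[i, self.group_col]
            X[i, self.target_col] = self.statistics_.get(g, self.fallback_)
        return X


# Column 0 is the group (e.g. pclass), column 1 the feature to fill (e.g. fare).
X = np.array([[1, 80.0], [1, np.nan], [3, 8.0], [3, 10.0], [3, np.nan]])
imputer = GroupStatImputer(target_col=1, group_col=0, strategy="median").fit(X)
X_filled = imputer.transform(X)
```

Storing the statistics at fit time means the values learned on training data are reused unchanged when the transformer later sees test data inside a pipeline.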

This is based on Vivek Kumar's comment on the original question, which suggested splitting the data into stacks (one per class), imputing each stack separately, and then re-assembling them.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class WithinClassMeanImputer(BaseEstimator, TransformerMixin):
    def __init__(self, replace_col_index, class_col_index=None, missing_values=np.nan):
        self.missing_values = missing_values
        self.replace_col_index = replace_col_index
        self.y = None
        self.class_col_index = class_col_index

    def fit(self, X, y=None):
        # Remember the labels; they are only used when class_col_index is None
        self.y = y
        return self

    def _is_missing(self, column):
        # NaN never compares equal to itself, so `== np.nan` would match nothing
        if isinstance(self.missing_values, float) and np.isnan(self.missing_values):
            return np.isnan(column)
        return column == self.missing_values

    def transform(self, X):
        X = np.array(X, dtype=float)  # work on a float copy of the input

        if self.class_col_index is None:
            # If we're using the dependent variable seen in fit
            if self.y is None or len(self.y) != len(X):
                return X  # no usable labels for these rows
            groups = np.asarray(self.y)
        else:
            # If we're using nominal values within an encoded feature (i.e. the
            # classes are unique values within a nominal column - e.g. sex)
            groups = X[:, self.class_col_index]

        missing = self._is_missing(X[:, self.replace_col_index])
        for aclass in np.unique(groups):
            in_class = groups == aclass
            # Calculate mean from examples without missing values
            observed = X[in_class & ~missing, self.replace_col_index]
            if observed.size:
                # Broadcast mean to this class's missing values; rows keep
                # their original order, so X stays aligned with y
                X[in_class & missing, self.replace_col_index] = observed.mean()

        return X
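
One detail worth checking when adapting the class above: with the default `missing_values=np.nan`, an equality test cannot locate the missing entries, because NaN is not equal to anything, including itself. The short example below (with made-up fare values) shows the difference between `==` and `np.isnan`.

```python
import numpy as np

fares = np.array([7.25, np.nan, 71.28])

# Equality against NaN is always False, so this mask finds nothing.
eq_mask = fares == np.nan

# np.isnan tests for NaN explicitly and flags the second entry.
nan_mask = np.isnan(fares)

print(eq_mask.any(), nan_mask)
```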
