scikit-learn impute mean of feature within groups of nominal value in another feature

Question

I want to impute the mean of a feature, but calculate that mean only from other examples that have the same category/nominal value in another column. Is this possible using scikit-learn's Imputer class? It would make it easier to add into a pipeline that way.

For example, using the Titanic dataset from Kaggle: source

How would I go about imputing the mean fare per pclass? The thinking behind it is that ticket costs would differ greatly between people in different classes.

Update: After discussion with some people, the phrase I should have used was "imputing the mean within class".

I've looked into Vivek's comment below and will construct a generic pipeline function to do what I want when I get time :) I have a good idea of how to do it and will post it as an answer when it's finished.

Recommended answer

So below is a pretty simple approach to my question that was only meant to handle means. A more robust implementation would probably involve utilising the Imputer class from scikit-learn, which would mean it could also do mode, median, etc. and would be better at dealing with sparse/dense matrices.

This is based on Vivek Kumar's comment on the original question, which suggested splitting the data into stacks, imputing each stack separately, and then re-assembling them.
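
To make the split, impute, re-assemble idea concrete, here is a minimal NumPy sketch on made-up numbers. The helper name `impute_mean_within_groups` and the toy array are illustrative assumptions, not part of the original answer. It selects each group with a boolean mask instead of building physically separate stacks, so the rows stay in their original order.

```python
import numpy as np

def impute_mean_within_groups(X, group_col, value_col):
    """Fill NaNs in value_col with the mean of rows sharing the same group_col value.

    Hypothetical helper, for illustration only.
    """
    X = np.array(X, dtype=float)          # copy, so the input is not modified
    missing = np.isnan(X[:, value_col])   # NaN must be detected with isnan, not ==
    for group in np.unique(X[:, group_col]):
        in_group = X[:, group_col] == group
        observed = in_group & ~missing
        if observed.any():
            # Mean of the observed values in this group, written into its gaps
            X[in_group & missing, value_col] = X[observed, value_col].mean()
    return X

# Toy data: column 0 is a made-up passenger class, column 1 a made-up fare
toy = np.array([
    [1, 80.0],
    [1, np.nan],
    [1, 100.0],
    [3, 8.0],
    [3, np.nan],
    [3, 10.0],
])
filled = impute_mean_within_groups(toy, group_col=0, value_col=1)
print(filled[:, 1])  # the two gaps become 90 (class 1) and 9 (class 3)
```

A group with no observed values is left untouched here, since there is no within-group mean to fall back on.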

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class WithinClassMeanImputer(BaseEstimator, TransformerMixin):
    def __init__(self, replace_col_index, class_col_index=None, missing_values=np.nan):
        self.missing_values = missing_values
        self.replace_col_index = replace_col_index
        self.y = None
        self.class_col_index = class_col_index

    def _missing_mask(self, column):
        # NaN never compares equal to itself, so it needs np.isnan rather than ==
        if isinstance(self.missing_values, float) and np.isnan(self.missing_values):
            return np.isnan(column)
        return column == self.missing_values

    def fit(self, X, y=None):
        # Keep the labels so transform can group by them when class_col_index is None
        self.y = y
        return self

    def transform(self, X):
        # Work on a float copy so the caller's array is left untouched
        X = np.array(X, dtype=float)

        if self.class_col_index is None:
            # If we're using the dependent variable, it must line up with X
            if self.y is None or len(self.y) != len(X):
                return X
            groups = np.asarray(self.y)
        else:
            # If we're using nominal values within a binarised feature (i.e. the classes
            # are unique values within a nominal column - e.g. sex)
            groups = X[:, self.class_col_index]

        missing = self._missing_mask(X[:, self.replace_col_index])

        for aclass in np.unique(groups):
            in_class = groups == aclass
            observed = in_class & ~missing
            if not observed.any():
                # No observed values in this class, so there is no mean to impute
                continue
            # Calculate mean from examples without missing values
            mean = X[observed, self.replace_col_index].mean()
            # Broadcast mean to all missing values in this class, keeping row order
            X[in_class & missing, self.replace_col_index] = mean

        return X
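
The answer mentions that a more robust version could build on scikit-learn's imputer so that median or mode also work. The following is one possible sketch of that idea, not the author's implementation. It assumes a recent scikit-learn, where `SimpleImputer` from `sklearn.impute` takes the place of the older `Imputer` class. The class name `GroupwiseImputer` and its parameters `group_col` and `value_col` are hypothetical. Unlike the class above, it learns the per-group statistics in `fit`, so they can be applied to unseen rows.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer

class GroupwiseImputer(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: one SimpleImputer per value of a grouping column."""

    def __init__(self, group_col, value_col, strategy="mean"):
        self.group_col = group_col
        self.value_col = value_col
        self.strategy = strategy

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        values = X[:, [self.value_col]]   # 2-D column, as SimpleImputer expects
        # Column-wide fallback for groups that were not seen during fit
        self.fallback_ = SimpleImputer(strategy=self.strategy).fit(values)
        self.imputers_ = {}
        for group in np.unique(X[:, self.group_col]):
            rows = X[:, self.group_col] == group
            # Skip groups with nothing observed; they use the fallback instead
            if not np.isnan(values[rows]).all():
                self.imputers_[group] = SimpleImputer(strategy=self.strategy).fit(values[rows])
        return self

    def transform(self, X):
        X = np.array(X, dtype=float)      # copy, so the input is not modified
        for group in np.unique(X[:, self.group_col]):
            rows = X[:, self.group_col] == group
            if not rows.any():
                continue                  # e.g. a NaN group label matches no row
            imputer = self.imputers_.get(group, self.fallback_)
            column = X[rows][:, [self.value_col]]
            X[rows, self.value_col] = imputer.transform(column)[:, 0]
        return X

# Made-up data: column 0 is the group, column 1 the value to impute
train = np.array([[1, 80.0], [1, np.nan], [1, 100.0], [3, 8.0], [3, 10.0]])
new_rows = np.array([[1, np.nan], [3, np.nan]])
imputer = GroupwiseImputer(group_col=0, value_col=1, strategy="median").fit(train)
result = imputer.transform(new_rows)
print(result[:, 1])  # per-group medians: 90 for group 1, 9 for group 3
```

Because the grouping column is part of `X`, this transformer can sit inside a `Pipeline` without needing `y` at transform time. It assumes a dense, fully numeric array, so it does not address the sparse-matrix handling the answer also mentions.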
