Pandas - Replace outliers with groupby mean

Question

I have a pandas dataframe which I would like to split into groups, calculate the mean and standard deviation of each group, and then replace all outliers with the mean of their group. A value is an outlier if it is more than 3 standard deviations away from its group mean.

df = pandas.DataFrame({'a': ['A','A','A','B','B','B','B'], 'b': [1.1,1.2,1.1,3.3,3.4,3.3,100.0]})

I thought that the following would work:

df.groupby('a')['b'].transform(lambda x: x[i] if np.abs(x[i]-x.mean())<=(3*x.std()) else x.mean() for i in range(0,len(x)))

But I get the following error:

NameError: name 'x' is not defined

I have also tried defining a transform function separately:

def trans_func(x):
    mean = x.mean()
    std = x.std()
    length = len(x)
    for i in range(0,length):
        if abs(x[i]-mean)<=(3*std):
            return x
        else:
            return mean

And then calling it like this:

df.groupby('a')['b'].transform(lambda x: trans_func(x))

But I get another error:

KeyError: 0

Finally, I resorted to creating a separate column altogether:

df['c'] = [df.groupby('a')['b'].transform(mean) if df.groupby('a')['b'].transform(lambda x: (x - x.mean()) / x.std()) > 3 else df['b']] 

But this doesn't work either:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Any suggestions would be greatly appreciated.

Recommended answer

Try this:

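def replace(group):
    # Per-group mean and sample standard deviation
    mean, std = group.mean(), group.std()
    # Flag values more than 3 standard deviations from the group mean
    outliers = (group - mean).abs() > 3*std
    # Return a copy with flagged values replaced, leaving the input untouched
    return group.mask(outliers, mean)        # or "group[~outliers].mean()"

df.groupby('a')['b'].transform(replace)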
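As a possible alternative, the same rule can be written without a per-group Python function, which is close in spirit to the separate-column attempt in the question. The sketch below is only an illustration: the helper name `replace_outliers` and its parameters are hypothetical, and it assumes a single numeric column grouped by a single key.

```python
import pandas as pd

def replace_outliers(df, key, col, n_std=3.0):
    """Return `col` with per-group outliers replaced by the group mean.

    Hypothetical helper, not part of pandas.
    """
    grouped = df.groupby(key)[col]
    # transform() broadcasts each group's statistic back onto its rows
    group_mean = grouped.transform('mean')
    group_std = grouped.transform('std')   # sample std, same as Series.std()
    # Element-wise boolean Series, so no ambiguous truth value is evaluated
    is_outlier = (df[col] - group_mean).abs() > n_std * group_std
    return df[col].mask(is_outlier, group_mean)

df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1.1, 1.2, 1.1, 3.3, 3.4, 3.3, 100.0]})

# Store the cleaned values in a new column, as the question attempted
df['c'] = replace_outliers(df, 'a', 'b')
```

The comparison produces a boolean Series that `mask` consumes directly, which avoids the `if ... else` on a whole Series that raised the ValueError in the question.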

Note: If you want to eliminate the 100 in your last group you can replace 3*std by just 1*std. The standard deviation in this group is 48.33 and the mean is 27.5, so 100 is only about 1.5 standard deviations from the mean and is kept when the threshold is 3*std.
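To see how a threshold behaves before choosing one, it can help to look at each value's distance from its group mean in units of the group standard deviation. The following is a minimal sketch on the question's data; the variable names `z`, `flag_3`, and `flag_1` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1.1, 1.2, 1.1, 3.3, 3.4, 3.3, 100.0]})

grouped = df.groupby('a')['b']

# Absolute distance from the group mean, measured in group standard deviations
z = (df['b'] - grouped.transform('mean')).abs() / grouped.transform('std')

flag_3 = z > 3   # rows treated as outliers at the 3*std threshold
flag_1 = z > 1   # rows treated as outliers at the 1*std threshold

print(df.assign(z=z, flag_3=flag_3, flag_1=flag_1))
```

On this data nothing is flagged at 3 standard deviations. At 1 standard deviation the 100.0 in group B is flagged, but so is the 1.2 in group A, because the standard deviation of such a small, tight group is itself tiny. Lowering the threshold is therefore worth checking against every group, not just the one with the obvious outlier.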
