How to fill a DataFrame's empty/NaN cells with a conditional column mean

Question

I am trying to fill the null/empty values of a pandas DataFrame with the mean of that specific column.

The data looks like this:

    ID  Name        Industry            Year        Revenue
    1   Treslam     Financial Services  2009        $5,387,469 
    2   Rednimdox   Construction        2013    
    3   Lamtone     IT Services         2009        $11,757,018 
    4   Stripfind   Financial Services  2010        $12,329,371 
    5   Openjocon   Construction        2013        $4,273,207 
    6   Villadox    Construction        2012        $1,097,353 
    7   Sumzoomit   Construction        2010        $7,703,652
    8   Abcddd      Construction        2019    
    .
    .

I am trying to fill each empty cell with the mean of the Revenue column over the rows where Industry == 'Construction'.

To get the numerical mean value I did:

df.groupby(['Industry'], as_index = False).mean()

I am trying to do something like this to fill the empty cells in place:

(df[df['Industry'] == "Construction"]['Revenue']).fillna("$21212121.01", inplace = True)

...but it is not working. Can anyone tell me how to achieve this? Thanks a lot.

Expected output:

ID  Name        Industry            Year        Revenue
1   Treslam     Financial Services  2009        $5,387,469 
2   Rednimdox   Construction        2013        $21212121.01
3   Lamtone     IT Services         2009        $11,757,018 
4   Stripfind   Financial Services  2010        $12,329,371 
5   Openjocon   Construction        2013        $4,273,207 
6   Villadox    Construction        2012        $1,097,353 
7   Sumzoomit   Construction        2010        $7,703,652
8   Abcddd      Construction        2019        $21212121.01
.
.

Recommended answer

The two approaches below give different numbers because they use two kinds of average: the normal average, which skips NaN, and an average whose denominator is the number of rows in the group, including the rows that contain NaN.

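# Strip "$" and "," so the column can be parsed as numbers
df['Revenue'] = df['Revenue'].replace({r'\$': '', ',': ''}, regex=True)
df['Revenue'] = df['Revenue'].astype(float)
# Per-industry mean; mean() skips NaN by default
df_mean = df.groupby(['Industry'], as_index=False)['Revenue'].mean()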

df_mean
             Industry       Revenue
0        Construction  4.358071e+06
1  Financial Services  8.858420e+06
2         IT Services  1.175702e+07
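
To reproduce the numbers above without the original file, here is a minimal self-contained sketch. It assumes the eight sample rows from the question are the whole data set and that the empty Revenue cells are NaN; the frame construction and the use of `pd.to_numeric` with a single regex are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; empty revenue cells are assumed to be NaN
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Name": ["Treslam", "Rednimdox", "Lamtone", "Stripfind",
             "Openjocon", "Villadox", "Sumzoomit", "Abcddd"],
    "Industry": ["Financial Services", "Construction", "IT Services",
                 "Financial Services", "Construction", "Construction",
                 "Construction", "Construction"],
    "Year": [2009, 2013, 2009, 2010, 2013, 2012, 2010, 2019],
    "Revenue": ["$5,387,469", np.nan, "$11,757,018", "$12,329,371",
                "$4,273,207", "$1,097,353", "$7,703,652", np.nan],
})

# Remove "$" and "," with one regex character class, then parse as numbers
df["Revenue"] = pd.to_numeric(df["Revenue"].str.replace(r"[$,]", "", regex=True))

# mean() skips NaN by default, so Construction is averaged over 3 rows
df_mean = df.groupby("Industry", as_index=False)["Revenue"].mean()
print(df_mean)
```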

# Named aggregation: Sum skips NaN, Size counts every row including NaN
# (the dict-renaming form agg({'Sum': ..., 'Size': ...}) raises an error on pandas 1.0+)
df_mean_nan = df.groupby('Industry')['Revenue'].agg(Sum='sum', Size='size').reset_index()
df_mean_nan['Mean_nan'] = df_mean_nan['Sum'] / df_mean_nan['Size']

df_mean_nan

             Industry         Sum  Size    Mean_nan
0        Construction  13074212.0     5   2614842.4
1  Financial Services  17716840.0     2   8858420.0
2         IT Services  11757018.0     1  11757018.0
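
One way to read Mean_nan is that each missing revenue is treated as 0 before averaging. The small sketch below checks this for the Construction rows only; the variable names are hypothetical and the values are the already cleaned numbers from the question.

```python
import numpy as np
import pandas as pd

# Construction rows from the question, already numeric; NaN marks the empty cells
revenue = pd.Series([np.nan, 4273207.0, 1097353.0, 7703652.0, np.nan])

normal_mean = revenue.mean()                   # NaN skipped: sum / 3
mean_with_nan = revenue.sum() / revenue.size   # NaN counted: sum / 5
zero_filled_mean = revenue.fillna(0).mean()    # same value as mean_with_nan

print(normal_mean, mean_with_nan, zero_filled_mean)
```

Because the missing rows pull the average toward zero, this variant is only appropriate when a missing revenue really should count as no revenue.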

Average that takes NaN into account: (NaN rows are counted in the denominator)

df.loc[df['Revenue'].isna(),['Revenue']] = df_mean_nan.loc[df_mean_nan['Industry'] == 'Construction',['Mean_nan']].values

df
   ID       Name            Industry  Year     Revenue
0   1    Treslam  Financial Services  2009   5387469.0
1   2  Rednimdox        Construction  2013   2614842.4
2   3    Lamtone         IT Services  2009  11757018.0
3   4  Stripfind  Financial Services  2010  12329371.0
4   5  Openjocon        Construction  2013   4273207.0
5   6   Villadox        Construction  2012   1097353.0
6   7  Sumzoomit        Construction  2010   7703652.0
7   8     Abcddd        Construction  2019   2614842.4

Normal average: (NaN is excluded)

df.loc[df['Revenue'].isna(),['Revenue']] = df_mean.loc[df_mean['Industry'] == 'Construction',['Revenue']].values

df
   ID       Name            Industry  Year       Revenue
0   1    Treslam  Financial Services  2009  5.387469e+06
1   2  Rednimdox        Construction  2013  4.358071e+06
2   3    Lamtone         IT Services  2009  1.175702e+07
3   4  Stripfind  Financial Services  2010  1.232937e+07
4   5  Openjocon        Construction  2013  4.273207e+06
5   6   Villadox        Construction  2012  1.097353e+06
6   7  Sumzoomit        Construction  2010  7.703652e+06
7   8     Abcddd        Construction  2019  4.358071e+06
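
Both assignments above write the Construction mean into every row whose Revenue is NaN, which works for the sample because only Construction rows are empty. If other industries could have gaps too, a possible alternative (not part of the original answer) is to fill each row with the mean of its own industry using `groupby(...).transform('mean')`. This sketch also assigns the result back to the column, since the attempt in the question called `fillna(..., inplace=True)` on a filtered copy (`df[mask]['Revenue']`), which leaves `df` itself unchanged. The trimmed two-column frame is an assumption made to keep the example short.

```python
import numpy as np
import pandas as pd

# Industry and cleaned Revenue for the eight sample rows; NaN marks empty cells
df = pd.DataFrame({
    "Industry": ["Financial Services", "Construction", "IT Services",
                 "Financial Services", "Construction", "Construction",
                 "Construction", "Construction"],
    "Revenue": [5387469.0, np.nan, 11757018.0, 12329371.0,
                4273207.0, 1097353.0, 7703652.0, np.nan],
})

# Mean of each row's own industry, aligned to the original index (NaN skipped)
group_mean = df.groupby("Industry")["Revenue"].transform("mean")

# Assign back to the column rather than filling a temporary filtered copy
df["Revenue"] = df["Revenue"].fillna(group_mean)
print(df)
```

Rows that already have a revenue keep their value; only the NaN cells take the group mean.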
