How to replace missing values with group mode in Pandas?

Question

I followed the method in this post to replace missing values with the group mode, but I get "IndexError: index out of bounds".

 df['SIC'] = df.groupby('CIK').SIC.apply(lambda x: x.fillna(x.mode()[0]))

I guess this is probably because some groups have all missing values and do not have a mode. Is there a way to get around this? Thank you!
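The guess can be checked directly. The sketch below uses a made-up all-missing Series (not the asker's data) to show that mode() returns an empty result, so taking its first element fails. It uses .iloc[0] for the lookup because the exception raised by plain [0] on an empty Series can differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical group whose values are all missing.
all_missing = pd.Series([np.nan, np.nan, np.nan])

# mode() ignores NaN by default, so there is nothing left to count.
modes = all_missing.mode()
print(modes.empty)

# Positional access on the empty result is what fails.
failed = False
try:
    modes.iloc[0]
except IndexError as exc:
    failed = True
    print("IndexError:", exc)
```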

Recommended answer

mode is awkward, given that there really isn't any agreed-upon way to deal with ties. Plus it's typically very slow. Here's one way that will be "fast". We'll define a function that calculates the mode for each group, then fill the missing values afterwards with a map. Groups whose values are all missing cause no error (they simply stay NaN), though for ties we arbitrarily choose the modal value that comes first when sorted:

def fast_mode(df, key_cols, value_col):
    """
    Calculate a column mode, by group, ignoring null values.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame over which to calculate the mode.
    key_cols : list of str
        Columns to group by for calculation of the mode.
    value_col : str
        Column for which to calculate the mode.

    Returns
    -------
    pandas.DataFrame
        One row for the mode of value_col per key_cols group. If ties,
        returns the one which is sorted first.
    """
    # Count each (key, value) pair; groupby skips rows whose value is NaN.
    # A stable sort keeps tied values in their sorted order.
    return (df.groupby(key_cols + [value_col]).size()
              .to_frame('counts').reset_index()
              .sort_values('counts', ascending=False, kind='mergesort')
              .drop_duplicates(subset=key_cols)).drop(columns='counts')
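To see why an all-missing group causes no error with this approach, the following sketch walks through the same steps one at a time. The frame toy and its columns key and val are hypothetical names chosen for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: group "x" has a clear mode, group "y" is all missing.
toy = pd.DataFrame({
    "key": ["x", "x", "x", "y", "y"],
    "val": [1.0, 1.0, 2.0, np.nan, np.nan],
})

# groupby drops NaN keys by default, so rows with a missing val are not
# counted and group "y" produces no rows at all.
counts = toy.groupby(["key", "val"]).size().to_frame("counts").reset_index()
print(counts)

# Keep the most frequent value per key.
modes = (counts.sort_values("counts", ascending=False, kind="mergesort")
               .drop_duplicates(subset=["key"])
               .drop(columns="counts"))
print(modes)

# Mapping a key that has no entry in the lookup yields NaN instead of raising.
lookup = modes.set_index("key")["val"]
mapped = toy["key"].map(lookup)
print(mapped)
```

Because the all-missing group never appears in the lookup, the map step simply leaves its rows as NaN.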

Sample data df:

   CIK  SIK
0    C  2.0
1    C  1.0
2    B  NaN
3    B  3.0
4    A  NaN
5    A  3.0
6    C  NaN
7    B  NaN
8    C  1.0
9    A  2.0
10   D  NaN
11   D  NaN
12   D  NaN

Code:

df.loc[df.SIK.isnull(), 'SIK'] = df.CIK.map(fast_mode(df, ['CIK'], 'SIK').set_index('CIK').SIK)

Output df:

   CIK  SIK
0    C  2.0
1    C  1.0
2    B  3.0
3    B  3.0
4    A  2.0
5    A  3.0
6    C  1.0
7    B  3.0
8    C  1.0
9    A  2.0
10   D  NaN
11   D  NaN
12   D  NaN
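For comparison, the per-group approach from the question can also be made safe by guarding against an empty mode. This is a sketch of an alternative, not the answer's method, and it will likely be slower than fast_mode on large data. The helper name fill_with_group_mode is hypothetical. It relies on Series.mode returning its values in sorted order, so ties resolve to the smallest value.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data shown above.
df = pd.DataFrame({
    "CIK": list("CCBBAACBCADDD"),
    "SIK": [2.0, 1.0, np.nan, 3.0, np.nan, 3.0, np.nan,
            np.nan, 1.0, 2.0, np.nan, np.nan, np.nan],
})

def fill_with_group_mode(s):
    """Fill NaN in one group's values with that group's mode, if it has one."""
    modes = s.mode()
    if modes.empty:
        # All values are missing: nothing to fill with, leave the group as is.
        return s
    # modes is sorted, so ties resolve to the smallest modal value.
    return s.fillna(modes.iloc[0])

df["SIK"] = df.groupby("CIK")["SIK"].transform(fill_with_group_mode)
print(df)
```

Group D has no observed values, so its rows remain NaN here as well.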
