Fill missing values by a ratio of other values in Pandas


Question

I have a column in a pandas DataFrame with around 78% missing values.

The remaining 22% of the values are divided among three labels, SC, ST, and GEN, in the following ratios:

SC: 16%, ST: 8%, GEN: 76%

I need to replace the missing values with these three labels so that the ratio among all elements stays the same as above. The assignment can be random as long as the ratio is preserved.

How can I do this?

Recommended answer

Start with this DataFrame (built only to create something similar to your data):

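import numpy as np
import pandas as pd

# Sample 1000 labels with the ratios from the question
df = pd.DataFrame({'C1': np.random.choice(['SC', 'ST', 'GEN'], p=[0.16, 0.08, 0.76],
                                          size=1000)})
# Blank out a random 22% of the rows
df.loc[df.sample(frac=0.22).index, 'C1'] = np.nan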

This yields a column with 22% NaN, and the remaining proportions are similar to yours:

df['C1'].value_counts(normalize=True, dropna=False)
Out: 
GEN    0.583
NaN    0.220
SC     0.132
ST     0.065
Name: C1, dtype: float64

df['C1'].value_counts(normalize=True)
Out: 
GEN    0.747436
SC     0.169231
ST     0.083333
Name: C1, dtype: float64
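
The normalized counts shown here can be captured and reused as a probability vector for sampling, instead of typing the ratios by hand. A minimal sketch, assuming a Series of labels with some NaN entries; the helper name `observed_ratios` and the small sample column are hypothetical and only for illustration:

```python
import numpy as np
import pandas as pd

def observed_ratios(series):
    """Return (labels, probabilities) for the non-missing values of a Series."""
    # value_counts drops NaN by default; normalize=True makes the shares sum to 1
    ratios = series.value_counts(normalize=True)
    return ratios.index.to_numpy(), ratios.to_numpy()

# Small made-up column: 6 observed labels and 2 missing values
s = pd.Series(['GEN', 'GEN', 'GEN', 'SC', np.nan, np.nan, 'GEN', 'ST'])
labels, probs = observed_ratios(s)
```

The two arrays line up position by position, so they can be passed as the candidates and the `p` argument of a sampling call.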

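Now you can use fillna with np.random.choice. fillna aligns the replacement Series with the column by index, so only the NaN positions take a sampled value; this works here because both use the default integer index: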

df['C1'] = df['C1'].fillna(pd.Series(np.random.choice(['SC', 'ST', 'GEN'], 
                                                      p=[0.16, 0.08, 0.76], size=len(df))))

The resulting column has these proportions:

df['C1'].value_counts(normalize=True, dropna=False)
Out: 
GEN    0.748
SC     0.165
ST     0.087
Name: C1, dtype: float64
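
If the column does not have a default integer index, or if you would rather not hardcode the ratios, a variant is to draw one label per missing row and assign through a boolean mask. This is a sketch under those assumptions, not the answer's original code; the function name `fill_by_ratio` and the `seed` parameter are hypothetical:

```python
import numpy as np
import pandas as pd

def fill_by_ratio(series, seed=None):
    """Fill NaN in `series` by sampling from its own observed label ratios."""
    rng = np.random.default_rng(seed)
    # Observed shares of each label; NaN is excluded by default
    ratios = series.value_counts(normalize=True)
    mask = series.isna()
    filled = series.copy()
    # Draw exactly one label per missing row and assign by boolean mask,
    # so no index alignment is involved
    filled[mask] = rng.choice(ratios.index.to_numpy(),
                              size=int(mask.sum()),
                              p=ratios.to_numpy())
    return filled

# Made-up column with a non-default (string) index
s = pd.Series(['GEN', np.nan, 'SC', np.nan, 'GEN', 'ST'], index=list('abcdef'))
result = fill_by_ratio(s, seed=0)
```

Because the fill is sampled, the final ratios match the observed ones only approximately, and the match tightens as the number of filled rows grows. The sketch assumes the column has at least one non-missing value to estimate the ratios from.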
