Fill missing values by a ratio of other values in Pandas
Question
I have a column in a pandas DataFrame with around 78% missing values.
The remaining 22% of the values are divided among three labels (SC, ST, GEN) in the following ratios:
SC - 16%, ST - 8%, GEN - 76%
I need to replace the missing values with the three values above so that the ratio among all elements stays the same. The assignment can be random as long as the ratio is preserved.
How can I do this?
Recommended answer
Starting with this DataFrame (only to create something similar to yours):
import numpy as np
import pandas as pd

df = pd.DataFrame({'C1': np.random.choice(['SC', 'ST', 'GEN'], p=[0.16, 0.08, 0.76],
                                          size=1000)})
# Blank out a random 22% of the rows to simulate missing values
df.loc[df.sample(frac=0.22).index, 'C1'] = np.nan
This yields a column with 22% NaN, and the remaining proportions are similar to yours:
df['C1'].value_counts(normalize=True, dropna=False)
Out:
GEN 0.583
NaN 0.220
SC 0.132
ST 0.065
Name: C1, dtype: float64
df['C1'].value_counts(normalize=True)
Out:
GEN 0.747436
SC 0.169231
ST 0.083333
Name: C1, dtype: float64
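The probabilities do not have to be typed by hand. As a minimal sketch, assuming a column like C1 above, they could be read from the non-missing values. The sample data and the names `shares`, `labels`, and `probs` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative column; the data is invented for this sketch.
s = pd.Series(['GEN', 'SC', np.nan, 'GEN', 'ST', np.nan, 'GEN', 'GEN'], name='C1')

# value_counts(normalize=True) drops NaN by default, so the shares sum to 1.
shares = s.value_counts(normalize=True)

labels = shares.index.to_numpy()  # the label values
probs = shares.to_numpy()         # probabilities in the same order as labels
```

`labels` and `probs` can then be passed to np.random.choice in place of the hard-coded lists.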
Now you can use fillna with np.random.choice:
df['C1'] = df['C1'].fillna(pd.Series(np.random.choice(['SC', 'ST', 'GEN'],
p=[0.16, 0.08, 0.76], size=len(df))))
The resulting column will have these proportions:
df['C1'].value_counts(normalize=True, dropna=False)
Out:
GEN 0.748
SC 0.165
ST 0.087
Name: C1, dtype: float64
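Note that fillna aligns the replacement Series on index labels, so the call above depends on the DataFrame having the default integer index. A sketch that avoids this assumption draws one value per missing entry and assigns it through a boolean mask. The helper name `fill_by_ratio`, the sample data, and the seed are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

def fill_by_ratio(s, labels, probs, seed=None):
    """Return a copy of s with NaN replaced by random draws from labels."""
    rng = np.random.default_rng(seed)
    out = s.copy()
    mask = out.isna()
    # Draw exactly one replacement per missing entry and assign by position.
    out.loc[mask] = rng.choice(labels, size=int(mask.sum()), p=probs)
    return out

# Made-up column with a non-default (string) index.
s = pd.Series(['GEN', np.nan, 'SC', np.nan, 'GEN', np.nan],
              index=list('abcdef'), name='C1')
filled = fill_by_ratio(s, ['SC', 'ST', 'GEN'], [0.16, 0.08, 0.76], seed=0)
```

Because the replacements are random draws, the final proportions match the target ratios only approximately, and the match gets closer as the number of missing values grows.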