Pandas: General Data Imputation Based on Column Dtype
Question
I'm working with a dataset with ~80 columns, many of which contain NaN. I definitely don't want to manually inspect the dtype of each column and impute based on that.
So I wrote a function to impute a column's missing values based on its dtype:
import pandas as pd

def impute_df(df, col):
    # numeric column: impute with the mean
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    # any other dtype: impute with the most frequent value (mode)
    else:
        df[col] = df[col].fillna(df[col].mode()[0])
But to use this, I'd have to loop over all columns in my DataFrame, something like:
for col in train_df.columns:
    impute_df(train_df, col)
And I know looping in Pandas is generally slow. Is there a better way of going about this?
Thanks!
Recommended answer
I think you need select_dtypes to split the columns into numeric and non-numeric groups, and then apply fillna to each group:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'),
                   'B': [np.nan, 5, 4, 5, 5, 4],
                   'C': [7, 8, np.nan, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': ['a', 'a', 'b', 'b', 'b', np.nan]})
print(df)
   A    B    C  D  E    F
0  a  NaN  7.0  1  5    a
1  b  5.0  8.0  3  3    a
2  c  4.0  NaN  5  6    b
3  d  5.0  4.0  7  9    b
4  e  5.0  2.0  1  2    b
5  f  4.0  3.0  0  4  NaN
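# numeric columns (int and float)
cols1 = df.select_dtypes([np.number]).columns
# everything else (here: object/string columns)
cols2 = df.select_dtypes(exclude=[np.number]).columns
# fill numeric NaNs with each column's mean
df[cols1] = df[cols1].fillna(df[cols1].mean())
# fill the remaining NaNs with each column's most frequent value
df[cols2] = df[cols2].fillna(df[cols2].mode().iloc[0])
print(df)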
   A    B    C  D  E  F
0  a  4.6  7.0  1  5  a
1  b  5.0  8.0  3  3  a
2  c  4.0  4.8  5  6  b
3  d  5.0  4.0  7  9  b
4  e  5.0  2.0  1  2  b
5  f  4.0  3.0  0  4  b
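For a wide frame like the ~80-column one in the question, the same idea can be wrapped in a small helper. The following is only a sketch: the function name impute_by_dtype and the variables num_cols and other_cols are made up for illustration, and the guard for columns that are entirely NaN (where mode() has nothing to return) is an added assumption, not part of the answer above.

```python
import numpy as np
import pandas as pd


def impute_by_dtype(df):
    """Return a copy of df with NaNs filled: mean for numeric columns, mode for the rest."""
    out = df.copy()
    num_cols = out.select_dtypes(include=[np.number]).columns
    other_cols = out.select_dtypes(exclude=[np.number]).columns

    if len(num_cols) > 0:
        # fillna with a Series aligns on column names, so each column gets its own mean
        out[num_cols] = out[num_cols].fillna(out[num_cols].mean())

    if len(other_cols) > 0:
        modes = out[other_cols].mode()
        # mode() yields an empty frame when every value is NaN; nothing to fill with then
        if not modes.empty:
            out[other_cols] = out[other_cols].fillna(modes.iloc[0])
    return out


# small stand-in for the asker's train_df (hypothetical data)
train_df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [np.nan, 5, 4, 5, 5, 4],
    'F': ['a', 'a', 'b', 'b', 'b', np.nan],
})
filled = impute_by_dtype(train_df)
print(filled.isna().sum().sum())
```

Working on a copy leaves the original frame untouched, which makes it easy to compare before and after. When a column has several equally frequent values, mode().iloc[0] picks the first one in sorted order, so it is worth checking that this tie-breaking is acceptable for your data.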