In Pandas, how can I patch a dataframe with missing values with values from another dataframe given a similar index?
Problem description
I have the following dataframe and would like to fill in missing values.
mukey   hzdept_r  hzdepb_r  sandtotal_r  silttotal_r
425897         0        61
425897        61       152          5.3         44.7
425911         0        30         30.1         54.9
425911        30        74         17.7         49.8
425911        74        84
I want each missing value to be the average of values corresponding to that mukey. In this case, e.g. the first row missing values will be the average of sandtotal_r and silttotal_r corresponding to mukey==425897. pandas fillna doesn't seem to do the trick. Any help?
While the code works for the sample dataframe in that example, it is failing on the larger dataset I have uploaded here: https://www.dropbox.com/s/w3m0jppnq74op4c/www004.csv?dl=0
import pandas as pd
df = pd.read_csv('www004.csv')
# CSV file is here: https://www.dropbox.com/s/w3m0jppnq74op4c/www004.csv?dl=0
df1 = df.set_index('mukey')
df1.fillna(df.groupby('mukey').mean(),inplace=True)
df1.reset_index()
I get the error: InvalidIndexError. Why is it not working?
Recommended answer
Use combine_first. It allows you to patch up the missing data in the left dataframe with the matching data from the right dataframe, based on the same index.
In this case, df1 is the one on the left and df2, the means, is the one on the right.
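As a minimal sketch of this patching step, the snippet below rebuilds the five sample rows from the question by hand instead of reading www004.csv. The names means and patched are only illustrative, and it assumes a pandas version where DataFrame.combine_first accepts a repeated index on the left side, as in the session shown in this answer.

```python
import numpy as np
import pandas as pd

# The five sample rows from the question; blanks are entered as NaN.
df = pd.DataFrame({
    "mukey":       [425897, 425897, 425911, 425911, 425911],
    "hzdept_r":    [0, 61, 0, 30, 74],
    "hzdepb_r":    [61, 152, 30, 74, 84],
    "sandtotal_r": [np.nan, 5.3, 30.1, 17.7, np.nan],
    "silttotal_r": [np.nan, 44.7, 54.9, 49.8, np.nan],
})

# Left side: the original rows, indexed by mukey (the index repeats).
left = df.set_index("mukey")

# Right side: one row of column means per mukey (NaN is skipped by mean()).
means = df.groupby("mukey").mean()

# Values already present in `left` win; only its NaN cells are taken
# from the row of `means` that has the same mukey.
patched = left.combine_first(means)

print(patched.reset_index())
```

Cells that already hold a value are left untouched, so hzdept_r and hzdepb_r keep their original numbers even though means also has entries for those columns.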
In [48]: df = pd.read_csv('www004.csv')
    ...: df1 = df.set_index('mukey')
    ...: df2 = df.groupby('mukey').mean()

In [49]: df1.loc[426178,:]
Out[49]:
        hzdept_r  hzdepb_r  sandtotal_r  silttotal_r  claytotal_r   om_r
mukey
426178         0        36          NaN          NaN          NaN  72.50
426178        36        66          NaN          NaN          NaN  72.50
426178        66       152         42.1         37.9           20   0.25

In [50]: df2.loc[426178,:]
Out[50]:
hzdept_r       34.000000
hzdepb_r       84.666667
sandtotal_r    42.100000
silttotal_r    37.900000
claytotal_r    20.000000
om_r           48.416667
Name: 426178, dtype: float64

In [51]: df3 = df1.combine_first(df2)
    ...: df3.loc[426178,:]
Out[51]:
        hzdept_r  hzdepb_r  sandtotal_r  silttotal_r  claytotal_r   om_r
mukey
426178         0        36         42.1         37.9           20  72.50
426178        36        66         42.1         37.9           20  72.50
426178        66       152         42.1         37.9           20   0.25
Note that the following rows still won't have values in the resulting df3:
426162
426163
426174
426174
426255
because they were single rows to begin with, so the group mean for their missing columns is itself NaN and .mean() doesn't mean anything to them (eh, see what I did there?).
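One way to list such keys up front is to look for NaN in the group means themselves. The sketch below uses a small made-up frame (the keys 1 and 3 and the names unfillable and still_missing are hypothetical, not taken from www004.csv) and makes the same assumption that combine_first accepts a repeated index on the left.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: key 3 is a single row with no sand value at all.
df = pd.DataFrame({
    "mukey":       [1, 1, 3],
    "sandtotal_r": [np.nan, 42.1, np.nan],
})

means = df.groupby("mukey").mean()

# A group with no observed value in a column has a NaN mean there,
# so there is nothing to patch its rows with.
unfillable = means.index[means.isna().any(axis=1)].tolist()

patched = df.set_index("mukey").combine_first(means)

# Rows that are still incomplete after patching.
still_missing = patched[patched.isna().any(axis=1)]

print(unfillable)
print(still_missing)
```

Whether those rows should be dropped or filled some other way depends on the data; the check only reports them.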