In Pandas, how can I patch a dataframe with missing values with values from another dataframe given a similar index?


Question

From "Fill in missing row values in pandas dataframe":

I have the following dataframe and would like to fill in missing values.

mukey   hzdept_r    hzdepb_r    sandtotal_r silttotal_r
425897      0         61        
425897      61        152          5.3         44.7
425911      0         30           30.1        54.9
425911      30        74           17.7        49.8
425911      74        84        

I want each missing value to be the average of values corresponding to that mukey. In this case, e.g. the first row missing values will be the average of sandtotal_r and silttotal_r corresponding to mukey==425897. pandas fillna doesn't seem to do the trick. Any help?

While the code works for the sample dataframe in that example, it is failing on the larger dataset I have uploaded here: https://www.dropbox.com/s/w3m0jppnq74op4c/www004.csv?dl=0

import pandas as pd
df = pd.read_csv('www004.csv')
# CSV file is here: https://www.dropbox.com/s/w3m0jppnq74op4c/www004.csv?dl=0
df1 = df.set_index('mukey')
df1.fillna(df.groupby('mukey').mean(),inplace=True)
df1.reset_index()

I get the error: InvalidIndexError. Why is it not working?

Recommended answer

Use combine_first. It lets you patch the missing data in the left dataframe with the matching data from the right dataframe, aligned on the index.

In this case, df1 is the dataframe on the left and df2, the per-mukey means, is the one on the right.
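To try the idea without downloading the CSV, here is a minimal sketch on the five-row sample from the question. The io.StringIO wrapper and the name patched are only for illustration and are not part of the original answer; the sketch also assumes a pandas version in which combine_first accepts a repeated index on the left-hand frame, as the session on www004.csv does.

```python
import io

import pandas as pd

# The question's five-row sample; blank cells are read as NaN.
sample = io.StringIO(
    "mukey,hzdept_r,hzdepb_r,sandtotal_r,silttotal_r\n"
    "425897,0,61,,\n"
    "425897,61,152,5.3,44.7\n"
    "425911,0,30,30.1,54.9\n"
    "425911,30,74,17.7,49.8\n"
    "425911,74,84,,\n"
)
df = pd.read_csv(sample)

df1 = df.set_index("mukey")       # left: the original rows, mukey repeats
df2 = df.groupby("mukey").mean()  # right: one row of column means per mukey

# Keep df1's value where it exists; where df1 is NaN, take df2's value
# for the same mukey.
patched = df1.combine_first(df2)
print(patched.reset_index())
```

Only the NaN cells of df1 are touched, so the depth columns and the measured sand and silt values keep their original numbers.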

In [48]: df = pd.read_csv('www004.csv')
    ...: df1 = df.set_index('mukey')
    ...: df2 = df.groupby('mukey').mean()

In [49]: df1.loc[426178,:]
Out[49]: 
        hzdept_r  hzdepb_r  sandtotal_r  silttotal_r  claytotal_r   om_r
mukey                                                                   
426178         0        36          NaN          NaN          NaN  72.50
426178        36        66          NaN          NaN          NaN  72.50
426178        66       152         42.1         37.9           20   0.25

In [50]: df2.loc[426178,:]
Out[50]: 
hzdept_r       34.000000
hzdepb_r       84.666667
sandtotal_r    42.100000
silttotal_r    37.900000
claytotal_r    20.000000
om_r           48.416667
Name: 426178, dtype: float64

In [51]: df3 = df1.combine_first(df2)
    ...: df3.loc[426178,:]
Out[51]: 
        hzdept_r  hzdepb_r  sandtotal_r  silttotal_r  claytotal_r   om_r
mukey                                                                   
426178         0        36         42.1         37.9           20  72.50
426178        36        66         42.1         37.9           20  72.50
426178        66       152         42.1         37.9           20   0.25

Note that the following rows still won't have values in the resulting df3:

426162
426163
426174
426174
426255

because they were single rows to begin with, hence, .mean() doesn't mean anything to them (eh, see what I did there?).
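A small illustration of why such rows stay empty, on made-up data (the mukey values 1 and 2 and the name unfillable are hypothetical, not taken from www004.csv). When a mukey has no observed value in a column, its group mean for that column is itself NaN, so there is nothing to patch with. The same check could be used to list the affected mukeys in a real dataset.

```python
import numpy as np
import pandas as pd

# Made-up data: mukey 1 has no sand/silt measurement at all, mukey 2 has one.
df = pd.DataFrame({
    "mukey": [1, 2, 2],
    "sandtotal_r": [np.nan, 40.0, np.nan],
    "silttotal_r": [np.nan, 35.0, np.nan],
})

# mean() skips NaN, so a group with only NaN in a column averages to NaN.
means = df.groupby("mukey").mean()

# mukeys whose mean is NaN in at least one column: their gaps cannot be filled.
unfillable = means.index[means.isna().any(axis=1)].tolist()
print(unfillable)
```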
