Elegant way to fill in a column with row values based on groups in pandas

Question

I have a dataframe as given below:

import numpy as np
import pandas as pd

data_file = pd.DataFrame({'person_id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3], 'ob.date': [np.nan] * 10,
                 'observation': ['Age', 'interviewdate', 'marital_status', 'interviewdate', 'Age', 'interviewdate', 'marital_status', 'Age', 'interviewdate', 'marital_status'],
                 'answer': [21, '21/08/2017', 'Single', '22/05/2217', 26, '11/03/2010', 'Single', 41, '31/09/2012', 'Married']
                 })

What I would like to do is fetch the date values from the answer column and put them in the ob.date column. The dataframe shows that person_id = 1 answered the question about Age on 21/08/2017 and the question about marital_status on 22/05/2017 (the sample data spells this date 22/05/2217).

This is what I tried, based on a suggestion from another SO post:

s = data_file[(data_file.observation == 'interviewdate')].set_index('person_id')['answer']
data_file['ob.date'] = data_file['person_id'].map(s)

But this does not work, because I get a duplicate index error. How can I avoid that issue and keep the solution efficient?
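The cause of the error can be checked in isolation. The following is a minimal sketch on a shortened, hypothetical version of the data (the names has_unique_index and duplicated_ids are illustrative): the lookup Series built from the interviewdate rows contains person_id 1 twice in its index, so a lookup by person_id has no single answer.

```python
import pandas as pd

# Shortened sample: person 1 has two interviewdate rows, person 2 has one
data_file = pd.DataFrame({
    'person_id': [1, 1, 1, 2],
    'observation': ['Age', 'interviewdate', 'interviewdate', 'interviewdate'],
    'answer': [21, '21/08/2017', '22/05/2217', '11/03/2010'],
})

# Same lookup Series as in the attempt above
s = data_file[data_file.observation == 'interviewdate'].set_index('person_id')['answer']

# The index of the lookup Series is not unique
has_unique_index = s.index.is_unique
duplicated_ids = s.index[s.index.duplicated()].unique().tolist()
print(has_unique_index, duplicated_ids)
```

Any person with more than one interviewdate row produces such a duplicate, which is why a per-group fill is a better fit here than a one-to-one mapping.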

So any elegant and efficient solution would be helpful. person_id = 1 has two date values, so all rows above each interviewdate observation should be filled with the value from the answer column of that interviewdate observation.

How can I get the expected output?

Recommended answer

It all depends on the data. First fill the new column from answer where the condition holds, then replace the missing values within each group by back filling and forward filling:

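# Copy answer into ob.date on interviewdate rows only; all other rows become NaN
data_file['ob.date'] = data_file.loc[data_file.observation == 'interviewdate', 'answer']
# Per person: back fill first, then forward fill; transform keeps the original index
data_file['ob.date'] = (data_file.groupby('person_id')['ob.date']
                                 .transform(lambda x: x.bfill().ffill()))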


print (data_file)

   person_id     ob.date     observation      answer
0          1  21/08/2017             Age          21
1          1  21/08/2017   interviewdate  21/08/2017
2          1  22/05/2217  marital_status      Single
3          1  22/05/2217   interviewdate  22/05/2217
4          2  11/03/2010             Age          26
5          2  11/03/2010   interviewdate  11/03/2010
6          2  11/03/2010  marital_status      Single
7          3  31/09/2012             Age          41
8          3  31/09/2012   interviewdate  31/09/2012
9          3  31/09/2012  marital_status     Married
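The same two steps can also be written without a Python lambda. This is a minimal sketch, assuming a pandas version that provides the groupby bfill and ffill methods; the names df, dates and filled are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'person_id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'ob.date': [np.nan] * 10,
    'observation': ['Age', 'interviewdate', 'marital_status', 'interviewdate',
                    'Age', 'interviewdate', 'marital_status',
                    'Age', 'interviewdate', 'marital_status'],
    'answer': [21, '21/08/2017', 'Single', '22/05/2217', 26, '11/03/2010',
               'Single', 41, '31/09/2012', 'Married'],
})

# Keep answer only on interviewdate rows; every other row becomes NaN
dates = df['answer'].where(df['observation'].eq('interviewdate'))

# Back fill within each person, then forward fill the trailing gaps
filled = dates.groupby(df['person_id']).bfill()
filled = filled.groupby(df['person_id']).ffill()

df['ob.date'] = filled
print(df)
```

Grouping on df['person_id'] in both passes keeps a date from leaking from one person into the next.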

Details

Back filling per group is applied first, because the interviewdate rows are edge rows: all rows before each of them belong to the same subgroup. Forward filling is then added to replace the trailing NaNs in each group, which bfill cannot reach. With back filling alone, the last rows of groups 2 and 3 stay NaN:

# Back fill only, applied right after the interviewdate assignment
data_file['ob.date'] = (data_file.groupby('person_id')['ob.date']
                                   .transform(lambda x: x.bfill()))


print (data_file)

   person_id     ob.date     observation      answer
0          1  21/08/2017             Age          21
1          1  21/08/2017   interviewdate  21/08/2017
2          1  22/05/2217  marital_status      Single
3          1  22/05/2217   interviewdate  22/05/2217
4          2  11/03/2010             Age          26
5          2  11/03/2010   interviewdate  11/03/2010
6          2         NaN  marital_status      Single
7          3  31/09/2012             Age          41
8          3  31/09/2012   interviewdate  31/09/2012
9          3         NaN  marital_status     Married
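The effect of the fill order can be seen on a single group. This is a minimal sketch using two hypothetical Series, person_1 and person_2, which mirror the ob.date values of the first two persons before any filling:

```python
import numpy as np
import pandas as pd

# ob.date of person 1 and person 2 right after the interviewdate assignment
person_1 = pd.Series([np.nan, '21/08/2017', np.nan, '22/05/2217'])
person_2 = pd.Series([np.nan, '11/03/2010', np.nan])

# Back fill alone leaves the trailing NaN of person 2 untouched
back_only = person_2.bfill()

# Back fill followed by forward fill closes that gap
both = person_2.bfill().ffill()

# For person 1 the order matters: the third row takes the later date
# with bfill first, and the earlier date with ffill first
bfill_first = person_1.bfill().ffill()
ffill_first = person_1.ffill().bfill()

print(back_only.tolist())
print(both.tolist())
print(bfill_first.tolist())
print(ffill_first.tolist())
```

Because the question asks for the rows above each interviewdate row to take its value, bfill has to run before ffill.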
