pandas - Merge nearly duplicate rows based on column value

Problem description

I have a pandas dataframe with several rows that are near duplicates of each other, except for one value. My goal is to merge or "coalesce" these rows into a single row, without summing the numerical values.

Here is an example of what I'm working with:

Name   Sid   Use_Case  Revenue
A      xx01  Voice     $10.00
A      xx01  SMS       $10.00
B      xx02  Voice     $5.00
C      xx03  Voice     $15.00
C      xx03  SMS       $15.00
C      xx03  Video     $15.00

And this is what I would like:

Name   Sid   Use_Case            Revenue
A      xx01  Voice, SMS          $10.00
B      xx02  Voice               $5.00
C      xx03  Voice, SMS, Video   $15.00

The reason I don't want to sum the "Revenue" column is because my table is the result of doing a pivot over several time periods where "Revenue" simply ends up getting listed multiple times instead of having a different value per "Use_Case".

What would be the best way to tackle this issue? I've looked into the groupby() function but I still don't understand it very well.

Recommended answer

I think you can use groupby with agg, taking 'first' for Sid and Revenue and the custom function ', '.join for Use_Case:

df = df.groupby('Name').agg({'Sid':'first', 
                             'Use_Case': ', '.join, 
                             'Revenue':'first' }).reset_index()

# change column order (print() call syntax for Python 3)
print(df[['Name','Sid','Use_Case','Revenue']])
  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS  $10.00
1    B  xx02              Voice   $5.00
2    C  xx03  Voice, SMS, Video  $15.00
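For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from the sample table in the question; the variable name merged and the sort=False argument (to keep the groups in their original order) are my own additions, not part of the answer.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "Name": ["A", "A", "B", "C", "C", "C"],
    "Sid": ["xx01", "xx01", "xx02", "xx03", "xx03", "xx03"],
    "Use_Case": ["Voice", "SMS", "Voice", "Voice", "SMS", "Video"],
    "Revenue": ["$10.00", "$10.00", "$5.00", "$15.00", "$15.00", "$15.00"],
})

# One output row per Name: keep the first Sid and Revenue seen in each group,
# and join all Use_Case strings of the group with ", "
merged = (
    df.groupby("Name", sort=False)
      .agg({"Sid": "first", "Use_Case": ", ".join, "Revenue": "first"})
      .reset_index()
)

print(merged[["Name", "Sid", "Use_Case", "Revenue"]])
```

Note that 'first' simply keeps whichever value appears first in each group, so this only gives the intended result when Sid and Revenue really are identical within a Name.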

Nice idea from comment, thanks Goyo:

df = df.groupby(['Name','Sid','Revenue'])['Use_Case'].apply(', '.join).reset_index()

# change column order (print() call syntax for Python 3)
print(df[['Name','Sid','Use_Case','Revenue']])
  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS  $10.00
1    B  xx02              Voice   $5.00
2    C  xx03  Voice, SMS, Video  $15.00
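If there are more "constant" columns than in the example, the same idea can probably be written without listing them by hand. The sketch below is only an illustration under that assumption: key_cols and merged are hypothetical names, and treating every column except Use_Case as a grouping key is my generalisation, not something stated in the answer.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "Name": ["A", "A", "B", "C", "C", "C"],
    "Sid": ["xx01", "xx01", "xx02", "xx03", "xx03", "xx03"],
    "Use_Case": ["Voice", "SMS", "Voice", "Voice", "SMS", "Video"],
    "Revenue": ["$10.00", "$10.00", "$5.00", "$15.00", "$15.00", "$15.00"],
})

# Assumption: every column except Use_Case is identical across near-duplicate rows
key_cols = [c for c in df.columns if c != "Use_Case"]

# Group on all key columns and join the Use_Case strings of each group
merged = (
    df.groupby(key_cols, sort=False)["Use_Case"]
      .agg(", ".join)
      .reset_index()
)

# Restore the original column order
merged = merged[df.columns.tolist()]
print(merged)
```

Compared with grouping on Name alone, grouping on all key columns keeps rows separate when, say, Revenue differs within a Name, instead of silently keeping the first value. One caveat: groupby drops rows whose key is NaN by default, so missing values in the key columns would need handling first.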
