pandas - Merge nearly duplicate rows based on column value
Question
I have a pandas DataFrame with several rows that are near duplicates of each other, except for one value. My goal is to merge, or "coalesce", these rows into a single row without summing the numerical values.
Here is an example of what I'm working with:
Name  Sid   Use_Case  Revenue
A     xx01  Voice     $10.00
A     xx01  SMS       $10.00
B     xx02  Voice     $5.00
C     xx03  Voice     $15.00
C     xx03  SMS       $15.00
C     xx03  Video     $15.00
This is what I want:
Name  Sid   Use_Case           Revenue
A     xx01  Voice, SMS         $10.00
B     xx02  Voice              $5.00
C     xx03  Voice, SMS, Video  $15.00
The reason I don't want to sum the "Revenue" column is that my table is the result of a pivot over several time periods, where "Revenue" simply ends up listed multiple times instead of having a different value per "Use_Case".
What would be the best way to tackle this issue? I've looked into the groupby() function, but I still don't understand it very well.
Recommended answer
I think you can use groupby with aggregate, taking 'first' for Sid and Revenue and the custom function ', '.join for Use_Case:
df = df.groupby('Name').agg({'Sid': 'first',
                             'Use_Case': ', '.join,
                             'Revenue': 'first'}).reset_index()
# change column order
print(df[['Name', 'Sid', 'Use_Case', 'Revenue']])
  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS  $10.00
1    B  xx02              Voice   $5.00
2    C  xx03  Voice, SMS, Video  $15.00
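For anyone who wants to try this end to end, here is a minimal self-contained sketch. It rebuilds the sample table from the question by hand (Revenue is kept as a string such as "$10.00", which is an assumption about how the real data is stored) and then applies the same groupby/agg call. The variable name merged is only illustrative.

```python
import pandas as pd

# Sample data copied from the question; Revenue is assumed to be stored as text.
df = pd.DataFrame({
    "Name": ["A", "A", "B", "C", "C", "C"],
    "Sid": ["xx01", "xx01", "xx02", "xx03", "xx03", "xx03"],
    "Use_Case": ["Voice", "SMS", "Voice", "Voice", "SMS", "Video"],
    "Revenue": ["$10.00", "$10.00", "$5.00", "$15.00", "$15.00", "$15.00"],
})

# One row per Name: keep the first Sid and Revenue, join the Use_Case strings.
merged = (
    df.groupby("Name")
      .agg({"Sid": "first", "Use_Case": ", ".join, "Revenue": "first"})
      .reset_index()
)

# Put the columns back in the original order before printing.
merged = merged[["Name", "Sid", "Use_Case", "Revenue"]]
print(merged)
```

Note that ', '.join only works when every Use_Case value is a string; if the column can contain missing values, they would need to be dropped or converted first.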
A nice idea from the comments, thanks to Goyo: group by all the columns whose values repeat and join only Use_Case:
df = df.groupby(['Name', 'Sid', 'Revenue'])['Use_Case'].apply(', '.join).reset_index()
# change column order
print(df[['Name', 'Sid', 'Use_Case', 'Revenue']])
  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS  $10.00
1    B  xx02              Voice   $5.00
2    C  xx03  Voice, SMS, Video  $15.00
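The two approaches give the same result on the sample data, but they can differ when the "duplicate" columns are not actually identical within a Name. The sketch below uses made-up data (the "$12.00" row is hypothetical and not from the question) to show the difference: grouping by Name alone keeps only the first Revenue, while grouping by all three columns keeps a separate row for each distinct combination.

```python
import pandas as pd

# Hypothetical data: the two rows for A disagree on Revenue.
df = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "Sid": ["xx01", "xx01", "xx02"],
    "Use_Case": ["Voice", "SMS", "Voice"],
    "Revenue": ["$10.00", "$12.00", "$5.00"],
})

# Approach 1: group by Name only; 'first' silently drops the "$12.00" value.
by_name = (
    df.groupby("Name")
      .agg({"Sid": "first", "Use_Case": ", ".join, "Revenue": "first"})
      .reset_index()
)

# Approach 2: group by every column expected to repeat; rows with a
# different Revenue stay separate instead of being merged.
by_all = (
    df.groupby(["Name", "Sid", "Revenue"])["Use_Case"]
      .agg(", ".join)
      .reset_index()
)

print(by_name)
print(by_all)
```

Which behavior is preferable depends on the data: the second form is arguably safer if a mismatch should stay visible, while the first always yields exactly one row per Name.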