Python Pandas: concatenate rows with unique values

Question

In Python pandas I have a large data frame that looks like this (the printout below is sorted by column 'a'):

import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'b': ['foo2', 'bar2'] * 3,
                   'c': ['foo3', 'bar3'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y'],
                   'e': ['q2', 'w2', 'e2', 'r2', 't2', 'y2']})


     a     b     c  d   e
1  bar  bar2  bar3  w  w2
3  bar  bar2  bar3  r  r2
5  bar  bar2  bar3  y  y2
4  foo  foo2  foo3  t  t2
2  foo  foo2  foo3  e  e2
0  foo  foo2  foo3  q  q2

It contains about a dozen columns with duplicated values (a, b, c, ...) and a few with unique values (d, e). I would like to collapse the duplicated rows and collect the unique values into a single row per group, i.e.:

     a     b     c      d         e
1  bar  bar2  bar3  w,r,y  w2,r2,y2
4  foo  foo2  foo3  t,e,q  t2,e2,q2

We can safely assume that unique values occur only in 'd' and 'e', while the remaining columns are always duplicated.

One solution I can think of is to group by all the duplicated columns and then apply a concatenation to the unique values:

df.groupby([df.a, df.b, df.c]).apply(lambda x: "{%s}" % ', '.join(x.d))

One inconvenience is that I have to list all the duplicated columns if I want them in the output. A bigger problem is that I am concatenating only the strings in 'd', while 'e' is needed as well.

Any suggestions?

Recommended answer

I think you can do something like this:

>>> df.groupby(['a', 'b', 'c']).agg(lambda col: ','.join(col))
                   d         e
a   b    c                    
bar bar2 bar3  w,r,y  w2,r2,y2
foo foo2 foo3  q,e,t  q2,e2,t2
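
The result above carries a, b and c in the index. As a minimal sketch, assuming flat columns are wanted as in the desired output from the question, passing as_index=False to groupby keeps the grouping keys as ordinary columns. The name `result` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["foo", "bar"] * 3,
    "b": ["foo2", "bar2"] * 3,
    "c": ["foo3", "bar3"] * 3,
    "d": ["q", "w", "e", "r", "t", "y"],
    "e": ["q2", "w2", "e2", "r2", "t2", "y2"],
})

# as_index=False keeps a, b, c as regular columns instead of a MultiIndex.
# The lambda is applied to each remaining column (d and e) within each group.
result = df.groupby(["a", "b", "c"], as_index=False).agg(lambda col: ",".join(col))

print(result)
```

Values are joined in the order the rows appear in df, so the 'foo' group yields q,e,t rather than the t,e,q shown in the question's sorted printout.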

Another way, which avoids listing all the duplicated columns and names only the columns with unique values:

>>> gr_columns = [x for x in df.columns if x not in ['d','e']]
>>> df.groupby(gr_columns).agg(lambda col: ','.join(col))
                   d         e
a   b    c                    
bar bar2 bar3  w,r,y  w2,r2,y2
foo foo2 foo3  q,e,t  q2,e2,t2
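
The same idea can be wrapped in a small function. This is only a sketch: `collapse_rows` and its parameters are hypothetical names, not part of pandas. It assumes every column not listed in `value_cols` is a grouping key, and it casts values to str before joining, since str.join raises a TypeError on non-string values.

```python
import pandas as pd


def collapse_rows(df, value_cols, sep=","):
    """Hypothetical helper: group by every column not in value_cols, join the rest."""
    group_cols = [c for c in df.columns if c not in value_cols]
    return df.groupby(group_cols, as_index=False, sort=False).agg(
        # astype(str) guards against non-string values in the joined columns
        lambda col: sep.join(col.astype(str))
    )


df = pd.DataFrame({
    "a": ["foo", "bar"] * 3,
    "b": ["foo2", "bar2"] * 3,
    "c": ["foo3", "bar3"] * 3,
    "d": ["q", "w", "e", "r", "t", "y"],
    "e": ["q2", "w2", "e2", "r2", "t2", "y2"],
})

out = collapse_rows(df, ["d", "e"])
print(out)
```

With sort=False the groups come out in order of first appearance (foo before bar here); drop that argument to get the keys sorted as in the outputs above.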
