Python Pandas: concatenate rows with unique values
Question
In pandas I have a large DataFrame that looks like this:
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'b': ['foo2', 'bar2'] * 3,
                   'c': ['foo3', 'bar3'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y'],
                   'e': ['q2', 'w2', 'e2', 'r2', 't2', 'y2']})
     a     b     c  d   e
1  bar  bar2  bar3  w  w2
3  bar  bar2  bar3  r  r2
5  bar  bar2  bar3  y  y2
4  foo  foo2  foo3  t  t2
2  foo  foo2  foo3  e  e2
0  foo  foo2  foo3  q  q2
It contains a dozen or so columns with duplicated values (a, b, c, ...) and a few columns with unique values (d, e). I would like to collapse the duplicated rows and collect the unique values into a single row per group, i.e.:
     a     b     c      d         e
1  bar  bar2  bar3  w,r,y  w2,r2,y2
4  foo  foo2  foo3  t,e,q  t2,e2,q2
We can safely assume that unique values occur only in 'd' and 'e', while the remaining columns are always duplicated.
One solution I can think of is to group by all the duplicated columns and then apply a concatenation to the unique values:
df.groupby([df.a, df.b, df.c]).apply(lambda x: "{%s}" % ', '.join(x.d))
One inconvenience is that I have to list all the duplicated columns if I want them in the output. The bigger problem is that this concatenates only the strings in 'd', while 'e' is needed as well.
Any suggestions?
Recommended answer
I think you can do something like this:
>>> df.groupby(['a', 'b', 'c']).agg(lambda col: ','.join(col))
                   d         e
a   b    c
bar bar2 bar3  w,r,y  w2,r2,y2
foo foo2 foo3  q,e,t  q2,e2,t2
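Note that this result carries a, b and c in the index rather than as ordinary columns. If a flat table like the one in the question is wanted, one option is to pass as_index=False to groupby (calling reset_index() on the result should work too). A minimal sketch, reusing the sample data from the question; the name `flat` is only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'b': ['foo2', 'bar2'] * 3,
                   'c': ['foo3', 'bar3'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y'],
                   'e': ['q2', 'w2', 'e2', 'r2', 't2', 'y2']})

# as_index=False keeps the grouping keys as regular columns
# instead of moving them into a MultiIndex.
flat = df.groupby(['a', 'b', 'c'], as_index=False).agg(lambda col: ','.join(col))
print(flat)
```

The joined values follow the row order of the original frame within each group, so the order of the pieces depends on how df is sorted before grouping.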
Another way, which avoids listing every grouping column, is to list only the columns with unique values and group by everything else:
>>> gr_columns = [x for x in df.columns if x not in ['d', 'e']]
>>> df.groupby(gr_columns).agg(lambda col: ','.join(col))
                   d         e
a   b    c
bar bar2 bar3  w,r,y  w2,r2,y2
foo foo2 foo3  q,e,t  q2,e2,t2
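The same idea can be wrapped in a small function. The sketch below is only an illustration under some assumptions: `collapse_rows` is a hypothetical helper (not a pandas function), the numeric column 'n' is made-up data, and map(str, ...) is added because ','.join raises a TypeError on non-string values:

```python
import pandas as pd

def collapse_rows(df, unique_cols, sep=','):
    """Group by every column not in unique_cols and join the rest with sep.

    Hypothetical helper for illustration; not part of pandas.
    """
    group_cols = [c for c in df.columns if c not in unique_cols]
    # map(str, ...) lets non-string values (e.g. integers) be joined too.
    return (df.groupby(group_cols, as_index=False)
              .agg(lambda col: sep.join(map(str, col))))

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'b': ['foo2', 'bar2'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y'],
                   'n': [1, 2, 3, 4, 5, 6]})  # made-up numeric column

result = collapse_rows(df, unique_cols=['d', 'n'])
print(result)
```

After joining, the collected columns hold strings, so 'n' is no longer numeric in the result.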