Pandas groupby: How to get a union of strings
Question
I have a dataframe like this:
   A         B       C
0  1  0.749065    This
1  2  0.301084      is
2  3  0.463468       a
3  4  0.643961  random
4  1  0.866521  string
5  2  0.120737       !
Calling
In [10]: print(df.groupby("A")["B"].sum())
will return
A
1 1.615586
2 0.421821
3 0.463468
4 0.643961
Now I would like to do "the same" for column "C". Because that column contains strings, sum() doesn't work (although you might think that it would concatenate the strings). What I would really like to see is a list or set of the strings for each group, i.e.
A
1 {This, string}
2 {is, !}
3 {a}
4 {random}
I have been trying to find ways to do this.
Series.unique() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) doesn't work, although
df.groupby("A")["B"]
is a
pandas.core.groupby.SeriesGroupBy object
so I was hoping any Series method would work. Any ideas?
Recommended answer
In [4]: df = read_csv(StringIO(data), sep=r'\s+')
In [5]: df
Out[5]:
   A         B       C
0  1  0.749065    This
1  2  0.301084      is
2  3  0.463468       a
3  4  0.643961  random
4  1  0.866521  string
5  2  0.120737       !
In [6]: df.dtypes
Out[6]:
A int64
B float64
C object
dtype: object
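The session above relies on a `data` string and on `read_csv` and `StringIO` already being imported, none of which is shown. A minimal self-contained sketch of that setup, assuming `data` is simply the table from the question pasted as whitespace-separated text (the variable contents are an assumption, not part of the original answer):

```python
import io

import pandas as pd

# Assumed contents of `data`: the question's table as plain text.
# The header has three names but each row has four fields, so pandas
# treats the first field of every row as the index.
data = """A B C
0 1 0.749065 This
1 2 0.301084 is
2 3 0.463468 a
3 4 0.643961 random
4 1 0.866521 string
5 2 0.120737 !
"""

# r'\s+' splits on any run of whitespace.
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
print(df)
```

The raw-string regex `r'\s+'` matters here: a plain `'s+'` would split on the letter "s" rather than on whitespace.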
When you apply your own function, non-numeric columns are not automatically excluded. This is slower, though, than applying .sum() directly to the groupby.
In [8]: df.groupby('A').apply(lambda x: x.sum())
Out[8]:
   A         B           C
A
1  2  1.615586  Thisstring
2  4  0.421821         is!
3  3  0.463468           a
4  4  0.643961      random
sum concatenates strings by default:
In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
Out[9]:
A
1 Thisstring
2 is!
3 a
4 random
dtype: object
You can do pretty much whatever you want:
In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
Out[11]:
A
1 {This, string}
2 {is, !}
3 {a}
4 {random}
dtype: object
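The result above is a formatted string that only looks like a set. If actual Python `set` or `list` objects per group are wanted, as the question asks, one possible sketch is to pass the built-in constructors to `apply`. The frame is rebuilt inline here so the example runs on its own; the names `as_sets` and `as_lists` are illustrative only.

```python
import pandas as pd

# Same data as in the question, built directly instead of parsed from text.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": [0.749065, 0.301084, 0.463468, 0.643961, 0.866521, 0.120737],
    "C": ["This", "is", "a", "random", "string", "!"],
})

# Each group's strings collected into a real set (unordered, deduplicated).
as_sets = df.groupby("A")["C"].apply(set)

# Each group's strings collected into a list (keeps row order and duplicates).
as_lists = df.groupby("A")["C"].apply(list)

print(as_sets)
print(as_lists)
```

A set drops duplicates and ordering, which matches the "union" in the title; a list is the better fit when the original order of the strings matters.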
Doing this on a whole frame, one group at a time. The key is to return a Series:
def f(x):
    return Series(dict(A = x['A'].sum(),
                       B = x['B'].sum(),
                       C = "{%s}" % ', '.join(x['C'])))
In [14]: df.groupby('A').apply(f)
Out[14]:
   A         B               C
A
1  2  1.615586  {This, string}
2  4  0.421821         {is, !}
3  3  0.463468             {a}
4  4  0.643961        {random}
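As an alternative to returning a Series from `apply`, a per-column aggregation can also be expressed by passing a dict to `agg`. This is a sketch of a different technique from the one in the answer, not a replacement for it. It does not sum the grouping column `A` itself, which simply becomes the index; the name `result` is illustrative.

```python
import pandas as pd

# Same data as in the question, built directly so the example is self-contained.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": [0.749065, 0.301084, 0.463468, 0.643961, 0.866521, 0.120737],
    "C": ["This", "is", "a", "random", "string", "!"],
})

# One aggregation per column: a built-in sum for the numbers,
# and a custom join for the strings.
result = df.groupby("A").agg({
    "B": "sum",
    "C": lambda s: "{%s}" % ", ".join(s),
})

print(result)
```

Using the string name `"sum"` for the numeric column lets pandas use its optimized implementation, so only the string column goes through the slower Python-level function.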