python pandas, DF.groupby().agg(), column reference in agg()

Views: 252 | Published: 2018/5/30 13:37:26 | Tags: python pandas group-by split-apply-combine


Problem description


On a concrete problem, say I have a DataFrame DF:

      word tag  count
    0    a   S     30
    1  the   S     20
    2    a   T     60
    3   an   T      5
    4  the   T     10

I want to find, for every "word", the "tag" that has the most "count". So the return would be something like

      word tag  count
    1  the   S     20
    2    a   T     60
    3   an   T      5
I don't care about the count column or if the order/Index is original or messed up. Returning a dictionary {'the' : 'S', ...} is just fine.

I hope I can do

    DF.groupby(['word']).agg(lambda x: x['tag'][x['count'].argmax()])

but it doesn't work. I can't access column information.

More abstractly, what does the function in agg(function) see as its argument?

By the way, is .agg() the same as .aggregate()?

Many thanks.
Solution
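agg is the same as aggregate. Its callable is passed the columns (Series objects) of the DataFrame, one at a time. That is why the lambda in the question cannot look up x['tag'] or x['count']: each x is a single column, not a row or a sub-DataFrame.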
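To see what the callable receives, here is a minimal sketch that rebuilds the question's DataFrame and records the type and name of each argument. The names probe and seen are made up for this illustration, and how many times pandas calls the function may differ between versions, so the sketch only looks at the distinct values.

```python
import pandas as pd

df = pd.DataFrame({
    'word': 'a the a an the'.split(),
    'tag': list('SSTTT'),
    'count': [30, 20, 60, 5, 10],
})

seen = []  # (type name, column name) of every argument passed to probe


def probe(col):
    # Hypothetical helper: col is one column of one group, passed as a Series.
    seen.append((type(col).__name__, col.name))
    return col.iloc[0]  # agg needs one scalar back per column per group


result = df.groupby('word').agg(probe)

# Distinct (type, column) pairs the callable was given
print(sorted(set(seen)))
print(result)
```

Because each call only sees one column, any logic that needs both tag and count has to go through idxmax with loc, or through apply.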

You could use idxmax to collect the index labels of the rows with the maximum count:

    idx = df.groupby('word')['count'].idxmax()
    print(idx)

yields

    word
    a      2
    an     3
    the    1
    Name: count

and then use loc to select those rows in the word and tag columns:

    print(df.loc[idx, ['word', 'tag']])

yields

      word tag
    2    a   T
    3   an   T
    1  the   S
Note that idxmax returns index labels. df.loc can be used to select rows by label. But if the index is not unique (that is, if there are rows with duplicate index labels) then df.loc will select all rows with the labels listed in idx. So check that df.index.is_unique is True before you use idxmax with df.loc.
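The duplicate-label caveat can be reproduced with a small sketch. The index labels below are invented for illustration, and reset_index(drop=True) is just one way to get a unique index back; dup and clean are hypothetical names.

```python
import pandas as pd

# Same data as in the question, but labels 1 and 2 are each used twice
dup = pd.DataFrame(
    {'word': ['a', 'the', 'a', 'an', 'the'],
     'tag': list('SSTTT'),
     'count': [30, 20, 60, 5, 10]},
    index=[0, 1, 1, 2, 2],
)
print(dup.index.is_unique)  # False

idx = dup.groupby('word')['count'].idxmax()
# Each label in idx matches two rows, so loc returns more than one row per word
ambiguous = dup.loc[idx, ['word', 'tag']]
print(len(ambiguous))

# Assumed fix: replace the index with a fresh unique one before using idxmax
clean = dup.reset_index(drop=True)
idx_clean = clean.groupby('word')['count'].idxmax()
unique_result = clean.loc[idx_clean, ['word', 'tag']]
print(unique_result)
```

With the unique index, loc returns exactly one row per word again.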

Alternatively, you could use apply. apply's callable is passed a sub-DataFrame which gives you access to all the columns:

    import pandas as pd

    df = pd.DataFrame({'word': 'a the a an the'.split(),
                       'tag': list('SSTTT'),
                       'count': [30, 20, 60, 5, 10]})

    print(df.groupby('word').apply(lambda subf: subf['tag'][subf['count'].idxmax()]))

yields

    word
    a      T
    an     T
    the    S

Using idxmax and loc is typically faster than apply, especially for large DataFrames. Using IPython's %timeit:

    N = 10000
    df = pd.DataFrame({'word': 'a the a an the'.split()*N,
                       'tag': list('SSTTT')*N,
                       'count': [30, 20, 60, 5, 10]*N})

    def using_apply(df):
        return (df.groupby('word').apply(lambda subf: subf['tag'][subf['count'].idxmax()]))

    def using_idxmax_loc(df):
        idx = df.groupby('word')['count'].idxmax()
        return df.loc[idx, ['word', 'tag']]

    In [22]: %timeit using_apply(df)
    100 loops, best of 3: 7.68 ms per loop

    In [23]: %timeit using_idxmax_loc(df)
    100 loops, best of 3: 5.43 ms per loop
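Outside IPython, the same comparison can be sketched with the standard-library timeit module. This is only an illustration: the smaller N and the repeat count are arbitrary choices, the timings depend on the machine and the pandas version, and some pandas releases print a warning about grouping columns when apply is used this way. The sketch also checks that both functions pick the same tag for every word.

```python
import timeit

import pandas as pd

N = 1000  # smaller than in the answer so the sketch runs quickly
df = pd.DataFrame({'word': 'a the a an the'.split() * N,
                   'tag': list('SSTTT') * N,
                   'count': [30, 20, 60, 5, 10] * N})


def using_apply(df):
    # One Python-level call per group; returns a Series indexed by word
    return df.groupby('word').apply(
        lambda subf: subf['tag'][subf['count'].idxmax()])


def using_idxmax_loc(df):
    # Find the winning row labels first, then select them in one step
    idx = df.groupby('word')['count'].idxmax()
    return df.loc[idx, ['word', 'tag']]


# Both approaches should agree on the word -> tag mapping
apply_map = using_apply(df).to_dict()
loc_map = using_idxmax_loc(df).set_index('word')['tag'].to_dict()
print(apply_map == loc_map)

# Total seconds for 5 runs each; absolute numbers will vary
t_apply = timeit.timeit(lambda: using_apply(df), number=5)
t_loc = timeit.timeit(lambda: using_idxmax_loc(df), number=5)
print(f'apply: {t_apply:.4f}s  idxmax+loc: {t_loc:.4f}s')
```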

If you want a dictionary mapping words to tags, then you could use set_index and to_dict like this:

    In [36]: df2 = df.loc[idx, ['word', 'tag']].set_index('word')

    In [37]: df2
    Out[37]:
         tag
    word
    a      T
    an     T
    the    S

    In [38]: df2.to_dict()['tag']
    Out[38]: {'a': 'T', 'an': 'T', 'the': 'S'}
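Put together as a script rather than an IPython session, the steps above might look like the following sketch. The function name top_tag_per_word is hypothetical, and selecting the tag column before calling to_dict is a small variation on the df2.to_dict()['tag'] form shown in the answer.

```python
import pandas as pd


def top_tag_per_word(df):
    """Hypothetical helper: map each word to the tag with the highest count.

    Assumes df has a unique index and the columns word, tag and count.
    """
    idx = df.groupby('word')['count'].idxmax()  # row label of the max count per word
    return df.loc[idx].set_index('word')['tag'].to_dict()


df = pd.DataFrame({'word': 'a the a an the'.split(),
                   'tag': list('SSTTT'),
                   'count': [30, 20, 60, 5, 10]})

word_to_tag = top_tag_per_word(df)
print(word_to_tag)  # {'a': 'T', 'an': 'T', 'the': 'S'}
```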
