What is the most efficient way of counting occurrences in pandas?


Question

I have a large (about 12M rows) dataframe df with say:

df.columns = ['word','documents','frequency']

So the following ran in a timely fashion:

word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']

However, this is taking an unexpectedly long time to run:

Occurrences_of_Words = word_grouping[['word']].count().reset_index()

What am I doing wrong here? Is there a better way to count occurrences in a large dataframe?

df.word.describe()

ran pretty well, so I really did not expect this Occurrences_of_Words dataframe to take very long to build.

PS: If the answer is obvious and you feel the need to penalize me for asking this question, please include the answer as well. Thank you.

Answer

I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max. Both take some time to avoid missing values. (Compare with size.)

In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you'll do much better than that.
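
As a minimal sketch (using a small made-up frame with the question's column names rather than the real 12M-row data), the value_counts approach and the groupby-based alternatives look like this:

import pandas as pd

# Small stand-in for the 12M-row frame; column names follow the question.
df = pd.DataFrame({
    'word': ['apple', 'apple', 'banana', 'banana', 'banana', 'cherry'],
    'documents': [1, 2, 1, 3, 4, 2],
    'frequency': [5, 3, 7, 2, 9, 1],
})

# Suggested approach: value_counts skips the groupby machinery entirely.
Occurrences_of_Words = df['word'].value_counts()   # Series indexed by word

# Equivalent groupby-based count; count() also has to skip missing values per group.
occurrences_count = df.groupby('word')['word'].count()

# size() counts rows per group without the missing-value handling.
occurrences_size = df.groupby('word').size()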
