What is the most efficient way of counting occurrences in pandas?
Question
I have a large (about 12M rows) dataframe df with, say:
df.columns = ['word','documents','frequency']
So the following ran in a timely fashion:
word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']
However, this is taking an unexpectedly long time to run:
Occurrences_of_Words = word_grouping[['word']].count().reset_index()
What am I doing wrong here? Is there a better way to count occurrences in a large dataframe?
df.word.describe()
ran pretty well, so I really did not expect this Occurrences_of_Words dataframe to take very long to build.
P.S.: If the answer is obvious and you feel the need to penalize me for asking this question, please include the answer as well. Thank you.
Answer
I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max. Both take some time to avoid missing values. (Compare with size.)
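As a quick illustration of that count-versus-size difference, here is a minimal sketch on a toy frame I made up (not the poster's data):

import pandas as pd
import numpy as np

# Toy frame with one missing frequency value to show the difference
toy = pd.DataFrame({'word': ['a', 'a', 'b', 'b'],
                    'frequency': [1.0, np.nan, 3.0, 4.0]})
g = toy.groupby('word')
print(g['frequency'].count())  # non-NaN values per group: a -> 1, b -> 2
print(g.size())                # rows per group, NaN included: a -> 2, b -> 2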
In any case, value_counts has been specifically optimized to handle object dtype, like your words, so I doubt you'll do much better than that.
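For what it's worth, a minimal sketch of the value_counts route on a frame shaped like the one in the question (the toy data and the 'Occurrences' column name are my own choices, not from the original post):

import pandas as pd

# Small stand-in for the ~12M-row df described in the question
df = pd.DataFrame({'word':      ['the', 'cat', 'the', 'dog', 'the', 'cat'],
                   'documents': [1, 1, 2, 2, 3, 3],
                   'frequency': [5, 2, 3, 1, 4, 2]})

# One pass over the column, no groupby machinery
Occurrences_of_Words = df['word'].value_counts().reset_index()
Occurrences_of_Words.columns = ['word', 'Occurrences']
print(Occurrences_of_Words)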