What is the most efficient way of counting occurrences in pandas?
Question description
I have a large (about 12M rows) dataframe df with say:
df.columns = ['word','documents','frequency']
So the following ran in a timely fashion:
word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']
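For reference, the setup above can be reproduced end to end on a tiny sample; the five-row frame below is invented purely for illustration, standing in for the real 12M-row data:

```python
import pandas as pd

# Hypothetical sample data standing in for the real 12M-row frame
df = pd.DataFrame({
    'word': ['cat', 'dog', 'cat', 'cat', 'dog'],
    'documents': [1, 2, 3, 4, 5],
    'frequency': [10, 3, 7, 2, 5],
})

# Same steps as in the question: max frequency per word
word_grouping = df[['word', 'frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word', 'MaxFrequency']
```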
However, this is taking an unexpectedly long time to run:
Occurrences_of_Words = word_grouping[['word']].count().reset_index()
What am I doing wrong here? Is there a better way to count occurrences in a large dataframe?
df.word.describe()
ran pretty well, so I really did not expect this Occurrences_of_Words dataframe to take very long to build.
ps: If the answer is obvious and you feel the need to penalize me for asking this question, please include the answer as well. Thank you.
Recommended answer
I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max. Both take some time to avoid missing values. (Compare with size.)
In any case, value_counts has been specifically optimized to handle object dtype, like your words, so I doubt you'll do much better than that.
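A quick sketch of the alternatives discussed above, on invented sample data. It shows value_counts as the direct replacement, and the count-versus-size distinction: count skips missing values in the counted column, while size counts all rows per group:

```python
import pandas as pd

# Hypothetical sample data; the NaN in 'documents' illustrates count vs size
df = pd.DataFrame({
    'word': ['cat', 'dog', 'cat', 'cat', 'dog'],
    'documents': [1, 2, None, 4, 5],
})

# Direct replacement for the slow groupby-count: no groupby machinery
counts = df['word'].value_counts()

g = df.groupby('word')
# count() excludes missing values in the counted column
by_count = g['documents'].count()
# size() counts every row in each group, NaN or not
by_size = g.size()
```

On this sample, value_counts gives cat → 3 and dog → 2, while by_count reports only 2 for cat (the NaN document is dropped) and by_size reports 3.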