What is the most efficient way of counting occurrences in pandas?


Problem Description

I have a large (about 12M rows) dataframe df with, say:

df.columns = ['word','documents','frequency']
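
For concreteness, here is a minimal toy stand-in for that dataframe (the values are invented purely for illustration; the real df has about 12M rows):

import pandas as pd

# Tiny made-up stand-in for the real ~12M-row dataframe.
df = pd.DataFrame({
    'word':      ['the', 'the', 'cat', 'cat', 'sat'],
    'documents': [1, 2, 1, 3, 2],
    'frequency': [10, 7, 3, 5, 1],
})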

So the following ran in a timely fashion:

word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']

However, this is taking an unexpectedly long time to run:

Occurrences_of_Words = word_grouping[['word']].count().reset_index()

What am I doing wrong here? Is there a better way to count occurrences in a large dataframe?

df.word.describe()

ran pretty well, so I really did not expect this Occurrences_of_Words dataframe to take very long to build.

P.S.: If the answer is obvious and you feel the need to penalize me for asking this question, please include the answer as well. Thank you.

Answer

I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be so much slower than max. Both take some time to avoid missing values. (Compare with size.)
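
As a minimal sketch (run against the toy df above), the direct one-liner next to the groupby-based alternatives it sidesteps:

# Fast path: count rows per word directly (returns a Series indexed by word).
Occurrences_of_Words = df['word'].value_counts()

# Near-equivalents via groupby: size() simply counts rows per group,
# while count() also skips missing values, which costs extra time.
per_word_size = df.groupby('word').size()
per_word_count = df.groupby('word')['word'].count()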

In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you'll do much better than that.
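
To verify on the real data, a rough timing sketch (timeit is standard library; the numbers will vary with your machine and dataframe):

import timeit

# Time each approach over a few runs on the actual df; lower is better.
print(timeit.timeit(lambda: df['word'].value_counts(), number=3))
print(timeit.timeit(lambda: df.groupby('word').size(), number=3))
print(timeit.timeit(lambda: df.groupby('word')['word'].count(), number=3))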
