Filtering in Pandas dataframe slow for "large" number of groups?


Question

I have a dataframe with about 200k rows, which I'm trying to filter as follows:

>>> df.groupby(key).filter(lambda group: len(group) > 100)

where key is a list of columns. This runs in about 3 seconds when the key specified divides the dataframe into 800 or so groups. However, if I add another column to the key, increasing the number of groups to around 2500, the execution sucks up all my memory and basically crashes my system unless I terminate the script.
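For concreteness, here is a minimal sketch of a setup like this (the column names, sizes, and group counts are invented for illustration; the original post does not give them):

import numpy as np
import pandas as pd

# Hypothetical reproduction: ~200k rows, a multi-column key.
n = 200_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'a': rng.integers(0, 40, n),
    'b': rng.integers(0, 20, n),
    # a date column stored as Timestamps, relevant to the answer below
    'ts': pd.Timestamp('2013-01-01') + pd.to_timedelta(rng.integers(0, 3, n), unit='D'),
    'value': rng.normal(size=n),
})

key = ['a', 'b']          # ~800 groups: runs in a few seconds
# key = ['a', 'b', 'ts']  # ~2400 groups: the problematic case
filtered = df.groupby(key).filter(lambda g: len(g) > 100)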

I can do the same by iterating over the groups, but it's clumsy compared to the above one-liner, and makes me wonder why the filter function is so limited.
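The loop version might be spelled as in the first snippet below. A common vectorized alternative is to compare each row against its group's size via transform, which avoids calling a Python lambda once per group (the 'value' column name is an assumption carried over from the sketch above):

# Iterating over the groups by hand (same threshold as the one-liner):
pieces = [g for _, g in df.groupby(key) if len(g) > 100]
filtered = pd.concat(pieces) if pieces else df.iloc[:0]

# Vectorized alternative: broadcast each group's size back to its rows.
filtered = df[df.groupby(key)['value'].transform('size') > 100]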

Could someone please explain to me if this is to be expected, and if so why?

Thanks!

Answer

I found a solution. One of the columns contains dates that I have represented as Timestamp objects. When I convert the Timestamp objects to strings, the grouping works quickly without problems!
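A sketch of that workaround, reusing the assumed 'ts' column name from the earlier snippet:

# Stringify the Timestamp column before grouping ('ts' is an assumed name).
df['ts'] = df['ts'].astype(str)
filtered = df.groupby(key).filter(lambda g: len(g) > 100)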
