How do I improve the performance of pandas GroupBy filter operation?

Question

This is my first time asking a question.

I'm working with a large CSV dataset (it contains over 15 million rows and is over 1.5 GB in size).

I'm loading the extracts into Pandas dataframes running in Jupyter Notebooks to derive an algorithm based on the dataset. I group the data by MAC address, which results in 1+ million groups.

Core to my algorithm development is running this operation:

pandas.core.groupby.DataFrameGroupBy.filter

Running this operation takes 3 to 5 minutes, depending on the data set. To develop this algorithm, I must execute this operation hundreds, perhaps thousands of times.

This operation appears to be CPU-bound and uses only one of the several cores available on my machine. I spent a few hours researching potential solutions online. I tried using both numba and dask to accelerate this operation, and both attempts resulted in exceptions.

Numba provided a message to the effect of "this should not have happened, thank you for helping improve the product". Dask, it appears, may not implement the DataFrameGroupBy.filter operation. I could not determine how to rewrite my code to use pool/map.
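For reference, the pool/map rewrite I had in mind looks roughly like the sketch below. The helper names (parallel_filter, _filter_chunk) are mine, the column names match my example further down, and I don't know whether the cost of pickling large chunks to the workers would erase any gains (in a notebook the worker function would also need to live in an importable module):

import multiprocessing as mp
import pandas as pd

def _filter_chunk(chunk):
    # Same per-group condition as the original filter call below.
    return chunk.groupby('mac').filter(lambda x: x['latency'].count() > 1)

def parallel_filter(df, n_workers=4):
    # Partition rows so that each 'mac' group lands entirely in one chunk;
    # splitting a group across chunks would corrupt its count.
    codes = pd.factorize(df['mac'])[0]
    chunks = [df[codes % n_workers == i] for i in range(n_workers)]
    with mp.Pool(n_workers) as pool:
        parts = pool.map(_filter_chunk, chunks)
    return pd.concat(parts)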

I'm looking for suggestions on how to accelerate this operation:

pandas.core.groupby.DataFrameGroupBy.filter

Here is an example of this operation in my code. There are other examples, all of which seem to have about the same execution time.

import pandas as pd

def import_data(_file, _columns):
    df = pd.read_csv(_file, low_memory=False)
    df[_columns] = df[_columns].apply(pd.to_numeric, errors='coerce')
    df = df.sort_values(by=['mac', 'time'])
    # The line below takes ~3 to 5 minutes to run
    df = df.groupby(['mac']).filter(lambda x: x['latency'].count() > 1)
    return df

How can I speed this up?

Answer

filter is generally known to be slow when used with GroupBy. If you are trying to filter a DataFrame based on a per-group condition, a better alternative is to use transform or map:

df[df.groupby('mac')['latency'].transform('count').gt(1)]

df[df['mac'].map(df.groupby('mac')['latency'].count()).gt(1)]
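Both versions build a per-row boolean mask instead of calling a Python lambda once per group, which is what makes filter so slow with over a million groups: transform('count') broadcasts each group's count back onto its rows in a single vectorized pass, while the map version looks each row's mac up in a precomputed table of per-group counts. Note that count() ignores NaN values (which your to_numeric(errors='coerce') step can introduce), so transform('count') preserves the semantics of your original filter.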
