How do I improve the performance of pandas GroupBy filter operation?

Problem description

This is my first question.

I'm working with a large CSV dataset (it contains over 15 million rows and is over 1.5 GB in size).

I'm loading the extracts into Pandas dataframes running in Jupyter Notebooks to derive an algorithm based on the dataset. I group the data by MAC address, which results in 1+ million groups.

Core to my algorithm development is running this operation:

pandas.core.groupby.DataFrameGroupBy.filter

Running this operation takes 3 to 5 minutes, depending on the data set. To develop this algorithm, I must execute this operation hundreds, perhaps thousands of times.

This operation appears to be CPU bound and only uses one of several cores available on my machine. I spent a few hours researching potential solutions online. I've tried to use both numba and dask to accelerate this operation and both attempts resulted in exceptions.

Numba provided a message to the effect of "this should not have happened, thank you for helping improve the product". Dask, it appears, may not implement the DataFrameGroupBy.filter operation. I could not determine how to re-write my code to use pool/map.

I'm looking for suggestions on how to accelerate this operation:

pandas.core.groupby.DataFrameGroupBy.filter

Here is an example of this operation in my code. There are other examples, all of which seem to have about the same execution time.

import pandas as pd

def import_data(_file, _columns):
    df = pd.read_csv(_file, low_memory = False)
    df[_columns] = df[_columns].apply(pd.to_numeric, errors='coerce')
    df = df.sort_values(by=['mac', 'time'])
    # The line below takes ~3 to 5 minutes to run
    df = df.groupby(['mac']).filter(lambda x: x['latency'].count() > 1)
    return df

How can I speed this up?

Answer

filter is generally known to be slow when used with GroupBy. If you are trying to filter a DataFrame based on a conditional inside a GroupBy, a better alternative is to use transform or map:

df[df.groupby('mac')['latency'].transform('count').gt(1)]

df[df['mac'].map(df.groupby('mac')['latency'].count()).gt(1)]
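As a concrete illustration, the expensive filtering step in the question's import_data function could be rewritten with the transform approach. This is only a sketch: the mac, time, and latency column names are taken from the question's code, and the actual speedup will depend on the data.

import pandas as pd

def import_data(_file, _columns):
    df = pd.read_csv(_file, low_memory=False)
    df[_columns] = df[_columns].apply(pd.to_numeric, errors='coerce')
    df = df.sort_values(by=['mac', 'time'])
    # Count non-null latency values per MAC once, broadcast that count back
    # to every row, and keep rows whose group has more than one reading.
    keep = df.groupby('mac')['latency'].transform('count') > 1
    return df[keep]

The gain comes from computing the per-group count in a single vectorized pass and using it as a boolean mask, instead of invoking a Python lambda for each of the 1+ million groups, which is where GroupBy.filter spends most of its time.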
