How do I improve the performance of pandas GroupBy filter operation?

Problem description

This is my first question.

I'm working with a large CSV dataset (it contains over 15 million rows and is over 1.5 GB in size).

I'm loading the extracts into Pandas dataframes running in Jupyter Notebooks to derive an algorithm based on the dataset. I group the data by MAC address, which results in 1+ million groups.

Core to my algorithm development is running this operation:

pandas.core.groupby.DataFrameGroupBy.filter

Running this operation takes 3 to 5 minutes, depending on the data set. To develop this algorithm, I must execute this operation hundreds, perhaps thousands of times.

This operation appears to be CPU bound and only uses one of several cores available on my machine. I spent a few hours researching potential solutions online. I've tried to use both numba and dask to accelerate this operation and both attempts resulted in exceptions.

Numba provided a message to the effect of "this should not have happened, thank you for helping improve the product". Dask, it appears, may not implement the DataFrameGroupBy.filter operation. I could not determine how to re-write my code to use pool/map.

I'm looking for suggestions on how to accelerate this operation:

pandas.core.groupby.DataFrameGroupBy.filter

Here is an example of this operation in my code. There are other examples, all of which seem to have about the same execution time.

import pandas as pd

def import_data(_file, _columns):
    df = pd.read_csv(_file, low_memory = False)
    df[_columns] = df[_columns].apply(pd.to_numeric, errors='coerce')
    df = df.sort_values(by=['mac', 'time'])
    # The line below takes ~3 to 5 minutes to run
    df = df.groupby(['mac']).filter(lambda x: x['latency'].count() > 1)
    return df

How can I speed this up?

Answer

filter is generally known to be slow when used with GroupBy. If you are trying to filter a DataFrame based on a conditional inside a GroupBy, a better alternative is to use transform or map:

df[df.groupby('mac')['latency'].transform('count').gt(1)]

df[df['mac'].map(df.groupby('mac')['latency'].count()).gt(1)]
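As a concrete illustration, the expensive filtering step in the question's import_data function could be rewritten with the transform approach. This is only a sketch: the mac, time, and latency column names are taken from the question's code, and the actual speedup will depend on the data.

import pandas as pd

def import_data(_file, _columns):
    df = pd.read_csv(_file, low_memory=False)
    df[_columns] = df[_columns].apply(pd.to_numeric, errors='coerce')
    df = df.sort_values(by=['mac', 'time'])
    # Count non-null latency values per MAC once, broadcast that count back
    # to every row, and keep rows whose group has more than one reading.
    keep = df.groupby('mac')['latency'].transform('count') > 1
    return df[keep]

The gain comes from computing the per-group count in a single vectorized pass and using it as a boolean mask, instead of invoking a Python lambda for each of the 1+ million groups, which is where GroupBy.filter spends most of its time.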
