Select the max row per group - pandas performance issue


Problem description

I'm selecting one max row per group: I use groupby/agg to return index values, then select the rows with loc.

For example, to group by "Id" and then select the row with the highest "delta" value:

selected_idx = df.groupby("Id").apply(lambda df: df.delta.argmax())
selected_rows = df.loc[selected_idx, :]

However, this approach is very slow. In fact, my i7/16GB-RAM laptop hangs when I run this query on 13 million rows.

I have two questions for the experts:

  1. How can I make this query run fast in pandas? What am I doing wrong?
  2. Why is this operation so expensive?


[Update] Thank you so much for @unutbu's analysis! sort_drop it is! On my i7/32GB-RAM machine, groupby+idxmax hung for nearly 14 hours (it never returned anything), but sort_drop handled it in LESS THAN A MINUTE!

I still need to look at how pandas implements each method, but the problem is solved for now! I love StackOverflow.

Answer

The fastest option depends not only on the length of the DataFrame (in this case, around 13M rows) but also on the number of groups. Below are perfplots comparing a number of ways of finding the maximum in each group:

If there are only a few (large) groups, using_idxmax may be the fastest option:

If there are many (small) groups and the DataFrame is not too large, using_sort_drop may be the fastest option:

Keep in mind, however, that while using_sort_drop, using_sort and using_rank start out looking very fast, their speed advantage over the other options disappears quickly as N = len(df) increases. For large enough N, using_idxmax becomes the fastest option, even when there are many groups.

using_sort_drop, using_sort and using_rank sort the DataFrame (or the groups within it). Sorting is O(N * log(N)) on average, while the other methods use O(N) operations. This is why methods like using_idxmax beat using_sort_drop for very large DataFrames.
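As an aside (this variant is not among the benchmarked kernels, and the name using_transform is my own), another sort-free approach broadcasts each group's maximum back to the original rows with transform and builds a boolean mask with one vectorized comparison. Note that, unlike idxmax, it keeps every row that ties for the group maximum:

```python
import pandas as pd

def using_transform(df):
    # transform('max') returns a Series aligned with df, holding each
    # row's group maximum, so the comparison is a single vectorized pass.
    mask = df['delta'] == df.groupby('Id')['delta'].transform('max')
    return df.loc[mask]

df = pd.DataFrame({'Id': [1, 1, 2, 2, 2], 'delta': [3, 7, 5, 9, 1]})
print(using_transform(df))  # rows (Id=1, delta=7) and (Id=2, delta=9)
```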

Be aware that benchmark results may vary for a number of reasons, including machine specs, OS, and software versions. So it is important to run benchmarks on your own machine, with test data tailored to your situation.

Based on the perfplots above, using_sort_drop may be an option worth considering for your DataFrame of 13 million rows, especially if it has many (small) groups. Otherwise, I would expect using_idxmax to be the fastest option -- but again, it's important that you check benchmarks on your own machine.

Here is the setup I used to make the perfplots:

import numpy as np
import pandas as pd 
import perfplot

def make_df(N):
    # lots of small groups
    df = pd.DataFrame(np.random.randint(N//10+1, size=(N, 2)), columns=['Id','delta'])
    # few large groups
    # df = pd.DataFrame(np.random.randint(10, size=(N, 2)), columns=['Id','delta'])
    return df


def using_idxmax(df):
    return df.loc[df.groupby("Id")['delta'].idxmax()]

def max_mask(s):
    i = np.asarray(s).argmax()
    result = [False]*len(s)
    result[i] = True
    return result

def using_custom_mask(df):
    mask = df.groupby("Id")['delta'].transform(max_mask)
    return df.loc[mask]

def using_isin(df):
    idx = df.groupby("Id")['delta'].idxmax()
    mask = df.index.isin(idx)
    return df.loc[mask]

def using_sort(df):
    df = df.sort_values(by=['delta'], ascending=False, kind='mergesort')
    return df.groupby('Id', as_index=False).first()

def using_rank(df):
    mask = (df.groupby('Id')['delta'].rank(method='first', ascending=False) == 1)
    return df.loc[mask]

def using_sort_drop(df):
    # Thanks to jezrael
    # https://stackoverflow.com/questions/50381064/select-the-max-row-per-group-pandas-performance-issue/50389889?noredirect=1#comment87795818_50389889
    return df.sort_values(by=['delta'], ascending=False, kind='mergesort').drop_duplicates('Id')

def using_apply(df):
    # The original (slow) approach from the question. Note that in modern
    # pandas, Series.argmax returns a *positional* index; df.loc works here
    # only because make_df produces a default RangeIndex.
    selected_idx = df.groupby("Id").apply(lambda df: df.delta.argmax())
    return df.loc[selected_idx]

def check(df1, df2):
    df1 = df1.sort_values(by=['Id','delta'], kind='mergesort').reset_index(drop=True)
    df2 = df2.sort_values(by=['Id','delta'], kind='mergesort').reset_index(drop=True)
    return df1.equals(df2)

perfplot.show(
    setup=make_df,
    kernels=[using_idxmax, using_custom_mask, using_isin, using_sort, 
             using_rank, using_apply, using_sort_drop],
    n_range=[2**k for k in range(2, 20)],
    logx=True,
    logy=True,
    xlabel='len(df)',
    repeat=75,
    equality_check=check)
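Before running the full perfplot sweep (which needs the perfplot package installed), it can be worth sanity-checking that the kernels agree on a small DataFrame; a minimal sketch using two of the kernels and the check function from the setup above:

```python
import numpy as np
import pandas as pd

def using_idxmax(df):
    return df.loc[df.groupby("Id")['delta'].idxmax()]

def using_sort_drop(df):
    return df.sort_values(by=['delta'], ascending=False,
                          kind='mergesort').drop_duplicates('Id')

def check(df1, df2):
    # Compare results up to row order and index labels.
    df1 = df1.sort_values(by=['Id', 'delta'], kind='mergesort').reset_index(drop=True)
    df2 = df2.sort_values(by=['Id', 'delta'], kind='mergesort').reset_index(drop=True)
    return df1.equals(df2)

np.random.seed(0)
df = pd.DataFrame(np.random.randint(100, size=(1000, 2)), columns=['Id', 'delta'])
print(check(using_idxmax(df), using_sort_drop(df)))  # prints True
```

Both kernels keep exactly one row per Id carrying that group's maximum delta, so the value-based comparison passes even when the maximum is tied.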


Another way to benchmark is to use IPython's %timeit:

In [55]:  df = make_df(2**20)

In [56]: %timeit using_sort_drop(df)
1 loop, best of 3: 403 ms per loop

In [57]: %timeit using_rank(df)
1 loop, best of 3: 1.04 s per loop

In [58]: %timeit using_idxmax(df)
1 loop, best of 3: 15.8 s per loop
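For one more point of comparison (not among the benchmarks above; the name using_lexsort is hypothetical), the same sort-based idea can be pushed mostly into NumPy: sort by (Id, delta) with np.lexsort, then keep the last row of each Id run, which holds that group's maximum:

```python
import numpy as np
import pandas as pd

def using_lexsort(df):
    # np.lexsort sorts by the *last* key first: primary key Id, then delta,
    # so within each Id run the final row carries the group maximum.
    order = np.lexsort((df['delta'].to_numpy(), df['Id'].to_numpy()))
    ids = df['Id'].to_numpy()[order]
    # A group ends where the next Id differs; mark those positions.
    last_in_group = np.empty(len(ids), dtype=bool)
    last_in_group[:-1] = ids[:-1] != ids[1:]
    last_in_group[-1] = True
    return df.iloc[order[last_in_group]]

df = pd.DataFrame({'Id': [2, 1, 2, 1, 3], 'delta': [5, 1, 9, 4, 2]})
print(using_lexsort(df))  # one max-delta row per Id: (1,4), (2,9), (3,2)
```

Whether this beats the pandas kernels would, as above, depend on N and the number of groups, so it should be benchmarked the same way.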

