What techniques can be used to measure performance of pandas/numpy solutions



    Question

    How do I measure the performance of the various functions below in a concise and comprehensive way?

    Example

    Consider the dataframe df

    df = pd.DataFrame({
            'Group': list('QLCKPXNLNTIXAWYMWACA'),
            'Value': [29, 52, 71, 51, 45, 76, 68, 60, 92, 95,
                      99, 27, 77, 54, 39, 23, 84, 37, 99, 87]
        })
    

    I want to sum up the Value column grouped by distinct values in Group. I have three methods for doing it.

    import pandas as pd
    import numpy as np
    from numba import njit
    
    
    def sum_pd(df):
        return df.groupby('Group').Value.sum()
    
    def sum_fc(df):
        f, u = pd.factorize(df.Group.values)
        v = df.Value.values
        return pd.Series(np.bincount(f, weights=v).astype(int), pd.Index(u, name='Group'), name='Value').sort_index()
    
    @njit
    def wbcnt(b, w, k):
        bins = np.arange(k)
        bins = bins * 0
        for i in range(len(b)):
            bins[b[i]] += w[i]
        return bins
    
    def sum_nb(df):
        b, u = pd.factorize(df.Group.values)
        w = df.Value.values
        bins = wbcnt(b, w, u.size)
        return pd.Series(bins, pd.Index(u, name='Group'), name='Value').sort_index()
    

    Are they the same?

    print(sum_pd(df).equals(sum_nb(df)))
    print(sum_pd(df).equals(sum_fc(df)))
    
    True
    True
    

    How fast are they?

    %timeit sum_pd(df)
    %timeit sum_fc(df)
    %timeit sum_nb(df)
    
    1000 loops, best of 3: 536 µs per loop
    1000 loops, best of 3: 324 µs per loop
    1000 loops, best of 3: 300 µs per loop
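
The `%timeit` magic only exists inside IPython/Jupyter. In a plain script, a rough equivalent of the "best of 3" figure can be sketched with the standard library `timeit` module. The helper name `time_call` and the stand-in workload `total` are hypothetical, chosen only so the sketch runs without pandas or numba; in practice one would pass `sum_pd`, `sum_fc` or `sum_nb` together with `df`.

```python
import timeit


def time_call(func, *args, repeat=3, number=1000):
    """Return the best per-call time in seconds over `repeat` runs.

    Each run calls func(*args) `number` times; the minimum total is
    divided by `number`, similar in spirit to %timeit's "best of 3".
    """
    totals = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(totals) / number


# Hypothetical stand-in workload so the sketch is self-contained.
def total(values):
    return sum(values)


best = time_call(total, list(range(1000)), repeat=3, number=200)
print(f"{best * 1e6:.1f} µs per call")
```

With the functions from the question this would be called as `time_call(sum_pd, df)`. Calling `sum_nb(df)` once beforehand keeps numba's compilation time out of the measurement.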
    

    Solution

    They might not qualify as "simple frameworks" because they are third-party modules that need to be installed, but there are two frameworks I often use: simple_benchmark and perfplot.

    For example, the simple_benchmark library lets you decorate the functions to benchmark:

    from simple_benchmark import BenchmarkBuilder
    b = BenchmarkBuilder()
    
    import pandas as pd
    import numpy as np
    from numba import njit
    
    @b.add_function()
    def sum_pd(df):
        return df.groupby('Group').Value.sum()
    
    @b.add_function()
    def sum_fc(df):
        f, u = pd.factorize(df.Group.values)
        v = df.Value.values
        return pd.Series(np.bincount(f, weights=v).astype(int), pd.Index(u, name='Group'), name='Value').sort_index()
    
    @njit
    def wbcnt(b, w, k):
        bins = np.arange(k)
        bins = bins * 0
        for i in range(len(b)):
            bins[b[i]] += w[i]
        return bins
    
    @b.add_function()
    def sum_nb(df):
        b, u = pd.factorize(df.Group.values)
        w = df.Value.values
        bins = wbcnt(b, w, u.size)
        return pd.Series(bins, pd.Index(u, name='Group'), name='Value').sort_index()
    

    Also decorate a function that produces the values for the benchmark:

    from string import ascii_uppercase
    
    def creator(n):  # taken from another answer here
        letters = list(ascii_uppercase)
        np.random.seed([3,1415])
        df = pd.DataFrame(dict(
                Group=np.random.choice(letters, n),
                Value=np.random.randint(100, size=n)
            ))
        return df
    
    @b.add_arguments('Rows in DataFrame')
    def argument_provider():
        for exponent in range(4, 22):
            size = 2**exponent
            yield size, creator(size)
    

    And then all you need to run the benchmark is:

    r = b.run()
    

    After that you can inspect the results as a plot (you need the matplotlib library for this):

    r.plot()
    

    If the functions are very similar in run time, the percentage difference could be more informative than the absolute numbers:

    r.plot_difference_percentage(relative_to=sum_nb) 
    

    Or get the times for the benchmark as a DataFrame (this needs pandas):

    r.to_pandas_dataframe()
    

               sum_pd    sum_fc    sum_nb
    16       0.000796  0.000515  0.000502
    32       0.000702  0.000453  0.000454
    64       0.000702  0.000454  0.000456
    128      0.000711  0.000456  0.000458
    256      0.000714  0.000461  0.000462
    512      0.000728  0.000471  0.000473
    1024     0.000746  0.000512  0.000513
    2048     0.000825  0.000515  0.000514
    4096     0.000902  0.000609  0.000640
    8192     0.001056  0.000731  0.000755
    16384    0.001381  0.001012  0.000936
    32768    0.001885  0.001465  0.001328
    65536    0.003404  0.002957  0.002585
    131072   0.008076  0.005668  0.005159
    262144   0.015532  0.011059  0.010988
    524288   0.032517  0.023336  0.018608
    1048576  0.055144  0.040367  0.035487
    2097152  0.112333  0.080407  0.072154
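
The relative comparison can also be derived by hand from a timing table like this one. The following is a minimal sketch, assuming the times are held in a pandas DataFrame with one column per function; it uses three rows copied from the table above and is not necessarily how `plot_difference_percentage` computes its values internally.

```python
import pandas as pd

# Three rows copied from the timing table (seconds per call).
times = pd.DataFrame(
    {
        "sum_pd": [0.000796, 0.001381, 0.112333],
        "sum_fc": [0.000515, 0.001012, 0.080407],
        "sum_nb": [0.000502, 0.000936, 0.072154],
    },
    index=[16, 16384, 2097152],
)

# Percentage by which each function is slower (+) or faster (-)
# than the chosen baseline column, here sum_nb.
pct = (times.div(times["sum_nb"], axis=0) - 1) * 100
print(pct.round(1))
```

The baseline column comes out as zero by construction, and every other column reads as "percent slower than `sum_nb`" at that input size.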
    

    If you don't like the decorators, you can also set everything up in one call (in that case you don't need the BenchmarkBuilder or the add_function/add_arguments decorators):

    from simple_benchmark import benchmark
    r = benchmark([sum_pd, sum_fc, sum_nb], {2**i: creator(2**i) for i in range(4, 22)}, "Rows in DataFrame")
    

    Here perfplot offers a very similar interface (and result):

    import perfplot
    r = perfplot.bench(
        setup=creator,
        kernels=[sum_pd, sum_fc, sum_nb],
        n_range=[2**k for k in range(4, 22)],
        xlabel='Rows in DataFrame',
        )
    import matplotlib.pyplot as plt
    plt.loglog()
    r.plot()
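
If neither library can be installed, the same kind of sweep over input sizes can be approximated with the standard library alone. This is a rough sketch of the idea, not the implementation used by simple_benchmark or perfplot. The name `run_sweep` and the two stand-in functions are hypothetical and exist only to keep the example self-contained.

```python
import timeit


def run_sweep(funcs, arguments, repeat=3, number=10):
    """Time every function on every argument.

    funcs:     list of callables that take a single argument.
    arguments: dict mapping a label (e.g. the row count) to the input.
    Returns {label: {function name: best seconds per call}}.
    """
    results = {}
    for label, arg in arguments.items():
        row = {}
        for func in funcs:
            totals = timeit.repeat(lambda: func(arg), repeat=repeat, number=number)
            row[func.__name__] = min(totals) / number
        results[label] = row
    return results


# Hypothetical stand-in functions so the sketch runs without pandas.
def sum_builtin(values):
    return sum(values)


def sum_loop(values):
    total = 0
    for v in values:
        total += v
    return total


sizes = {2**i: list(range(2**i)) for i in range(4, 10)}
timings = run_sweep([sum_builtin, sum_loop], sizes)
for n, row in timings.items():
    print(n, row)
```

With the functions from this answer the call would be `run_sweep([sum_pd, sum_fc, sum_nb], {2**i: creator(2**i) for i in range(4, 22)})`, and `pd.DataFrame.from_dict(timings, orient='index')` turns the result into a table with one row per size. Unlike the two libraries, this sketch does no plotting.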
    
