Performance of Pandas apply vs np.vectorize to create new column from existing columns


Problem Description

I am using Pandas dataframes and want to create a new column as a function of existing columns. I have not seen a good discussion of the speed difference between df.apply() and np.vectorize(), so I thought I would ask here.

The Pandas apply() function is slow. From what I measured (shown below in some experiments), using np.vectorize() is 25x faster (or more) than using the DataFrame function apply(), at least on my 2016 MacBook Pro. Is this an expected result, and why?

For example, suppose I have the following dataframe with N rows:

N = 10
A_list = np.random.randint(1, 100, N)
B_list = np.random.randint(1, 100, N)
df = pd.DataFrame({'A': A_list, 'B': B_list})
df.head()
#     A   B
# 0  78  50
# 1  23  91
# 2  55  62
# 3  82  64
# 4  99  80

Suppose further that I want to create a new column as a function of the two columns A and B. In the example below, I'll use a simple function divide(). To apply the function, I can use either df.apply() or np.vectorize():

def divide(a, b):
    if b == 0:
        return 0.0
    return float(a)/b

df['result'] = df.apply(lambda row: divide(row['A'], row['B']), axis=1)

df['result2'] = np.vectorize(divide)(df['A'], df['B'])

df.head()
#     A   B    result   result2
# 0  78  50  1.560000  1.560000
# 1  23  91  0.252747  0.252747
# 2  55  62  0.887097  0.887097
# 3  82  64  1.281250  1.281250
# 4  99  80  1.237500  1.237500

If I increase N to real-world sizes like 1 million or more, then I observe that np.vectorize() is 25x faster or more than df.apply().

Below is some complete benchmarking code:

import pandas as pd
import numpy as np
import time

def divide(a, b):
    if b == 0:
        return 0.0
    return float(a)/b

for N in [1000, 10000, 100000, 1000000, 10000000]:

    print('')
    A_list = np.random.randint(1, 100, N)
    B_list = np.random.randint(1, 100, N)
    df = pd.DataFrame({'A': A_list, 'B': B_list})

    start_epoch_sec = int(time.time())
    df['result'] = df.apply(lambda row: divide(row['A'], row['B']), axis=1)
    end_epoch_sec = int(time.time())
    result_apply = end_epoch_sec - start_epoch_sec

    start_epoch_sec = int(time.time())
    df['result2'] = np.vectorize(divide)(df['A'], df['B'])
    end_epoch_sec = int(time.time())
    result_vectorize = end_epoch_sec - start_epoch_sec

    print('N=%d, df.apply: %d sec, np.vectorize: %d sec' %
          (N, result_apply, result_vectorize))

    # Make sure results from df.apply and np.vectorize match.
    assert df['result'].equals(df['result2'])

Here are the results:

N=1000, df.apply: 0 sec, np.vectorize: 0 sec

N=10000, df.apply: 1 sec, np.vectorize: 0 sec

N=100000, df.apply: 2 sec, np.vectorize: 0 sec

N=1000000, df.apply: 24 sec, np.vectorize: 1 sec

N=10000000, df.apply: 262 sec, np.vectorize: 4 sec

If np.vectorize() is in general always faster than df.apply(), then why is np.vectorize() not mentioned more? I only ever see StackOverflow posts related to df.apply(), such as:

pandas create new column based on values from other columns

How to use Pandas 'apply' function to multiple columns?

How to apply a function to two columns of Pandas dataframe

Recommended Answer

I will start by saying that the power of Pandas and NumPy arrays is derived from high-performance vectorised calculations on numeric arrays.[1] The entire point of vectorised calculations is to avoid Python-level loops by moving calculations to highly optimised C code and utilising contiguous memory blocks.[2]

Now we can look at some timings. Below are all Python-level loops which produce either pd.Series, np.ndarray or list objects containing the same values. For the purposes of assignment to a series within a dataframe, the results are comparable.

# Python 3.6.5, NumPy 1.14.3, Pandas 0.23.0

np.random.seed(0)
N = 10**5
# Same dataframe construction as in the question, at this N.
df = pd.DataFrame({'A': np.random.randint(1, 100, N),
                   'B': np.random.randint(1, 100, N)})

%timeit list(map(divide, df['A'], df['B']))                                   # 43.9 ms
%timeit np.vectorize(divide)(df['A'], df['B'])                                # 48.1 ms
%timeit [divide(a, b) for a, b in zip(df['A'], df['B'])]                      # 49.4 ms
%timeit [divide(a, b) for a, b in df[['A', 'B']].itertuples(index=False)]     # 112 ms
%timeit df.apply(lambda row: divide(*row), axis=1, raw=True)                  # 760 ms
%timeit df.apply(lambda row: divide(row['A'], row['B']), axis=1)              # 4.83 s
%timeit [divide(row['A'], row['B']) for _, row in df[['A', 'B']].iterrows()]  # 11.6 s
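
As a concrete illustration of the assignment point above, any of these loop outputs can be assigned straight back to a dataframe column; for example, with the fastest variant:

df['result'] = list(map(divide, df['A'], df['B']))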

Some key points:
  1. The tuple-based methods (the first 4) are a factor more efficient than pd.Series-based methods (the last 3).
  2. np.vectorize, list comprehension + zip and map methods, i.e. the top 3, all have roughly the same performance. This is because they use tuple and bypass some Pandas overhead from pd.DataFrame.itertuples.
  3. There is a significant speed improvement from using raw=True with pd.DataFrame.apply versus without. This option feeds NumPy arrays to the custom function instead of pd.Series objects.

pd.DataFrame.apply: just another loop

To see exactly the objects Pandas passes around, you can amend your function trivially:

def foo(row):
    print(type(row))
    assert False  # because you only need to see this once
df.apply(lambda row: foo(row), axis=1)

Output: <class 'pandas.core.series.Series'>. Creating, passing and querying a Pandas series object carries significant overheads relative to NumPy arrays. This shouldn't be a surprise: Pandas series include a decent amount of scaffolding to hold an index, values, attributes, etc.

Do the same exercise again with raw=True and you'll see <class 'numpy.ndarray'>. All this is described in the docs, but seeing it is more convincing.
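
For instance, a minimal variation of the probe above (foo_raw is just an illustrative name; same df as before):

def foo_raw(row):
    print(type(row))  # prints <class 'numpy.ndarray'> when raw=True is set
    assert False      # again, you only need to see this once

df.apply(foo_raw, axis=1, raw=True)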

The docs for np.vectorize have the following note:

The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.

广播规则"在这里无关紧要,因为输入数组具有相同的维度.与 map 的平行是有指导意义的,因为上面的 map 版本具有几乎相同的性能.源代码展示了什么发生:np.vectorize 将您的输入函数转换为 通用函数(ufunc")通过np.frompyfunc.有一些优化,例如缓存,这可以带来一些性能改进.

The "broadcasting rules" are irrelevant here, since the input arrays have the same dimensions. The parallel to map is instructive, since the map version above has almost identical performance. The source code shows what's happening: np.vectorize converts your input function into a Universal function ("ufunc") via np.frompyfunc. There is some optimisation, e.g. caching, which can lead to some performance improvement.

In short, np.vectorize does what a Python-level loop should do, but pd.DataFrame.apply adds chunky overhead. There's none of the JIT compilation you get with numba (see below). It's just a convenience.

Why aren't the above differences mentioned anywhere? Because the performance of truly vectorised calculations makes them irrelevant:

%timeit np.where(df['B'] == 0, 0, df['A'] / df['B'])       # 1.17 ms
%timeit (df['A'] / df['B']).replace([np.inf, -np.inf], 0)  # 1.96 ms

Yes, that's ~40x faster than the fastest of the loopy solutions above. Either of these is acceptable. In my opinion, the first is succinct, readable and efficient. Only look at other methods, e.g. numba below, if performance is critical and this is part of your bottleneck.

When loops are considered viable, they are usually optimised via numba with underlying NumPy arrays to move as much as possible to C.

Indeed, numba improves performance to microseconds. Without some cumbersome work, it will be difficult to get much more efficient than this.

from numba import njit

@njit
def divide(a, b):
    res = np.empty(a.shape)
    for i in range(len(a)):
        if b[i] != 0:
            res[i] = a[i] / b[i]
        else:
            res[i] = 0
    return res

%timeit divide(df['A'].values, df['B'].values)  # 717 µs

Using @njit(parallel=True) may provide a further boost for larger arrays.
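
A minimal sketch of what that might look like, using numba's prange (divide_parallel is an illustrative name; actual gains depend on array size and core count):

from numba import njit, prange

@njit(parallel=True)
def divide_parallel(a, b):
    res = np.empty(a.shape)
    for i in prange(len(a)):  # prange lets numba split iterations across threads
        res[i] = a[i] / b[i] if b[i] != 0 else 0.0
    return res

# usage: divide_parallel(df['A'].values, df['B'].values)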

[1] Numeric types include: int, float, datetime, bool, category. They exclude object dtype and can be held in contiguous memory blocks.
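
As an illustration of why this matters (a sketch; exact timings will vary by machine):

numeric = np.arange(10**6)       # int64: one contiguous block of machine integers
boxed = numeric.astype(object)   # object dtype: an array of pointers to boxed Python ints

%timeit numeric.sum()            # dispatches to optimised C over contiguous memory
%timeit boxed.sum()              # falls back to Python-level addition, far slower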

[2] There are at least two reasons why NumPy operations are efficient relative to pure Python:

  • Everything in Python is an object. Unlike C, this includes numbers. Python types therefore carry overhead that does not exist with native C types.
  • NumPy methods are usually C-based. In addition, optimised algorithms are used where possible.
