Performance of Pandas apply vs np.vectorize to create new column from existing columns


Question


I am using Pandas dataframes and want to create a new column as a function of existing columns. I have not seen a good discussion of the speed difference between df.apply() and np.vectorize(), so I thought I would ask here.


The Pandas apply() function is slow. From what I measured (shown below in some experiments), using np.vectorize() is 25x faster (or more) than using the DataFrame function apply() , at least on my 2016 MacBook Pro. Is this an expected result, and why?


For example, suppose I have the following dataframe with N rows:

N = 10
A_list = np.random.randint(1, 100, N)
B_list = np.random.randint(1, 100, N)
df = pd.DataFrame({'A': A_list, 'B': B_list})
df.head()
#     A   B
# 0  78  50
# 1  23  91
# 2  55  62
# 3  82  64
# 4  99  80


Suppose further that I want to create a new column as a function of the two columns A and B. In the example below, I'll use a simple function divide(). To apply the function, I can use either df.apply() or np.vectorize():

def divide(a, b):
    if b == 0:
        return 0.0
    return float(a)/b

df['result'] = df.apply(lambda row: divide(row['A'], row['B']), axis=1)

df['result2'] = np.vectorize(divide)(df['A'], df['B'])

df.head()
#     A   B    result   result2
# 0  78  50  1.560000  1.560000
# 1  23  91  0.252747  0.252747
# 2  55  62  0.887097  0.887097
# 3  82  64  1.281250  1.281250
# 4  99  80  1.237500  1.237500


If I increase N to real-world sizes like 1 million or more, then I observe that np.vectorize() is 25x faster or more than df.apply().


Below is some complete benchmarking code:

import pandas as pd
import numpy as np
import time

def divide(a, b):
    if b == 0:
        return 0.0
    return float(a)/b

for N in [1000, 10000, 100000, 1000000, 10000000]:    

    print('')
    A_list = np.random.randint(1, 100, N)
    B_list = np.random.randint(1, 100, N)
    df = pd.DataFrame({'A': A_list, 'B': B_list})

    start_epoch_sec = int(time.time())
    df['result'] = df.apply(lambda row: divide(row['A'], row['B']), axis=1)
    end_epoch_sec = int(time.time())
    result_apply = end_epoch_sec - start_epoch_sec

    start_epoch_sec = int(time.time())
    df['result2'] = np.vectorize(divide)(df['A'], df['B'])
    end_epoch_sec = int(time.time())
    result_vectorize = end_epoch_sec - start_epoch_sec


    print('N=%d, df.apply: %d sec, np.vectorize: %d sec' %
          (N, result_apply, result_vectorize))

    # Make sure results from df.apply and np.vectorize match.
    assert(df['result'].equals(df['result2']))

The results are shown below:

N=1000, df.apply: 0 sec, np.vectorize: 0 sec

N=10000, df.apply: 1 sec, np.vectorize: 0 sec

N=100000, df.apply: 2 sec, np.vectorize: 0 sec

N=1000000, df.apply: 24 sec, np.vectorize: 1 sec

N=10000000, df.apply: 262 sec, np.vectorize: 4 sec


If np.vectorize() is in general always faster than df.apply(), then why is np.vectorize() not mentioned more? I only ever see StackOverflow posts related to df.apply(), such as:

Create new column in Pandas based on values from other columns

How to use Pandas' "apply" function on multiple columns?

How to apply a function to two columns of a Pandas dataframe

Answer


I will start by saying that the power of Pandas and NumPy arrays is derived from high-performance vectorised calculations on numeric arrays [1]. The entire point of vectorised calculations is to avoid Python-level loops by moving calculations to highly optimised C code and utilising contiguous memory blocks [2].


Now we can look at some timings. Below are all Python-level loops which produce either pd.Series, np.ndarray or list objects containing the same values. For the purposes of assignment to a series within a dataframe, the results are comparable.

# Python 3.6.5, NumPy 1.14.3, Pandas 0.23.0

np.random.seed(0)
N = 10**5
df = pd.DataFrame(np.random.randint(1, 100, (N, 2)), columns=['A', 'B'])

%timeit list(map(divide, df['A'], df['B']))                                   # 43.9 ms
%timeit np.vectorize(divide)(df['A'], df['B'])                                # 48.1 ms
%timeit [divide(a, b) for a, b in zip(df['A'], df['B'])]                      # 49.4 ms
%timeit [divide(a, b) for a, b in df[['A', 'B']].itertuples(index=False)]     # 112 ms
%timeit df.apply(lambda row: divide(*row), axis=1, raw=True)                  # 760 ms
%timeit df.apply(lambda row: divide(row['A'], row['B']), axis=1)              # 4.83 s
%timeit [divide(row['A'], row['B']) for _, row in df[['A', 'B']].iterrows()]  # 11.6 s

  1. The tuple-based methods (the first 4) are significantly more efficient than the pd.Series-based methods (the last 3).
  2. np.vectorize, list comprehension + zip and map methods, i.e. the top 3, all have roughly the same performance. This is because they use tuple and bypass some Pandas overhead from pd.DataFrame.itertuples.
  3. There is a significant speed improvement from using raw=True with pd.DataFrame.apply versus without. This option feeds NumPy arrays to the custom function instead of pd.Series objects.
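As a small illustration of point 3 (a sketch using a toy frame with made-up values, not the benchmark data): with raw=True each row arrives as a plain ndarray, so the function indexes positionally rather than by label.

```python
import numpy as np
import pandas as pd

def divide(a, b):
    if b == 0:
        return 0.0
    return float(a) / b

# Toy frame (hypothetical values).
df = pd.DataFrame({'A': [78, 23, 0], 'B': [50, 0, 62]})

# raw=True passes each row as a bare np.ndarray, so we index by
# position instead of by column label.
df['result_raw'] = df.apply(lambda row: divide(row[0], row[1]),
                            axis=1, raw=True)
print(df['result_raw'].tolist())  # [1.56, 0.0, 0.0]
```

The positional indexing is what makes the row access cheap; the trade-off is that the function no longer sees column names.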


pd.DataFrame.apply: just another loop

To see exactly the objects Pandas passes around, you can amend your function trivially:

def foo(row):
    print(type(row))
    assert False  # because you only need to see this once
df.apply(lambda row: foo(row), axis=1)


Output: <class 'pandas.core.series.Series'>. Creating, passing and querying a Pandas series object carries significant overheads relative to NumPy arrays. This shouldn't be a surprise: Pandas series include a fair amount of scaffolding to hold an index, values, attributes, etc.


Do the same exercise again with raw=True and you'll see <class 'numpy.ndarray'>. All this is described in the docs, but seeing it is more convincing.
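A compact way to see both cases side by side (a sketch with a tiny hypothetical frame, recording types rather than asserting):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

seen = []

def peek(row):
    # Record what type of object apply() hands us.
    seen.append(type(row).__name__)
    return 0

df.apply(peek, axis=1)            # each row is a pd.Series
df.apply(peek, axis=1, raw=True)  # each row is a bare np.ndarray

print(seen[0], seen[-1])  # Series ndarray
```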


The docs for np.vectorize have the following note:


The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.


The "broadcasting rules" are irrelevant here, since the input arrays have the same dimensions. The parallel to map is instructive, since the map version above has almost identical performance. The source code shows what's happening: np.vectorize converts your input function into a Universal function ("ufunc") via np.frompyfunc. There is some optimisation, e.g. caching, which can lead to some performance improvement.
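You can see the np.frompyfunc relationship directly (a sketch: frompyfunc alone always returns object dtype, while np.vectorize layers output-type inference on top of it):

```python
import numpy as np

def divide(a, b):
    if b == 0:
        return 0.0
    return float(a) / b

a = np.array([78, 23, 55])
b = np.array([50, 0, 62])

# frompyfunc wraps the Python function as a ufunc, but every result
# comes back as object dtype.
raw_ufunc = np.frompyfunc(divide, 2, 1)
print(raw_ufunc(a, b).dtype)             # object

# np.vectorize builds on frompyfunc and infers an output dtype from a
# trial call, giving float64 here.
print(np.vectorize(divide)(a, b).dtype)  # float64
```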


In short, np.vectorize does what a Python-level loop should do, but pd.DataFrame.apply adds substantial overhead. There's no JIT compilation of the kind you see with numba (see below); it's just a convenience.


Why aren't the above differences mentioned anywhere? Because the performance of truly vectorised calculations makes them irrelevant:

%timeit np.where(df['B'] == 0, 0, df['A'] / df['B'])       # 1.17 ms
%timeit (df['A'] / df['B']).replace([np.inf, -np.inf], 0)  # 1.96 ms


Yes, that's ~40x faster than the fastest of the loopy solutions above. Either of these would be acceptable. In my opinion, the first is succinct, readable and efficient. Only look at other methods, e.g. numba below, if performance is critical and this is part of your bottleneck.


When loops are considered viable they are usually optimised via numba with underlying NumPy arrays to move as much as possible to C.


Indeed, numba improves performance to microseconds. Without some cumbersome work, it will be difficult to get much more efficient than this.

from numba import njit

@njit
def divide(a, b):
    res = np.empty(a.shape)
    for i in range(len(a)):
        if b[i] != 0:
            res[i] = a[i] / b[i]
        else:
            res[i] = 0
    return res

%timeit divide(df['A'].values, df['B'].values)  # 717 µs


Using @njit(parallel=True) may provide a further boost for larger arrays.


[1] Numeric types include: int, float, datetime, bool, category. They exclude object dtype and can be held in contiguous memory blocks.


[2] There are at least two reasons why NumPy operations are efficient relative to pure Python:

  • Everything in Python is an object. This includes, unlike C, numbers. Python types therefore have an overhead which does not exist with native C types.
  • NumPy methods are usually C-based. In addition, optimised algorithms are used where possible.
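The object-overhead point can be made concrete (a small sketch; exact byte counts are CPython implementation details):

```python
import sys
import numpy as np

# A single small Python int is a full heap object with a header.
print(sys.getsizeof(1))   # ~28 bytes on CPython 3.x

# A NumPy int64 array stores raw 8-byte values contiguously.
arr = np.arange(1000, dtype=np.int64)
print(arr.itemsize)       # 8
print(arr.nbytes)         # 8000
```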

