Pandas: How to make apply on dataframe faster?

Question

Consider this pandas example, where I compute column C using apply with a lambda function: C is A when B is greater than 5, and 0.1 * A * B otherwise:

import pandas as pd
df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9],'B':[9,8,7,6,5,4,3,2,1]})

df['C'] = df.apply(lambda x: x.A if x.B > 5 else 0.1*x.A*x.B, axis=1)

The expected result would be:

   A  B    C
0  1  9  1.0
1  2  8  2.0
2  3  7  3.0
3  4  6  4.0
4  5  5  2.5
5  6  4  2.4
6  7  3  2.1
7  8  2  1.6
8  9  1  0.9

The problem is that this code is slow, and I need to do this operation on a dataframe with around 56 million rows.

The %timeit result of the above lambda operation is:

1000 loops, best of 3: 1.63 ms per loop

Going by the calculation time and the memory usage when doing this on my large dataframe, I presume this operation uses intermediate Series while doing the calculations.
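
One way to check this presumption is to look at what apply hands to the function. With axis=1, pandas passes each row as its own Series object, which is a likely source of the per-row overhead. A minimal sketch on the sample frame (the helper name record_type is only an illustration, not part of the original post):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [9, 8, 7, 6, 5, 4, 3, 2, 1]})

row_types = []

def record_type(row):
    # Hypothetical helper: remember what kind of object apply passes in.
    row_types.append(type(row))
    return 0

df.apply(record_type, axis=1)

# Every row arrives wrapped in a pandas Series.
print(set(row_types))
```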

I tried to formulate it in different ways, including using temporary columns, but every alternative solution I came up with is even slower.

Is there a way to get the result I need in a different and faster way, e.g. by using numpy?

Recommended answer

For performance, you might be better off working with a NumPy array and using np.where -

import numpy as np
a = df.values # Assuming you have two columns A and B
df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
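
Before switching a large job over, it may be worth confirming on the small sample frame that the vectorized expression reproduces the row-wise result. A minimal self-contained sketch (the variable names expected and vectorized are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [9, 8, 7, 6, 5, 4, 3, 2, 1]})

# Row-wise version from the question.
expected = df.apply(lambda x: x.A if x.B > 5 else 0.1 * x.A * x.B, axis=1)

# Vectorized version from the answer: column 0 is A, column 1 is B.
a = df.values
vectorized = np.where(a[:, 1] > 5, a[:, 0], 0.1 * a[:, 0] * a[:, 1])

# Both approaches should agree up to floating-point rounding.
print(np.allclose(expected.to_numpy(), vectorized))
```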


Runtime test

def numpy_based(df):
    a = df.values # Assuming you have two columns A and B
    df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])

Timings -

In [271]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

In [272]: %timeit numpy_based(df)
1000 loops, best of 3: 380 µs per loop

In [273]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

In [274]: %timeit df['C'] = df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
100 loops, best of 3: 3.39 ms per loop

In [275]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

In [276]: %timeit df['C'] = np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
1000 loops, best of 3: 1.12 ms per loop

In [277]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

In [278]: %timeit df['C'] = np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
1000 loops, best of 3: 1.19 ms per loop
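
The timings above come from IPython's %timeit. Outside IPython, a comparable measurement can be sketched with the standard timeit module. The function names with_apply and with_numpy, the fixed seed, and the loop counts are arbitrary choices for illustration, and absolute numbers will differ by machine and library version:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
df = pd.DataFrame(rng.integers(0, 9, size=(10000, 2)), columns=['A', 'B'])

def with_apply(frame):
    # Row-wise version from the question.
    return frame.apply(lambda x: x.A if x.B > 5 else 0.1 * x.A * x.B, axis=1)

def with_numpy(frame):
    # Array-based version from the answer.
    a = frame.values
    return np.where(a[:, 1] > 5, a[:, 0], 0.1 * a[:, 0] * a[:, 1])

# Best of 3 repeats; apply is run once per repeat since it is expected to be slow.
apply_s = min(timeit.repeat(lambda: with_apply(df), number=1, repeat=3))
numpy_s = min(timeit.repeat(lambda: with_numpy(df), number=100, repeat=3)) / 100

print(f"apply: {apply_s * 1e3:.2f} ms per loop")
print(f"numpy: {numpy_s * 1e6:.1f} µs per loop")
```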


Closer look

Let's take a closer look at NumPy's number-crunching capability and bring pandas into the comparison -

# Extract out as array (it's a view, so not really expensive
#   .. compared to the later computations themselves)

In [291]: a = df.values 

In [296]: %timeit df.values
10000 loops, best of 3: 107 µs per loop
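
How cheap df.values is depends on the frame's layout. When the columns have different dtypes, pandas has to build a new array with one common dtype, so on a wider real-world frame it may be cheaper to extract only the columns needed. A sketch, assuming an extra float column named price that does not appear in the original example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [9, 5, 1],
                   'price': [0.5, 1.5, 2.5]})  # hypothetical extra column

# The whole frame is converted to one common dtype (float64 here).
print(df.values.dtype)

# Extracting single columns keeps each column's own integer dtype.
a = df['A'].to_numpy()
b = df['B'].to_numpy()
print(a.dtype, b.dtype)

# Same computation as in the answer, on the per-column arrays.
c = np.where(b > 5, a, 0.1 * a * b)
```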

Case #1: Work with a NumPy array and use numpy.where:

In [292]: %timeit np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
10000 loops, best of 3: 86.5 µs per loop

Again, assigning the result into a new column, df['C'], would not be very expensive either -

In [300]: %timeit df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
1000 loops, best of 3: 323 µs per loop

Case #2: Work with the pandas dataframe and use its .where method (no NumPy)

In [293]: %timeit df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
100 loops, best of 3: 3.4 ms per loop

Case #3: Work with the pandas dataframe (no NumPy array), but use numpy.where -

In [294]: %timeit np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
1000 loops, best of 3: 764 µs per loop

Case #4: Work with the pandas dataframe again (no NumPy array), but use numpy.where with the Series .mul method for the products -

In [295]: %timeit np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
1000 loops, best of 3: 830 µs per loop
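
For a frame with tens of millions of rows, the temporaries may matter as well as the speed: np.where receives both branches already evaluated over the full arrays. One possible variant, which is not part of the original answer and has not been timed here, computes the product only for the rows that need it:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [9, 8, 7, 6, 5, 4, 3, 2, 1]})

a = df['A'].to_numpy()
b = df['B'].to_numpy()

# Start from a float copy of A, which is already the result where B > 5.
c = a.astype(np.float64)

# Overwrite only the rows where the condition B > 5 does not hold.
mask = ~(b > 5)
c[mask] = 0.1 * a[mask] * b[mask]

df['C'] = c
```

Whether this actually lowers peak memory depends on how many rows fall into the else branch, so it is worth measuring on a sample of the real data.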
