Vectorize calculation of a Pandas DataFrame
Question
I have a trivial problem that I have solved using loops, but I would like to know whether some of it can be vectorized to improve performance.
Essentially I have 2 dataframes (DF_A and DF_B), where each row in DF_B is the sum of the corresponding row in DF_A and the row above it in DF_B. I already have the first row of values in DF_B.
df_a = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    # ... more rows
]
df_b = [
    [1, 2, 3, 4],
    # remaining rows are all 0 values, so the dimensions match df_a
]
What I am trying to achieve is that, for example, the 2nd row in df_b will be the values of the first row in df_b plus the values of the second row in df_a. So in this case:
df_b.loc[2] = [6,8,10,12]
I was able to accomplish this with a loop over the range of df_a, saving the previous row's values and then adding the row at the current index to them. It doesn't seem very efficient.
Recommended answer
Here is a numpy solution. This should be significantly faster than a pandas loop, especially since it uses JIT-compiling via numba.
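import pandas as pd
from numba import jit

# Work on the underlying numpy arrays rather than the DataFrames
a = df_a.values
b = df_b.values

@jit(nopython=True)
def fill_b(a, b):
    # Each row of b is the previous row of b plus the current row of a
    for i in range(1, len(b)):
        b[i] = b[i-1] + a[i]
    return b

df_b = pd.DataFrame(fill_b(a, b))

#     0   1   2   3
# 0   1   2   3   4
# 1   6   8  10  12
# 2  15  18  21  24
# 3  28  32  36  40
# 4  45  50  55  60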
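The same recurrence can also be written without an explicit loop, because row i of df_b is just the first row of df_b plus the running sum of rows 1..i of df_a. The following is a minimal sketch of that idea, not part of the original answer; the names fill_b_vectorized and first_row are illustrative, and the 5x4 sample frame is an assumption chosen to match the output shown above.

```python
import numpy as np
import pandas as pd

# Assumed sample data: 5 rows x 4 columns holding the values 1..20
df_a = pd.DataFrame(np.arange(1, 21).reshape(5, 4))
first_row = np.array([1, 2, 3, 4])  # the known first row of df_b

def fill_b_vectorized(df_a, first_row):
    """Hypothetical helper: build df_b from df_a and the first row of df_b."""
    a = df_a.to_numpy()
    # Running sum of a[1], a[2], ... down the rows, shifted by first_row
    rest = first_row + np.cumsum(a[1:], axis=0)
    b = np.vstack([first_row, rest])
    return pd.DataFrame(b, index=df_a.index, columns=df_a.columns)

df_b = fill_b_vectorized(df_a, first_row)
print(df_b)
```

When the first row of df_b happens to equal the first row of df_a, this reduces to df_a.cumsum().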
Performance benchmarking
import numpy as np
import pandas as pd
from numba import jit

df_a = pd.DataFrame(np.arange(1, 1000001).reshape(1000, 1000))

@jit(nopython=True)
def fill_b(a, b):
    for i in range(1, len(b)):
        b[i] = b[i-1] + a[i]
    return b

def jp(df_a):
    a = df_a.values
    b = np.empty(df_a.values.shape)
    b[0] = np.arange(1, 1001)
    return pd.DataFrame(fill_b(a, b))

%timeit df_a.cumsum()  # 16.1 ms
%timeit jp(df_a)       # 6.05 ms
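The two timed expressions are only comparable because, in this benchmark, the first row of b is set to the first row of df_a, so the loop and cumsum() should produce the same values. The sketch below checks that on a small frame. It is an illustration rather than part of the original answer: fill_b_loop is a hypothetical plain-Python copy of fill_b without the numba decorator, and the 10x10 frame is an assumed size so the check runs quickly.

```python
import numpy as np
import pandas as pd

def fill_b_loop(a, b):
    # Same recurrence as fill_b, without the numba decorator
    for i in range(1, len(b)):
        b[i] = b[i - 1] + a[i]
    return b

# Assumed small frame holding the values 1..100
df_a = pd.DataFrame(np.arange(1, 101).reshape(10, 10))

a = df_a.to_numpy()
b = np.empty(a.shape)
b[0] = a[0]  # first row of b equals first row of a, as in the benchmark

loop_result = pd.DataFrame(fill_b_loop(a, b))
cumsum_result = df_a.cumsum()

# Compare values only; the loop result is float, the cumsum result is integer
same = np.allclose(loop_result.to_numpy(), cumsum_result.to_numpy())
print(same)
```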