Vectorizing an iterative function on Pandas DataFrame


Question

I have a dataframe whose first row holds the initial condition.

import numpy as np
import pandas as pd
df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4] + [np.nan] * 3})

and a function f(x, r) = r*x*(1-x), where r = 2 is a constant and 0 <= x <= 1.

I want to produce the following dataframe by applying the function to the Pop column iteratively, row by row, i.e. df.Pop[i] = f(df.Pop[i-1], r=2).

df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4, 0.48, 0.4992, 0.49999872]})

Question: is it possible to do this in a vectorized way?

I can achieve the desired result by using a loop to build lists for the x and y values, but this is not vectorized.
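The loop-based approach mentioned here is not shown in the question, so the following is only a minimal sketch of what it might look like. The names `R` and `pop` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

R = 2  # growth constant r from the question
df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4] + [np.nan] * 3})

# Build the column value by value: each entry depends on the previous one.
pop = [df.at[0, "Pop"]]
for _ in range(len(df) - 1):
    x = pop[-1]
    pop.append(R * x * (1 - x))

df["Pop"] = pop
print(df)
```

This serves as a baseline: it produces the desired values, but it performs one Python-level step per row.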

I have also tried this, but all the NaN places are filled with 0.48.

df.loc[1:, "Pop"] = R * df.Pop[:-1] * (1 - df.Pop[:-1])

Recommended answer

It is IMPOSSIBLE to do this in a vectorized way.

By definition, vectorization reduces execution time by applying an operation to many elements at once, in parallel. But each value you want depends on the one computed before it, so the values must be computed sequentially, not in parallel. See this answer for a detailed explanation. Things like df.expanding(2).apply(f) and df.rolling(2).apply(f) won't work.
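To illustrate the sequential dependency, here is a small sketch (not part of the original answer) that applies f once to the whole column using `Series.shift`. The names `prev` and `one_pass` are assumptions. A single vectorized pass can only fill the row directly after a known value, because the later rows have no input yet.

```python
import numpy as np
import pandas as pd

R = 2
df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4] + [np.nan] * 3})

# Previous row's value for every row; NaN wherever it is not known yet.
prev = df["Pop"].shift(1)

# One vectorized application of f(x, r) = r*x*(1-x) over the whole column.
one_pass = R * prev * (1 - prev)

# Only row 1 receives a value; rows 2 and 3 stay NaN until row 1 is known.
print(one_pass)
```

Repeating such a pass once per row would eventually fill the column, but that is again a sequential loop rather than a single vectorized operation.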

However, it is possible to gain efficiency. You can do the iteration with a generator, which is a very common construct for implementing iterative processes.

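def gen(x_init, n, R=2):
    """Yield n successive values of x -> R*x*(1-x), starting after x_init."""
    x = x_init
    for _ in range(n):
        x = R * x * (1 - x)
        yield x

# execute: fill rows 1..end from the initial condition in row 0
df.loc[1:, "Pop"] = list(gen(df.at[0, "Pop"], len(df) - 1))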

Result:

print(df)
        Pop
0  0.400000
1  0.480000
2  0.499200
3  0.499999

It is completely OK to stop here for small data. If the function is going to be run many times, however, you can consider optimizing the generator with numba.

  • Run pip install numba or conda install numba in the console first.
  • import numba
  • Add the decorator @numba.njit in front of the generator.
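Put together, the steps above might look like the sketch below. The `try`/`except` fallback is an assumption added here so the snippet still runs without numba installed; it is not part of the original answer, and without numba the generator simply runs as ordinary Python.

```python
import numpy as np
import pandas as pd

try:
    import numba
    njit = numba.njit
except ImportError:
    # Assumption of this sketch: fall back to the uncompiled generator.
    def njit(func):
        return func

@njit
def gen(x_init, n, R=2):
    # Same generator as above, compiled by numba when it is available.
    x = x_init
    for _ in range(n):
        x = R * x * (1 - x)
        yield x

df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4] + [np.nan] * 3})
df.loc[1:, "Pop"] = list(gen(df.at[0, "Pop"], len(df) - 1))
print(df)
```

Note that the first call includes compilation time, so any timing comparison should be made on a later call.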

Change the number of np.nan values to 10^6 and check the difference in execution time yourself. On my Core i5-8250U 64-bit laptop, the time improved from 468 ms to 217 ms.
