Vectorizing an iterative function on Pandas DataFrame
Question
I have a dataframe where the first row is the initial condition.
df = pd.DataFrame({"Year": np.arange(4),
"Pop": [0.4] + [np.nan]* 3})
and a function f(x, r) = r*x*(1-x), where r = 2 is a constant and 0 <= x <= 1.
I want to produce the following dataframe by applying the function to column Pop row by row, iteratively. That is, df.Pop[i] = f(df.Pop[i-1], r=2).
df = pd.DataFrame({"Year": np.arange(4),
"Pop": [0.4, 0.48, 0.4992, 0.49999872]})
Question: Is it possible to do this in a vectorized way?
I can achieve the desired result by using a loop to build lists for the x and y values, but this is not vectorized.
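The asker's loop is not shown, so the following is only a sketch of what such a baseline might look like. The helper name `f` and the list name `pop` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def f(x, r=2):
    """One step of the update rule f(x, r) = r * x * (1 - x)."""
    return r * x * (1 - x)


df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4] + [np.nan] * 3})

# Each value depends on the previous one, so the column is built step by step.
pop = [df.at[0, "Pop"]]
for _ in range(len(df) - 1):
    pop.append(f(pop[-1]))

df["Pop"] = pop
```

This reproduces the target column, at the cost of one Python-level function call per row.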
I have also tried this, but all nan places are filled with 0.48.
df.loc[1:, "Pop"] = R * df.Pop[:-1] * (1 - df.Pop[:-1])
Recommended answer
It is IMPOSSIBLE to do this in a vectorized way.
By definition, vectorization makes use of parallel processing to reduce execution time. But the desired values in your question must be computed sequentially, not in parallel, because each value depends on the one before it. See this answer for a detailed explanation. Things like df.expanding(2).apply(f) and df.rolling(2).apply(f) won't work.
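One way to see the sequential dependence is to write the recurrence as a scan, where each output is fed back in as the next input. The sketch below uses `itertools.accumulate` from the standard library (the `initial` keyword needs Python 3.8 or later). The names `R`, `x0` and `n_steps` are chosen here for illustration and are not part of the original answer.

```python
from itertools import accumulate

R = 2
x0 = 0.4
n_steps = 3

# accumulate() passes the previous result as the first argument of the lambda.
# The second argument (the step counter from range) is ignored.
values = list(accumulate(range(n_steps),
                         lambda x, _: R * x * (1 - x),
                         initial=x0))
```

No step can start before the previous one has finished, which is why element-wise operations on the whole column cannot express this update.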
However, gaining more efficiency is possible. You can do the iteration using a generator. This is a very common construct for implementing iterative processes.
def gen(x_init, n, R=2):
    x = x_init
    for _ in range(n):
        x = R * x * (1-x)
        yield x

# execute
df.loc[1:, "Pop"] = list(gen(df.at[0, "Pop"], len(df) - 1))
Result:
print(df)
Pop
0 0.400000
1 0.480000
2 0.499200
3 0.499999
Stopping here is perfectly fine for small data. If the function is going to be run many times, however, you can consider optimizing the generator with numba.
- Run pip install numba or conda install numba in the console first.
- import numba
- Add the decorator @numba.njit in front of the generator.
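Put together, the steps above might look like the sketch below. The `try`/`except` fallback is an addition made here so that the snippet still runs without numba installed. It is not part of the original answer, and in that case the generator runs as plain Python.

```python
try:
    import numba
    njit = numba.njit
except ImportError:
    # Assumption for illustration: without numba, use a no-op decorator.
    def njit(func):
        return func


@njit
def gen(x_init, n, R=2):
    x = x_init
    for _ in range(n):
        x = R * x * (1 - x)
        yield x


values = list(gen(0.4, 3))
```

The first call includes numba's compilation time, so any timing comparison should be made on a second call.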
Change the number of np.nan values to 10^6 and check the difference in execution time yourself. On my Core-i5 8250U 64-bit laptop, the time improved from 468 ms to 217 ms.
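A rough way to run this check is sketched below for the plain generator, using `time.perf_counter`. The variable names `n_rows` and `elapsed` are illustrative, and the measured time will vary by machine, so no expected figure is given.

```python
import time

import numpy as np
import pandas as pd


def gen(x_init, n, R=2):
    x = x_init
    for _ in range(n):
        x = R * x * (1 - x)
        yield x


n_rows = 10**6 + 1  # one initial value followed by 10^6 NaNs
df = pd.DataFrame({"Year": np.arange(n_rows),
                   "Pop": [0.4] + [np.nan] * (n_rows - 1)})

start = time.perf_counter()
df.loc[1:, "Pop"] = list(gen(df.at[0, "Pop"], len(df) - 1))
elapsed = time.perf_counter() - start

print(f"plain generator: {elapsed * 1000:.0f} ms")
```

To compare, repeat the timed section with the decorated generator after one warm-up call.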