Pandas: Exponentially decaying sum with variable weights

Question

Similar to the question "Exponential Decay on Python Pandas DataFrame", I would like to quickly compute exponentially decaying sums for some columns in a data frame. However, the rows in the data frame are not evenly spaced in time. The recurrence is exponential_sum[i] = column_to_sum[i] + np.exp(-const*(time[i]-time[i-1])) * exponential_sum[i-1], so the weight np.exp(...) changes from row to row and does not factor out as a constant. It is not obvious to me how to adapt the approach from that question and still take advantage of pandas/numpy vectorization. Is there a vectorized pandas solution to this problem?

To illustrate the desired calculation, here is a sample frame with the exponential moving sum of A stored in Sum using a decay constant of 1:

    time  A       Sum
0   1.00  1  1.000000
1   2.10  3  3.332871
2   2.13 -1  2.234370
3   3.70  7  7.464850
4  10.00  2  2.013708
5  10.20  1  2.648684
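
As a baseline for checking any faster implementation, the recurrence can be written as a plain Python loop. This is only a minimal sketch and not part of the original post: the name decayed_sum_reference is hypothetical, and it assumes the times are sorted in ascending order.

```python
import math

def decayed_sum_reference(times, values, decay_constant=1.0):
    """Plain-Python reference for the recurrence stated in the question."""
    sums = []
    total = 0.0
    previous_time = None
    for t, a in zip(times, values):
        if previous_time is None:
            # First row: nothing to decay yet.
            total = float(a)
        else:
            # Decay the running total by the elapsed time, then add the new value.
            total = a + math.exp(-decay_constant * (t - previous_time)) * total
        previous_time = t
        sums.append(total)
    return sums

times = [1, 2.1, 2.13, 3.7, 10, 10.2]
values = [1, 3, -1, 7, 2, 1]
reference = decayed_sum_reference(times, values)
print([round(s, 6) for s in reference])
```

The printed values should agree with the Sum column above up to rounding. A loop like this is slow on millions of rows, which is the motivation for a compiled or vectorized version.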

Recommended answer

This question is more complicated than it first appeared. I ended up using numba's jit to compile a generator function to calculate the exponential sums. My end result calculates the exponential sum of 5 million rows in under a second on my computer, which hopefully is fast enough for your needs.

import numpy as np
import pandas as pd

# Initial dataframe.
df = pd.DataFrame({'time': [1, 2.1, 2.13, 3.7, 10, 10.2],
                   'A': [1, 3, -1, 7, 2, 1]})

# Initial decay parameter.
decay_constant = 1

We can define the decay weights as exp(-time_delta * decay_constant). The first row has no preceding timestamp, so diff() returns NaN there, and we set that first weight to one:

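df['weight'] = np.exp(-df['time'].diff() * decay_constant)
# Assign through df.loc so the change is written to df itself, not to a temporary Series.
df.loc[df.index[0], 'weight'] = 1.0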

>>> df
   A   time    weight
0  1   1.00  1.000000
1  3   2.10  0.332871
2 -1   2.13  0.970446
3  7   3.70  0.208045
4  2  10.00  0.001836
5  1  10.20  0.818731

Now we'll use jit from numba to optimize a generator function that calculates the exponential sums:

from numba import jit

@jit(nopython=True)
def exponential_sum(A, k):
    total = A[0]
    yield total
    for i in range(1, len(A)):  # On Python 2, use xrange instead.
        total = total * k[i] + A[i]
        yield total

We'll use the generator to add the values to the dataframe:

df['expSum'] = list(exponential_sum(df.A.values, df.weight.values))

This produces the desired output:

>>> df
   A   time    weight    expSum
0  1   1.00  1.000000  1.000000
1  3   2.10  0.332871  3.332871
2 -1   2.13  0.970446  2.234370
3  7   3.70  0.208045  7.464850
4  2  10.00  0.001836  2.013708
5  1  10.20  0.818731  2.648684
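
Unrolling the recurrence gives expSum[i] = sum over j <= i of A[j] * exp(-c * (time[i] - time[j])), which can be rewritten as exp(-c * time[i]) * cumsum(A * exp(c * time))[i]. That suggests a loop-free NumPy version. The sketch below is an alternative to the numba generator, not the answer's method, and decayed_sum_vectorized is a hypothetical name. It assumes sorted times and a short overall time span: exp() overflows float64 once c * (time[-1] - time[0]) passes roughly 700, so it is not suitable as-is for long series.

```python
import numpy as np

def decayed_sum_vectorized(times, values, decay_constant=1.0):
    """Loop-free decayed sum; assumes sorted times and a short time span."""
    t = np.asarray(times, dtype=np.float64)
    a = np.asarray(values, dtype=np.float64)
    # Shift times so the first one is zero, which delays overflow in exp().
    shifted = decay_constant * (t - t[0])
    # cumsum(A[j] * exp(c * t[j])), scaled back by exp(-c * t[i]).
    return np.exp(-shifted) * np.cumsum(a * np.exp(shifted))

times = [1, 2.1, 2.13, 3.7, 10, 10.2]
values = [1, 3, -1, 7, 2, 1]
vectorized = decayed_sum_vectorized(times, values)
print(np.round(vectorized, 6))
```

On the six-row sample this should reproduce the expSum column up to rounding. For data covering a long time range, the sequential numba loop avoids the overflow issue because it only ever exponentiates one time delta at a time.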

So let's scale to 5 million rows and check performance:

n_rows = 5000000  # NumPy needs an integer size here, not the float 5e6.
df = pd.DataFrame({'time': np.random.rand(n_rows).cumsum(),
                   'A': np.random.randint(1, 10, n_rows)})
df['weight'] = np.exp(-df['time'].diff() * decay_constant)
df.loc[df.index[0], 'weight'] = 1.0

%%timeit -n 10 
df['expSum'] = list(exponential_sum(df.A.values, df.weight.values))
10 loops, best of 3: 726 ms per loop
