Pandas Rolling Computations on Sliding Windows (Unevenly spaced)


Question

Consider you've got some unevenly spaced time series data:

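import pandas as pd
import random as randy
# 1000 distinct microsecond offsets within one second, used as an unevenly spaced index
offsets = randy.sample(range(10**6), 1000)
index = pd.Timestamp('2013-02-01 09:00:00') + pd.to_timedelta(offsets, unit='us')
ts = pd.Series(range(1000), index=index).sort_index()
print(ts.head())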


2013-02-01 09:00:00.002895    995
2013-02-01 09:00:00.003765    499
2013-02-01 09:00:00.003838    797
2013-02-01 09:00:00.004727    295
2013-02-01 09:00:00.006287    253

Let's say I wanted to do the rolling sum over a 1ms window to get this:

2013-02-01 09:00:00.002895    995
2013-02-01 09:00:00.003765    499 + 995
2013-02-01 09:00:00.003838    797 + 499 + 995
2013-02-01 09:00:00.004727    295 + 797 + 499
2013-02-01 09:00:00.006287    253
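
To pin down the target, here is a minimal brute-force sketch on the five rows above. It assumes the window is inclusive at both ends, so a row at time t sums every row in [t - 1ms, t]. The name `naive_rolling_sum` is only illustrative and not part of pandas.

```python
import pandas as pd

def naive_rolling_sum(s, window):
    # For each timestamp t, sum all values whose timestamps fall in [t - window, t].
    # O(n^2): a reference for correctness, not for speed.
    out = []
    for t in s.index:
        mask = (s.index >= t - window) & (s.index <= t)
        out.append(s[mask].sum())
    return pd.Series(out, index=s.index)

sample = pd.Series(
    [995, 499, 797, 295, 253],
    index=pd.to_datetime([
        '2013-02-01 09:00:00.002895',
        '2013-02-01 09:00:00.003765',
        '2013-02-01 09:00:00.003838',
        '2013-02-01 09:00:00.004727',
        '2013-02-01 09:00:00.006287',
    ]),
)
expected = naive_rolling_sum(sample, pd.Timedelta('1ms'))
print(expected.tolist())  # [995, 1494, 2291, 1591, 253]
```

This is quadratic in the number of rows, so it is only useful for checking a faster implementation on small inputs.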

Currently, I cast everything back to longs and do this in Cython, but is this possible in pure pandas? I'm aware that you can do something like .asfreq('U'), then fill and use the traditional rolling functions, but this doesn't scale once you've got more than a toy number of rows.

As a point of reference, here's a hackish, not fast Cython version:

%%cython
import numpy as np
cimport cython
cimport numpy as np

ctypedef np.double_t DTYPE_t

def rolling_sum_cython(np.ndarray[long,ndim=1] times, np.ndarray[double,ndim=1] to_add, long window_size):
    # times: sorted integer timestamps; window_size is in the same units as times
    cdef long t_len = times.shape[0], s_len = to_add.shape[0], i = 0, win_size = window_size, j, window_start
    cdef np.ndarray[DTYPE_t, ndim=1] res = np.zeros(t_len, dtype=np.double)
    assert t_len == s_len
    for i in range(t_len):
        window_start = times[i] - win_size
        j = i
        # walk backwards while still inside the window; test j >= 0 first so times[-1] is never read
        while j >= 0 and times[j] >= window_start:
            res[i] += to_add[j]
            j -= 1
    return res
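
The loop above rescans the whole window for every row, so its cost grows with the window length. As a hedged illustration (not the asker's code), the same inclusive window can be maintained with two pointers and a running total. The sketch below is plain Python/NumPy for readability, and `rolling_sum_two_pointer` is a hypothetical name. It assumes `times` is sorted and `window_size` is non-negative.

```python
import numpy as np

def rolling_sum_two_pointer(times, values, window_size):
    # times: sorted 1-D integer array; window_size is in the same units as times.
    res = np.empty(len(times), dtype=np.float64)
    start = 0      # left edge of the current window
    running = 0.0  # sum of values[start..i]
    for i in range(len(times)):
        running += values[i]
        # drop rows that have fallen out of [times[i] - window_size, times[i]]
        while times[start] < times[i] - window_size:
            running -= values[start]
            start += 1
        res[i] = running
    return res

# The five sample rows, as microseconds past 09:00:00, with a 1000 us (1 ms) window.
times = np.array([2895, 3765, 3838, 4727, 6287], dtype=np.int64)
values = np.array([995, 499, 797, 295, 253], dtype=np.float64)
result = rolling_sum_two_pointer(times, values, 1000)
print(result)
```

Each row is added once and removed at most once, so the work is linear in the number of rows. The trade-off is that repeated subtraction can accumulate floating-point error on long non-integer series.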

Demonstrating this on a slightly larger series:

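import numpy as np
offsets = randy.sample(range(10**8), 100000)
index = pd.Timestamp('2013-02-01 09:00:00') + pd.to_timedelta(offsets, unit='us')
ts = pd.Series(range(100000), index=index).sort_index()

%%timeit
# assuming a nanosecond-resolution index, a window of 1e6 is 1 ms
res2 = rolling_sum_cython(ts.index.values.astype(np.int64), ts.values.astype(np.float64), int(1e6))
1000 loops, best of 3: 1.56 ms per loop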

Recommended answer

You can solve most problems of this sort with cumsum and binary search.

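import numpy as np
import pandas as pd
from datetime import timedelta

def msum(s, lag_in_ms):
    # s must be sorted by its DatetimeIndex
    lag = s.index - timedelta(milliseconds=lag_in_ms)
    # binary search: first position whose timestamp is >= t - lag
    inds = np.searchsorted(s.index.values, lag.values)
    cs = s.cumsum().values
    # the cumsum difference drops row inds itself, so add it back to keep the window inclusive
    return pd.Series(cs - cs[inds] + s.values[inds], index=s.index)

res = msum(ts, 100)
print(pd.DataFrame({'a': ts, 'a_msum_100': res}))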


                            a  a_msum_100
2013-02-01 09:00:00.073479  5           5
2013-02-01 09:00:00.083717  8          13
2013-02-01 09:00:00.162707  1          14
2013-02-01 09:00:00.171809  6          20
2013-02-01 09:00:00.240111  7          14
2013-02-01 09:00:00.258455  0          14
2013-02-01 09:00:00.336564  2           9
2013-02-01 09:00:00.536416  3           3
2013-02-01 09:00:00.632439  4           7
2013-02-01 09:00:00.789746  9           9

[10 rows x 2 columns]
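
As a cross-check, the ten rows printed above can be rebuilt by hand and compared against the time-offset form of `Series.rolling`, which more recent pandas releases accept for a sorted `DatetimeIndex`. This is a sketch, not part of the original answer. It restates `msum` with positional NumPy indexing so the block is self-contained, and it assumes `closed='both'` matches the inclusive window used here.

```python
import numpy as np
import pandas as pd

def msum(s, lag_in_ms):
    # cumsum + binary search, using positional (NumPy) indexing throughout
    lag = s.index - pd.Timedelta(milliseconds=lag_in_ms)
    inds = np.searchsorted(s.index.values, lag.values)
    cs = s.cumsum().values
    return pd.Series(cs - cs[inds] + s.values[inds], index=s.index)

small = pd.Series(
    [5, 8, 1, 6, 7, 0, 2, 3, 4, 9],
    index=pd.to_datetime([
        '2013-02-01 09:00:00.073479', '2013-02-01 09:00:00.083717',
        '2013-02-01 09:00:00.162707', '2013-02-01 09:00:00.171809',
        '2013-02-01 09:00:00.240111', '2013-02-01 09:00:00.258455',
        '2013-02-01 09:00:00.336564', '2013-02-01 09:00:00.536416',
        '2013-02-01 09:00:00.632439', '2013-02-01 09:00:00.789746',
    ]),
)

via_cumsum = msum(small, 100)
# Time-based window: every row in [t - 100ms, t], both ends included.
via_rolling = small.rolling('100ms', closed='both').sum()
print(pd.DataFrame({'a': small, 'msum': via_cumsum, 'rolling': via_rolling}))
```

If the two columns agree on your data, the built-in rolling window may be the simpler choice. The cumsum approach remains useful where only NumPy arrays are available.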

You need a way of handling NaNs, and depending on your application you may or may not need the prevailing value as of the lagged time (i.e. the difference between using kdb+ bin and np.searchsorted).
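
One possible way to address both caveats is sketched below. It is an assumption about what is wanted, not the answerer's code. NaNs are treated as zero before the cumulative sum, and the `side` argument of `np.searchsorted` decides whether a row sitting exactly on the lagged boundary is counted. The name `msum_nan_safe` is hypothetical, and the zero-prefixed cumulative sum is a small variation on the formula above.

```python
import numpy as np
import pandas as pd

def msum_nan_safe(s, lag_in_ms, side='left'):
    # Treat NaN as 0 so one missing value does not poison every later cumsum entry.
    filled = s.fillna(0.0)
    lag = s.index - pd.Timedelta(milliseconds=lag_in_ms)
    # side='left'  -> first row with time >= t - lag (boundary row included)
    # side='right' -> first row with time >  t - lag (boundary row excluded)
    inds = np.searchsorted(s.index.values, lag.values, side=side)
    # cs[k] is the sum of the first k rows, so cs[i + 1] - cs[inds[i]] sums rows inds[i]..i
    cs = np.concatenate(([0.0], filled.cumsum().values))
    pos = np.arange(len(s))
    return pd.Series(cs[pos + 1] - cs[inds], index=s.index)

idx = pd.to_datetime([
    '2013-02-01 09:00:00.000', '2013-02-01 09:00:00.050',
    '2013-02-01 09:00:00.100', '2013-02-01 09:00:00.150',
])
s = pd.Series([1.0, np.nan, 2.0, 4.0], index=idx)

inclusive = msum_nan_safe(s, 100, side='left')   # window [t - 100ms, t]
half_open = msum_nan_safe(s, 100, side='right')  # window (t - 100ms, t]
print(pd.DataFrame({'s': s, 'inclusive': inclusive, 'half_open': half_open}))
```

The two results differ only at the 09:00:00.100 row, whose lagged time lands exactly on the first timestamp. Whether NaN should count as zero, be skipped, or propagate depends on the application.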

Hope this helps.
