Most efficient way to shift MultiIndex time series

Question

I have a DataFrame that consists of many stacked time series. The index is (poolId, month) where both are integers, the "month" being the number of months since 2000. What's the best way to calculate one-month lagged versions of multiple variables?

Right now, I do something like:

cols_to_shift = ["bal", ...5 more columns...]
df_shift = df[cols_to_shift].groupby(level=0).transform(lambda x: x.shift(-1))
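
On later pandas releases (not 0.8.1, which this question targets), the GroupBy object has its own vectorized shift, which avoids calling a Python lambda once per pool. A minimal sketch, assuming the same df and cols_to_shift as above:

# vectorized per-group shift; keeps the original (pool, month) index
df_shift = df[cols_to_shift].groupby(level=0).shift(-1)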

For my data, this took me a full 60 s to run. (I have 48k different pools and a total of 718k rows.)

I'm converting this from R code and the equivalent data.table call:

dt.shift <- dt[, list(bal=myshift(bal), ...), by=list(poolId)]

only takes 9 s to run. (Here "myshift" is something like "function(x) c(x[-1], NA)".)

Is there a way I can get the pandas version to be back in line speed-wise? I tested this on 0.8.1.

Here's an example of generating a close-enough data set, so you can get some idea of what I mean:

import numpy as np
import pandas as pd

ids = np.arange(48000)
# random per-pool series lengths, at least 1 month each
lens = np.maximum(np.round(15 + 9.5 * np.random.randn(48000)), 1.0).astype(int)
id_vec = np.repeat(ids, lens)
lens_shift = np.concatenate(([0], lens[:-1]))
# month counter that restarts at 0 for each pool
mon_vec = np.arange(lens.sum()) - np.repeat(np.cumsum(lens_shift), lens)
n = len(mon_vec)
df = pd.DataFrame.from_items([('pool', id_vec), ('month', mon_vec)] + [(c, np.random.rand(n)) for c in 'abcde'])
df = df.set_index(['pool', 'month'])
%time df_shift = df.groupby(level=0).transform(lambda x: x.shift(-1))
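
DataFrame.from_items was deprecated and later removed from pandas, so the snippet above only runs as-is on old releases. A rough equivalent on current pandas, assuming the same id_vec, mon_vec and n from above (a sketch, not part of the original question):

# dict-based construction; key insertion order gives the column order
data = {'pool': id_vec, 'month': mon_vec}
data.update({c: np.random.rand(n) for c in 'abcde'})
df = pd.DataFrame(data).set_index(['pool', 'month'])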

That took 64 s when I tried it. This data has every series starting at month 0; really, they should all end at month np.max(lens), with ragged start dates, but good enough.

Edit 2: Here's some comparison R code. This takes 0.8 s. Factor of 80, not good.

library(data.table)
ids <- 1:48000
lens <- as.integer(pmax(1, round(rnorm(ids, mean=15, sd=9.5))))
id.vec <- rep(ids, times=lens)
lens.shift <- c(0, lens[-length(lens)])
mon.vec <- (1:sum(lens)) - rep(cumsum(lens.shift), times=lens)
n <- length(id.vec)
dt <- data.table(pool=id.vec, month=mon.vec, a=rnorm(n), b=rnorm(n), c=rnorm(n), d=rnorm(n), e=rnorm(n))
setkey(dt, pool, month)
myshift <- function(x) c(x[-1], NA)
system.time(dt.shift <- dt[, list(month=month, a=myshift(a), b=myshift(b), c=myshift(c), d=myshift(d), e=myshift(e)), by=pool])

Answer

I would suggest you reshape the data and do a single shift versus the groupby approach:

result = df.unstack(0).shift(1).stack()

This switches the order of the levels so you'd want to swap and reorder:

result = result.swaplevel(0, 1).sortlevel(0)
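
Note that sortlevel is a 0.x-era method that has since been removed; on current pandas, sort_index plays the same role, so the whole recipe can be written in one line (a sketch against a modern pandas API, not the version used in the answer):

# modern spelling: sort_index replaces the old sortlevel
result = df.unstack(0).shift(1).stack().swaplevel(0, 1).sort_index()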

You can verify it's been lagged by one period (you want shift(1) instead of shift(-1)):

In [17]: result.ix[1]
Out[17]: 
              a         b         c         d         e
month                                                  
1      0.752511  0.600825  0.328796  0.852869  0.306379
2      0.251120  0.871167  0.977606  0.509303  0.809407
3      0.198327  0.587066  0.778885  0.565666  0.172045
4      0.298184  0.853896  0.164485  0.169562  0.923817
5      0.703668  0.852304  0.030534  0.415467  0.663602
6      0.851866  0.629567  0.918303  0.205008  0.970033
7      0.758121  0.066677  0.433014  0.005454  0.338596
8      0.561382  0.968078  0.586736  0.817569  0.842106
9      0.246986  0.829720  0.522371  0.854840  0.887886
10     0.709550  0.591733  0.919168  0.568988  0.849380
11     0.997787  0.084709  0.664845  0.808106  0.872628
12     0.008661  0.449826  0.841896  0.307360  0.092581
13     0.727409  0.791167  0.518371  0.691875  0.095718
14     0.928342  0.247725  0.754204  0.468484  0.663773
15     0.934902  0.692837  0.367644  0.061359  0.381885
16     0.828492  0.026166  0.050765  0.524551  0.296122
17     0.589907  0.775721  0.061765  0.033213  0.793401
18     0.532189  0.678184  0.747391  0.199283  0.349949

In [18]: df.ix[1]
Out[18]: 
              a         b         c         d         e
month                                                  
0      0.752511  0.600825  0.328796  0.852869  0.306379
1      0.251120  0.871167  0.977606  0.509303  0.809407
2      0.198327  0.587066  0.778885  0.565666  0.172045
3      0.298184  0.853896  0.164485  0.169562  0.923817
4      0.703668  0.852304  0.030534  0.415467  0.663602
5      0.851866  0.629567  0.918303  0.205008  0.970033
6      0.758121  0.066677  0.433014  0.005454  0.338596
7      0.561382  0.968078  0.586736  0.817569  0.842106
8      0.246986  0.829720  0.522371  0.854840  0.887886
9      0.709550  0.591733  0.919168  0.568988  0.849380
10     0.997787  0.084709  0.664845  0.808106  0.872628
11     0.008661  0.449826  0.841896  0.307360  0.092581
12     0.727409  0.791167  0.518371  0.691875  0.095718
13     0.928342  0.247725  0.754204  0.468484  0.663773
14     0.934902  0.692837  0.367644  0.061359  0.381885
15     0.828492  0.026166  0.050765  0.524551  0.296122
16     0.589907  0.775721  0.061765  0.033213  0.793401
17     0.532189  0.678184  0.747391  0.199283  0.349949

Perf isn't too bad with this method (it might be a touch slower in 0.9.0):

In [19]: %time result = df.unstack(0).shift(1).stack()
CPU times: user 1.46 s, sys: 0.24 s, total: 1.70 s
Wall time: 1.71 s
