Splitting irregular time series into regular monthly averages - R


Problem description


In order to establish seasonal effects on energy use, I need to align the energy use information that I have from a billing database with monthly temperatures.

I'm working with a billing dataset that has bills of varying lengths and start and end dates, and I'd like to obtain the monthly average for each account within each month. For example, I have a billing database that has the following characteristics:

   acct amount      begin        end days
1  2242  11349 2009-10-06 2009-11-04   29
2  2242  12252 2009-11-04 2009-12-04   30
3  2242  21774 2009-12-04 2010-01-08   35
4  2242  18293 2010-01-08 2010-02-05   28
5  2243  27217 2009-10-06 2009-11-04   29
6  2243    117 2009-11-04 2009-12-04   30
7  2243  14543 2009-12-04 2010-01-08   35

I would like to figure out how to coerce these somewhat irregular time series (for each account) to get the average amount per day within each month that is spanned within each bill, such that:

   acct amount      begin        end days avgamtpday
1  2242  11349 2009-10-01 2009-10-31   31          X
2  2242  12252 2009-11-01 2009-11-30   30          X
3  2242  21774 2009-12-01 2009-12-31   31          X
4  2242  18293 2010-01-01 2010-01-31   31          X
4  2242  18293 2010-02-01 2010-02-28   28          X
5  2243  27217 2009-10-01 2009-10-31   31          X
6  2243    117 2009-11-01 2009-11-30   30          X
7  2243  14543 2009-12-01 2009-12-31   31          X
7  2243  14543 2010-01-01 2010-01-31   31          X

I'm fairly agnostic to whichever tool can do this, since I only have to do this once.

An additional wrinkle is the table is about 150,000 rows long, which is not really very big by most standards, but big enough to make a loop solution in R difficult. I've investigated using the zoo, xts, and tempdisagg packages in R. I started writing a really ugly loop that would split each bill, then create one row for each month within an existing bill, and then tapply() to summarize by accts and months, but honestly, couldn't see how to do it efficiently.

In MySQL, I've tried this:

create or replace view v3 as select 1 n union all select 1 union all select 1;
create or replace view v as select 1 n from v3 a, v3 b union all select 1;
set @n = 0;
drop table if exists calendar; create table calendar(dt date primary key);
insert into calendar
select cast('2008-1-1' + interval @n:=@n+1 day as date) as dt from v a, v b, v c, v d, v e, v;

select acct, amount, begin, end, billAmtPerDay,
       sum(billAmtPerDay) MonthAmt,
       count(*) Days,
       sum(billAmtPerDay)/count(*) AverageAmtPerDay,
       year(dt), month(dt)
from ( select *, amount/days billAmtPerDay
       from bills b
       inner join calendar c on dt between begin and end and begin <> dt ) x
group by acct, amount, begin, end, billAmtPerDay, year(dt), month(dt);

But for reasons I don't understand, my server doesn't like this table, and gets hung up on the inner join, even when I stage the different calculations. I'm investigating if there are any temporary memory limits on it.

Thanks!

Solution

Here's a start using data.table:

billdata <- read.table(text=" acct amount begin end days
1 2242 11349 2009-10-06 2009-11-04 29
2 2242 12252 2009-11-04 2009-12-04 30
3 2242 21774 2009-12-04 2010-01-08 35
4 2242 18293 2010-01-08 2010-02-05 28
5 2243 27217 2009-10-06 2009-11-04 29
6 2243 117 2009-11-04 2009-12-04 30
7 2243 14543 2009-12-04 2010-01-08 35", sep=" ", header=TRUE, row.names=1)

require(data.table)
DT = as.data.table(billdata)

First, change type of columns begin and end to dates. Unlike data.frame, this doesn't copy the entire dataset.

DT[,begin:=as.Date(begin)]
DT[,end:=as.Date(end)]
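As a small illustration of the by-reference point, the sketch below converts both columns in a single `:=` call and compares the table's memory address before and after. The toy table `toy` and the use of `address()` as a check are illustrative choices of mine, not part of the original answer; I would expect the two addresses to match.

```r
library(data.table)

# Hypothetical two-row table, used only for illustration
toy <- data.table(begin = c("2009-10-06", "2009-11-04"),
                  end   = c("2009-11-04", "2009-12-04"))

addr_before <- address(toy)

# Convert both character columns to Date in place with one := call
cols <- c("begin", "end")
toy[, (cols) := lapply(.SD, as.Date), .SDcols = cols]

class(toy$begin)                      # "Date"
identical(addr_before, address(toy))  # compares the address before and after
```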

Then find the time span, find the prevailing bill for each day, and aggregate.

alldays = DT[,seq(min(begin),max(end),by="day")]

setkey(DT, acct, begin)

DT[CJ(unique(acct),alldays),
   mean(amount/days,na.rm=TRUE),
   by=list(acct,month=format(begin,"%Y-%m")), roll=TRUE]

    acct   month        V1
 1: 2242 2009-10 391.34483
 2: 2242 2009-11 406.69448
 3: 2242 2009-12 601.43226
 4: 2242 2010-01 646.27465
 5: 2242 2010-02 653.32143
 6: 2243 2009-10 938.51724
 7: 2243 2009-11  97.36172
 8: 2243 2009-12 375.68065
 9: 2243 2010-01 415.51429
10: 2243 2010-02 415.51429
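To see where one of these figures comes from, here is the arithmetic for account 2242 in 2009-11, written out by hand. It assumes, as the rolling join does, that each day takes the daily rate of the bill with the latest begin date on or before that day.

```r
# Account 2242, November 2009 (30 days)
rate_bill1 <- 11349 / 29   # bill beginning 2009-10-06, prevails on Nov 1-3
rate_bill2 <- 12252 / 30   # bill beginning 2009-11-04, prevails on Nov 4-30

# Day-weighted mean of the two daily rates
nov_mean <- (3 * rate_bill1 + 27 * rate_bill2) / 30
round(nov_mean, 5)   # equals 406.69448, matching row 2 of the output above
```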

I think you'll find the prevailing join logic quite cumbersome in SQL, and slower.

I call this a start because it's not quite correct. Notice that row 10 repeats row 9: account 2243, unlike account 2242, doesn't stretch into 2010-02, so the roll carries its last bill forward. To finish it off you could rbind in the last row for each account and use rolltolast instead of roll. Or perhaps create alldays by account rather than across all accounts.
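The second suggestion (building the day sequence per account) can also be sketched in base R, without data.table, as a way to cross-check the logic. This is not the answer's method: the helper `daily_rates` and the use of `findInterval()` to look up the prevailing bill are my own assumptions, and I have not timed it on 150,000 rows.

```r
# Billing data from the question
bills <- data.frame(
  acct   = c(2242, 2242, 2242, 2242, 2243, 2243, 2243),
  amount = c(11349, 12252, 21774, 18293, 27217, 117, 14543),
  begin  = as.Date(c("2009-10-06", "2009-11-04", "2009-12-04", "2010-01-08",
                     "2009-10-06", "2009-11-04", "2009-12-04")),
  end    = as.Date(c("2009-11-04", "2009-12-04", "2010-01-08", "2010-02-05",
                     "2009-11-04", "2009-12-04", "2010-01-08")),
  days   = c(29, 30, 35, 28, 29, 30, 35)
)

# For one account: list every day it spans and look up the prevailing bill,
# i.e. the bill with the latest begin date on or before that day
daily_rates <- function(b) {
  b <- b[order(b$begin), ]
  day <- seq(b$begin[1], b$end[nrow(b)], by = "day")
  prevailing <- findInterval(as.numeric(day), as.numeric(b$begin))
  data.frame(acct = b$acct[1], day = day,
             rate = (b$amount / b$days)[prevailing])
}

daily <- do.call(rbind, lapply(split(bills, bills$acct), daily_rates))
daily$month <- format(daily$day, "%Y-%m")

# Mean daily rate per account and calendar month
monthly <- aggregate(rate ~ acct + month, data = daily, FUN = mean)
monthly[order(monthly$acct, monthly$month), ]
```

Because each account gets its own day range, account 2243 stops at 2010-01 and the repeated 2010-02 row does not appear.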

See if speed is acceptable on the above, and we can go from there.

It's likely you will hit a bug in 1.8.2 that has been fixed in 1.8.3. I'm using v1.8.3.

"Internal" error message when combining join containing missing groups and group by is fixed, #2162. For example : X[Y,.N,by=NonJoinColumn] where Y contains some rows that don't match to X. This bug could also result in a seg fault.

Let me know and we can either work around, or upgrade to 1.8.3 from R-Forge.

Btw, nice example data. That made it quicker to answer.


Here's the full answer alluded to above. It's a bit tricky I have to admit, as it combines together several features of data.table. This should work in 1.8.2 as it happens, but I've only tested in 1.8.3.

DT[ setkey(DT[,seq(begin[1],last(end),by="day"),by=acct]),
    mean(amount/days,na.rm=TRUE),
    by=list(acct,month=format(begin,"%Y-%m")), roll=TRUE]

   acct   month        V1
1: 2242 2009-10 391.34483
2: 2242 2009-11 406.69448
3: 2242 2009-12 601.43226
4: 2242 2010-01 646.27465
5: 2242 2010-02 653.32143
6: 2243 2009-10 938.51724
7: 2243 2009-11  97.36172
8: 2243 2009-12 375.68065
9: 2243 2010-01 415.51429
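Since the one-liner packs several steps together, here is a sketch of the same idea split into three named steps, using only account 2243 to keep it short. It assumes a more recent data.table release than the 1.8.x versions discussed above, because it joins with `on=` rather than a key; the names `daygrid`, `perday` and `monthly_avg` are hypothetical.

```r
library(data.table)

# Account 2243 from the question, kept small for illustration
DT <- data.table(
  acct   = 2243,
  amount = c(27217, 117, 14543),
  begin  = as.Date(c("2009-10-06", "2009-11-04", "2009-12-04")),
  end    = as.Date(c("2009-11-04", "2009-12-04", "2010-01-08")),
  days   = c(29, 30, 35)
)

# Step 1: one row per account and calendar day spanned by that account
daygrid <- DT[, .(begin = seq(begin[1], last(end), by = "day")), by = acct]

# Step 2: rolling join; each day picks up the bill with the latest begin <= day.
# In the result, the begin column holds the day from daygrid.
perday <- DT[daygrid, on = c("acct", "begin"), roll = TRUE]

# Step 3: average the daily rate within each account and month
monthly_avg <- perday[, .(avg = mean(amount / days)),
                      by = .(acct, month = format(begin, "%Y-%m"))]
monthly_avg
```

Keeping the intermediate `perday` table also makes it easy to inspect which bill was assigned to any given day.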
