平均值跨时期的分摊 [英] Flat Apportionment of values across time periods
问题描述
对于 id
的不同值,我有一个开始
和结束
有相对数量的日期, var
。
对于每个记录(对于相同的 id
),开始
日期与上一个 end
日期(此处 roll
...)。
For different values of id
I have a start
and end
dates with a relative quantity, var
.
For each records (for the same id
), start
date is the same then the previous end
date (here it comes roll
...).
这些期间跨越多个月,可能多年。我的需要是将 var
中的数量分成相对于每个月的实际天数的部分。例如
These periods span across multiple months and possibly years. My need is to split the quantity in var
into parts relative to the actual days in each months. e.g.
start end var
30/01/2006 20/02/2006 104
以上我有21天,下限属于上一期,上限为当前,所以1/21的104将分配到2006年1月和其余的2006年2月
above I have 21 days, the lower limit will belong to the previous period and the upper to the current, so 1/21 of 104 will be assigned to Jan 2006 and the rest to Feb 2006
我目前有两种方法,下面列出了虚拟数据,但是他们很慢,
I currently have two methods, listed below with dummy data, but they are pretty slow and I was wondering if someone may help with me out to speed them up.
library(data.table)
# data
set.seed(1)
nsample <- 200L # To increase the data size just change nsample
dt <- data.table(id= 1L:nsample)
dt <- dt[, list(date=sample(seq(as.Date("2006-01-01"), as.Date("2012-01-01"), "day"), 51, F)), by=id]
setkey(dt)
dt <- dt[, {tmp <- embed(as.vector(date), 2);list(start = structure(tmp[,2], class="Date"),
end = structure(tmp[,1], class="Date"),
var = rnorm(50, 100, 5))}, by=id]
setkey(dt, id, end)
> dt[1:4]
id start end var
1: 1 2006-01-30 2006-02-20 104.41542
2: 1 2006-02-20 2006-05-15 106.89356
3: 1 2006-05-15 2006-08-21 106.71162
4: 1 2006-08-21 2006-09-30 96.21729
# Method 1
dt1 <- copy(dt)
system.time({
dt1[, id2 := 1:.N]
tmp <- dt1[, list(id = id,
date = seq(start+1, end, "day"),
var = var), by=id2]
tmp[, var := var/(.N), by=id2]
res1 <- tmp[, list(var = sum(var)), by=list(id, period = paste(year(date), month(date), sep="-"))]
})
#user system elapsed
#1.92 0.00 1.92
# Method 2
dt2 <- copy(dt)
system.time({
dt2[, Ndays := as.integer(end)-as.integer(start)]
tmp <- dt2[, list(date = seq(min(start)+1, max(end), "day")), by=id]
setkey(tmp)
res2 <- dt2[ tmp, roll=-Inf][ end >= start,list(var = sum(var/Ndays)), by=list(id, period = paste(year(end), month(end), sep="-")) ]
})
#user system elapsed
# 0.7 0.0 0.7
> sum(dt$var) == sum(res1$var)
[1] TRUE
> sum(dt$var) == sum(res2$var)
[1] TRUE
> all.equal(res1, res2)
[1] TRUE
> res2[1:4]
id period var
1: 1 2006-1 4.972163
2: 1 2006-2 109.623593
3: 1 2006-3 39.448815
4: 1 2006-4 38.176273
推荐答案
(比我的第三个版本快3倍)。我在你的第二个版本优化了几个东西,你可以看到如下:
This will be a bit faster (it's 3x faster for me than your second version). I optimized several things in your second version, that you can see below:
# let's just divide here instead of later
dt2[, var := var/(as.integer(end)-as.integer(start))]
tmp <- dt2[, list(date = seq(min(start)+1, max(end), "day")), by=id]
# data is sorted, so no need to sort again, just set key without sort
setattr(tmp, "sorted", c("id", "date"))
res2 <- dt2[tmp, roll=-Inf][,
list(var = sum(var)),
# doing the paste in by slows it down quite a bit, so let's postpone it
by=list(id, year(end), month(end))][,
`:=`(period = paste(year, month, sep = '-'), year = NULL, month = NULL)]
重新评论大尺寸 - 您可以在 dt2
中做上述所有操作。它会更慢,但我不会创建一个大的 tmp
:
Re comment about large sizes - you could do all of the above inside dt2
. It'll be slower, but I it won't create a large tmp
:
dt2[, var := var/(as.integer(end)-as.integer(start))][,
{tmp = data.table(date = seq(min(start)+1, max(end), "day"));
setattr(tmp, 'sorted', 'date');
setattr(.SD, 'sorted', 'end');
.SD[tmp, roll = -Inf][,
list(var = sum(var)), by = list(year(end), month(end))][,
`:=`(period = paste(year, month, sep = '-'), year = NULL, month = NULL)]
}, by = id]
这篇关于平均值跨时期的分摊的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!