平均值跨时期的分摊 [英] Flat Apportionment of values across time periods

查看:113
本文介绍了平均值跨时期的分摊的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

对于 id 的不同值,我有一个开始结束有相对数量的日期, var
对于每个记录(对于相同的 id ),开始日期与上一个 end 日期(此处 roll ...)。

For different values of id I have a start and end dates with a relative quantity, var. For each records (for the same id), start date is the same then the previous end date (here it comes roll...).

这些期间跨越多个月,可能多年。我的需要是将 var 中的数量分成相对于每个月的实际天数的部分。例如

These periods span across multiple months and possibly years. My need is to split the quantity in var into parts relative to the actual days in each months. e.g.

start       end         var
30/01/2006  20/02/2006  104

以上我有21天,下限属于上一期,上限为当前,所以1/21的104将分配到2006年1月和其余的2006年2月

above I have 21 days, the lower limit will belong to the previous period and the upper to the current, so 1/21 of 104 will be assigned to Jan 2006 and the rest to Feb 2006

我目前有两种方法,下面列出了虚拟数据,但是他们很慢,

I currently have two methods, listed below with dummy data, but they are pretty slow and I was wondering if someone may help with me out to speed them up.

library(data.table)

# data
set.seed(1)
nsample <- 200L  # To increase the data size just change nsample 

dt <- data.table(id= 1L:nsample)
dt <- dt[, list(date=sample(seq(as.Date("2006-01-01"), as.Date("2012-01-01"), "day"), 51, F)), by=id]

setkey(dt)
dt <- dt[, {tmp <- embed(as.vector(date), 2);list(start = structure(tmp[,2], class="Date"),
                                                  end   = structure(tmp[,1], class="Date"),
                                                  var   = rnorm(50, 100, 5))}, by=id]
setkey(dt, id, end)

> dt[1:4]
   id      start        end       var
1:  1 2006-01-30 2006-02-20 104.41542
2:  1 2006-02-20 2006-05-15 106.89356
3:  1 2006-05-15 2006-08-21 106.71162
4:  1 2006-08-21 2006-09-30  96.21729

# Method 1

dt1 <- copy(dt)

system.time({
  dt1[, id2 := 1:.N]
  tmp <- dt1[, list(id = id,
                   date = seq(start+1, end, "day"),
                   var = var), by=id2]
  tmp[, var := var/(.N), by=id2]
  res1 <- tmp[, list(var = sum(var)), by=list(id, period = paste(year(date), month(date), sep="-"))]
})

   #user  system elapsed 
   #1.92    0.00    1.92 

# Method 2

dt2 <- copy(dt)

system.time({
  dt2[, Ndays := as.integer(end)-as.integer(start)]
  tmp <- dt2[, list(date = seq(min(start)+1, max(end), "day")), by=id]
  setkey(tmp)
  res2 <- dt2[ tmp, roll=-Inf][ end >= start,list(var = sum(var/Ndays)), by=list(id, period = paste(year(end), month(end), sep="-")) ]
})

   #user  system elapsed 
   # 0.7     0.0     0.7 


> sum(dt$var) == sum(res1$var)
[1] TRUE
> sum(dt$var) == sum(res2$var)
[1] TRUE

> all.equal(res1, res2)
[1] TRUE

> res2[1:4]
   id period        var
1:  1 2006-1   4.972163
2:  1 2006-2 109.623593
3:  1 2006-3  39.448815
4:  1 2006-4  38.176273


推荐答案

(比我的第三个版本快3倍)。我在你的第二个版本优化了几个东西,你可以看到如下:

This will be a bit faster (it's 3x faster for me than your second version). I optimized several things in your second version, that you can see below:

# let's just divide here instead of later
dt2[, var := var/(as.integer(end)-as.integer(start))]
tmp <- dt2[, list(date = seq(min(start)+1, max(end), "day")), by=id]
# data is sorted, so no need to sort again, just set key without sort
setattr(tmp, "sorted", c("id", "date"))

res2 <- dt2[tmp, roll=-Inf][,
            list(var = sum(var)),
            # doing the paste in by slows it down quite a bit, so let's postpone it
            by=list(id, year(end), month(end))][,
            `:=`(period = paste(year, month, sep = '-'), year = NULL, month = NULL)]

重新评论大尺寸 - 您可以在 dt2 中做上述所有操作。它会更慢,但我不会创建一个大的 tmp

Re comment about large sizes - you could do all of the above inside dt2. It'll be slower, but I it won't create a large tmp:

dt2[, var := var/(as.integer(end)-as.integer(start))][,
    {tmp = data.table(date = seq(min(start)+1, max(end), "day"));
     setattr(tmp, 'sorted', 'date');
     setattr(.SD, 'sorted', 'end');
     .SD[tmp, roll = -Inf][,
         list(var = sum(var)), by = list(year(end), month(end))][,
         `:=`(period = paste(year, month, sep = '-'), year = NULL, month = NULL)]
    }, by = id]

这篇关于平均值跨时期的分摊的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆