data.table efficient recycling

Problem description

I frequently use recycling in data.table, for example when I need to make projections for future years: I repeat my original data for each future year.

This can lead to something like this:

library(data.table)
dt <- data.table(cbind(1:500000, 500000:1))
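# by = 1:nrow(dt) makes one group per row, so .(year = 1:10) recycles each row 10 times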
dt2 <- dt[, c(.SD, .(year = 1:10)), by = 1:nrow(dt) ]

But I often have to deal with millions of rows, and far more columns than in this toy example. The time increases. Try this:

library(data.table)
dt <- data.table(cbind(1:50000000, 50000000:1))
dt2 <- dt[, c(.SD, .(year = 1:10)), by = 1:nrow(dt) ]

My question is: is there a more efficient way to achieve this?

Thanks for your help!

Edit: the accepted answer is the most complete (so far) for this formulation of the question, but I realized that my problem is a bit trickier. To show it, I will ask another question: data.table efficient recycling V2

Answer

I'm benchmarking the solutions given so far against my own (which simply uses lapply and rbindlist). I couldn't run the entire task because I ran out of memory. That's why I chose a smaller dt:

library(data.table)

dt <- data.table(cbind(1:5000000, 5000000:1))

original <- function() {
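  # the OP's approach: by = 1:nrow(dt) groups row by row, which is slow for millions of rows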
  dt2 <- dt[, c(.SD, .(year = 1:10)), by = 1:nrow(dt) ]
  dt2
}

sb <- function() {
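  # CJ() builds all (V1, year) combinations; joining back on V1 repeats each row 10 times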
  dt2 <- dt[CJ(V1, year = 1:10), on = "V1"]
}

gregor <- function() {
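  # cross join of data.tables: repeat each row of DT2 once per row of DT1, then cbind (DT1 is recycled to match)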
  CJDT <- function(...) {
    Reduce(function(DT1, DT2) cbind(DT1, DT2[rep(1:.N, each=nrow(DT1))]), list(...))
  }
  years = data.table(year = 1:10, key = "year")
  setkey(dt)
  dt3 = CJDT(dt, years)
  dt3
}

bindlist <- function() {
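  # build one copy of dt per year, each with a year column added, then stack them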
  dt3 <- rbindlist(lapply(1:10, function(x) {
    dt$year <- x
    dt
  }))
  # dt3 <- setcolorder(dt3, c("nrow", "V1", "V2", "year")) # to get exactly same dt
  # dt3 <- dt3[order(nrow)]
  dt3
}



Benchmark

library(bench)
res <- mark(
  original = original(),
  sb = sb(),
  gregor = gregor(),
  bindlist = bindlist(),
  iterations = 1,
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
res
#> # A tibble: 4 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 original      5.88s    5.88s     0.170    1.72GB   16.0  
#> 2 sb            1.76s    1.76s     0.570    1.73GB    0.570
#> 3 gregor        1.87s    1.87s     0.536  972.86MB    0    
#> 4 bindlist   558.69ms 558.69ms     1.79     1.12GB    0

summary(res, relative = TRUE)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 x 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 original   10.5   10.5       1         1.81      Inf
#> 2 sb          3.14   3.14      3.35      1.82      Inf
#> 3 gregor      3.34   3.34      3.15      1         NaN
#> 4 bindlist    1      1        10.5       1.18      NaN

Created on 2019-12-03 by the reprex package (v0.3.0)

Now the results are not exactly the same (see the commented code in my solution for correcting that), but they are equivalent to what you are trying to do. My lapply plus rbindlist solution is, surprisingly, the fastest by a factor of more than 3. This might change on the full task, but I doubt it.
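Not part of the original answer, but to make "equivalent" concrete: a minimal sketch of how one might check that both approaches produce the same rows, assuming row order does not matter (and that memory allows holding both results). It drops the nrow counter column that by = 1:nrow(dt) adds, then compares the tables as sets of rows with data.table's fsetequal():

a <- original()[, !"nrow"]  # drop the grouping counter column added by by = 1:nrow(dt)
b <- bindlist()
fsetequal(a, b)             # should be TRUE: same rows, possibly in a different order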

