通过使用多核和并行编程加速 data.table 组 [英] Speed-up data.table group by using multiple cores and parallel programming
问题描述
我的代码很大,聚合步骤是当前速度方面的瓶颈.
I have a large code and the aggregation step is the current bottleneck in terms of speed.
在我的代码中,我希望加快数据分组步骤的速度.我的数据的 SNOTE(简单的非平凡示例)如下所示:
In my code I'd like to speed-up the data grouping step to be faster. A SNOTE (simple non trivial example) of my data looks like this:
library(data.table)
a = sample(1:10000000, 50000000, replace = TRUE)
b = sample(c("3m","2m2d2m","3m2d1i3s2d","5m","4m","9m","1m"), 50000000, replace = TRUE)
d = sample(c("3m","2m2d2m","3m2d1i3s2d","5m","4m","9m","1m"), 50000000, replace = TRUE)
e = a
dt = data.table(a = a, b = b, d = d, e = e)
system.time(c.dt <- dt[,list(b = paste(b, collapse=""), d = paste(d, collapse=""), e = e[1], by=a)])
user system elapsed
60.107 3.143 63.534
对于如此大的数据示例,这相当快,但就我而言,我仍在寻找进一步的加速.就我而言,我有多个内核,所以我几乎可以肯定一定有一种方法可以使用这种计算能力.
This is quite fast for such large data example but in my case I am still looking for further speed-up. In my case I have multiple cores so I am almost sure there must be a way to use such computational capability.
我愿意将我的数据类型更改为 data.frame 或 idata.frame 对象(理论上 idata.frame 应该比 data.frames 更快).
I am open to changing my data type to a data.frame, or idata.frame objects (in theory idata.frame are supposedly faster than data.frames).
我做了一些研究,似乎 plyr 包有一些并行功能可能会有所帮助,但我仍在努力为我正在尝试做的分组做这件事.在另一篇 SO 帖子中,他们讨论了其中一些想法.由于它使用了 foreach 函数,我仍然不确定通过这种并行化可以实现多少.根据我的经验, foreach 函数 不是很好数百万次快速操作的想法,因为内核之间的通信工作最终会减慢并行化工作.
I did some research and seems the plyr package has some parallel capabilities that could be helpful but I am still struggling on how to do it for the grouping I am trying to do. In another SO post they discuss some of these ideas. I am still unsure on how much more I'd achieve with this parallelization since it uses the foreach function. In my experience the foreach function is not a good idea for millions of fast operations because the communication effort between cores ends up slowing down the parallelization effort.
推荐答案
如果您有多个内核可用,为什么不利用您可以快速过滤 &使用其键对 data.table 中的行进行分组:
If you have multiple cores available to you, why not leverage the fact that you can quickly filter & group rows in a data.table using its key:
library(doMC)
registerDoMC(cores=4)
setkey(dt, "a")
finalRowOrderMatters = FALSE # FALSE can be faster
foreach(x=unique(dt[["a"]]), .combine="rbind", .inorder=finalRowOrderMatters) %dopar%
dt[.(x) ,list(b = paste(b, collapse=""), d = paste(d, collapse=""), e = e[[1]])]
<小时>
请注意,如果唯一组的数量(即 length(unique(a))
)相对较少,则删除 .combine
参数会更快,将结果返回到列表中,然后对结果调用 rbindlist
.在我对两个核心的测试中8GB RAM,阈值约为 9,000 个唯一值.这是我用来进行基准测试的内容:
Note that if the number of unique groups (ie length(unique(a))
) is relatively small, it will be faster to drop the .combine
argument, get the results back in a list, then call rbindlist
on the results. In my testing on two cores & 8GB RAM, the threshold was at about 9,000 unique values. Here is what I used to benchmark:
# (otion a)
round(rowMeans(replicate(3, system.time({
# ------- #
foreach(x=unique(dt[["a"]]), .combine="rbind", .inorder=FALSE) %dopar%
dt[.(x) ,list(b = paste(b, collapse=""), d = paste(d, collapse=""), e = e[[1]])]
# ------- #
}))), 3)
# [1] 1.243 elapsed for N == 1,000
# [1] 11.540 elapsed for N == 10,000, length(unique(dt[["a"]])) == 8617
# [1] 57.404 elapsed for N == 50,000
# (otion b)
round(rowMeans(replicate(3, system.time({
# ------- #
results <-
foreach(x=unique(dt[["a"]])) %dopar%
dt[.(x) ,list(b = paste(b, collapse=""), d = paste(d, collapse=""), e = e[[1]])]
rbindlist(results)
# ------- #
}))), 3)
# [1] 1.117 elapsed for N == 1,000
# [1] 10.567 elapsed for N == 10,000, length(unique(dt[["a"]])) == 8617
# [1] 76.613 elapsed for N == 50,000
## And used the following to create the dt
N <- 5e4
set.seed(1)
a = sample(1:N, N*2, replace = TRUE)
b = sample(c("3m","2m2d2m","3m2d1i3s2d","5m","4m","9m","1m"), N*2, replace = TRUE)
d = sample(c("3m","2m2d2m","3m2d1i3s2d","5m","4m","9m","1m"), N*2, replace = TRUE)
e = a
dt = data.table(a = a, b = b, d = d, e = e, key="a")
这篇关于通过使用多核和并行编程加速 data.table 组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!