Speed-up data.table group by using multiple cores and parallel programming
Question
I have a large codebase and the aggregation step is the current bottleneck in terms of speed.
In my code I'd like to speed up the data grouping step. A SNOTE (simple non-trivial example) of my data looks like this:
library(data.table)
a = sample(1:10000000, 50000000, replace = TRUE)
b = sample(c("3m","2m2d2m","3m2d1i3s2d","5m","4m","9m","1m"), 50000000, replace = TRUE)
d = sample(c("3m","2m2d2m","3m2d1i3s2d","5m","4m","9m","1m"), 50000000, replace = TRUE)
e = a
dt = data.table(a = a, b = b, d = d, e = e)
system.time(c.dt <- dt[, list(b = paste(b, collapse=""), d = paste(d, collapse=""), e = e[1]), by = a])
   user  system elapsed
 60.107   3.143  63.534
This is quite fast for such a large data example, but in my case I am still looking for a further speed-up. I have multiple cores available, so I am almost sure there must be a way to use that computational capability.
I am open to changing my data type to data.frame or idata.frame objects (in theory idata.frame is supposedly faster than data.frame).
I did some research and it seems the plyr package has some parallel capabilities that could be helpful, but I am still struggling with how to apply them to the grouping I am trying to do. In another SO post they discuss some of these ideas. I am still unsure how much more I'd achieve with this parallelization since it uses the foreach function. In my experience, foreach is not a good idea for millions of fast operations, because the communication effort between cores ends up slowing down the parallelization effort.
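One way to sidestep the per-group communication overhead mentioned above is to parallelize over chunks of keys rather than over individual groups, so each worker runs a single keyed data.table aggregation. A minimal sketch with base R's parallel package (an assumption on my part, not from the original post; mclapply forks, so this works on Unix-alikes only, and the column names mirror the example above on a smaller stand-in table):

```r
library(data.table)
library(parallel)  # base R; mclapply uses forking (not available on Windows)

# Small stand-in for the real data
set.seed(1)
dt <- data.table(a = sample(1:1000, 10000, replace = TRUE),
                 b = sample(letters, 10000, replace = TRUE))
setkey(dt, a)

# Split the unique keys into one chunk per core, so each worker
# performs one keyed join + grouped aggregation, not one call per group.
n.cores <- 4
keys    <- unique(dt[["a"]])
chunks  <- split(keys, cut(seq_along(keys), n.cores, labels = FALSE))

parts <- mclapply(chunks, function(ks)
  dt[.(ks), list(b = paste(b, collapse = "")), by = a],
  mc.cores = n.cores)

res <- rbindlist(parts)
```

On Windows the same chunking idea works with parLapply over a PSOCK cluster, at the cost of copying the table to each worker.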
Answer
If you have multiple cores available to you, why not leverage the fact that you can quickly filter & group rows in a data.table using its key:
library(doMC)
registerDoMC(cores=4)
setkey(dt, "a")
finalRowOrderMatters = FALSE # FALSE can be faster
foreach(x=unique(dt[["a"]]), .combine="rbind", .inorder=finalRowOrderMatters) %dopar%
  dt[.(x), list(b = paste(b, collapse=""), d = paste(d, collapse=""), e = e[[1]])]
Note that if the number of unique groups (i.e. length(unique(a))) is relatively small, it will be faster to drop the .combine argument, get the results back in a list, and then call rbindlist on the results. In my testing on two cores & 8GB RAM, the threshold was at about 9,000 unique values. Here is what I used to benchmark:
# (option a)
round(rowMeans(replicate(3, system.time({
# ------- #
foreach(x=unique(dt[["a"]]), .combine="rbind", .inorder=FALSE) %dopar%
  dt[.(x), list(b = paste(b, collapse=""), d = paste(d, collapse=""), e = e[[1]])]
# ------- #
}))), 3)
# [1] 1.243 elapsed for N == 1,000
# [1] 11.540 elapsed for N == 10,000, length(unique(dt[["a"]])) == 8617
# [1] 57.404 elapsed for N == 50,000
# (option b)
round(rowMeans(replicate(3, system.time({
# ------- #
results <-
foreach(x=unique(dt[["a"]])) %dopar%
  dt[.(x), list(b = paste(b, collapse=""), d = paste(d, collapse=""), e = e[[1]])]
rbindlist(results)
# ------- #
}))), 3)
# [1] 1.117 elapsed for N == 1,000
# [1] 10.567 elapsed for N == 10,000, length(unique(dt[["a"]])) == 8617
# [1] 76.613 elapsed for N == 50,000
## And used the following to create the dt
N <- 5e4
set.seed(1)
a = sample(1:N, N*2, replace = TRUE)
b = sample(c("3m","2m2d2m","3m2d1i3s2d","5m","4m","9m","1m"), N*2, replace = TRUE)
d = sample(c("3m","2m2d2m","3m2d1i3s2d","5m","4m","9m","1m"), N*2, replace = TRUE)
e = a
dt = data.table(a = a, b = b, d = d, e = e, key="a")