data.table and parallel computing


Question


Following this post: multicore and data.table in R, I was wondering if there was a way to use all cores when using data.table, typically doing calculations by groups could be parallelized. It seems that plyr allows such operations by design.

Answer


First thing to check is that data.table FAQ 3.1, point 2, has sunk in:



One memory allocation is made for the largest group only, then that memory is reused for the other groups. There is very little garbage to collect.


That's one reason data.table grouping is quick. But this approach doesn't lend itself to parallelization. Parallelizing means copying the data to the other threads instead, which costs time. But my understanding is that data.table grouping is usually faster than plyr with .parallel on anyway. It depends on the computation time of the task for each group, and whether that compute time can easily be reduced or not. Moving the data around often dominates (when benchmarking 1 or 3 runs of large data tasks).


More often, so far, it's actually some gotcha that's biting in the j expression of [.data.table. For example, recently we saw poor performance from data.table grouping but the culprit turned out to be min(POSIXct) (Aggregating in R over 80K unique ID's). Avoiding that gotcha yielded over 50 times speedup.
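The linked question's gotcha can be sketched roughly as follows. The data here are hypothetical, and the exact speedup will vary, but the pattern is the one described: `min()` dispatching to the `POSIXct` method once per group is what hurts, and working on the underlying numeric avoids it.

```r
library(data.table)

# Hypothetical data: ~80K groups of POSIXct timestamps
set.seed(1)
n  <- 1e6
dt <- data.table(
  id = sample(8e4, n, replace = TRUE),
  ts = as.POSIXct("2012-01-01", tz = "UTC") + sample(1e7, n, replace = TRUE)
)

# Slow: min() dispatches to the POSIXct method for every group
system.time(dt[, min(ts), by = id])

# Faster: take min on the underlying numeric, restore the class once at the end
res <- dt[, .(ts = min(unclass(ts))), by = id]
res[, ts := as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")]
```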


So the mantra is: Rprof, Rprof, Rprof.
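A minimal profiling sketch, using base R's sampling profiler on a made-up grouping call:

```r
library(data.table)

dt <- data.table(grp = sample(1e4, 1e6, replace = TRUE), x = rnorm(1e6))

Rprof("prof.out")                 # start the sampling profiler
ans <- dt[, mean(x), by = grp]    # the call under investigation
Rprof(NULL)                       # stop profiling
summaryRprof("prof.out")$by.self  # time per function, self time first
```

If the top entries are your own per-group functions, the task is compute bound and parallelism may help; if they are data movement or method dispatch, it usually won't.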


Further, point 1 from the same FAQ might be significant:



Only that column is grouped, the other 19 are ignored because data.table inspects the j expression and realises it doesn’t use the other columns.
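To illustrate that FAQ point with invented data (column names here are placeholders): only the column named in j is subset per group, so the other columns add no cost to the grouping itself.

```r
library(data.table)

# 20 columns, but j only touches one of them
dt <- data.table(grp = sample(100, 1e6, replace = TRUE))
for (i in 1:19) set(dt, j = paste0("V", i), value = rnorm(1e6))

# data.table inspects the j expression and subsets only the columns
# it actually uses; the 18 unused V columns are never touched
dt[, sum(V1), by = grp]
```

Setting `options(datatable.verbose = TRUE)` before the call prints what the optimizer decided, which is a quick way to confirm this on your own data.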


So, data.table really doesn't follow the split-apply-combine paradigm at all. It works differently. split-apply-combine lends itself to parallelization but it really doesn't scale to large data.


Also see footnote 3 in the data.table intro vignette:



We wonder how many people are deploying parallel techniques to code that is vector scanning


That's trying to say "sure, parallel is significantly faster, but how long should it really take with an efficient algorithm?".
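The contrast the footnote is driving at can be sketched like this (toy data; timings will differ by machine):

```r
library(data.table)

dt <- data.table(id = sample(1e6), x = rnorm(1e6))

# Vector scan: tests every row — the kind of code one is tempted to parallelize
dt[id == 42L]

# Keyed binary search: sorted lookup, typically so fast the
# parallelization question never arises
setkey(dt, id)
dt[.(42L)]
```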


BUT if you've profiled (using Rprof), and the task per group really is compute intensive, then the 3 posts on datatable-help including the word "multicore" might help:


multicore posts on datatable-help


Of course there are many tasks where parallelization would be nice in data.table, and there is a way to do it. But it hasn't been done yet, since usually other factors bite, so it's been low priority. If you can post reproducible dummy data with benchmarks and Rprof results, that would help increase the priority.
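In the meantime, if profiling shows the per-group task really is compute bound, one workaround is to split the table by group and farm the groups out with the parallel package. This is a hedged sketch, not a data.table feature: `slow_fit` is a made-up expensive task, `split(dt, by = )` needs a reasonably recent data.table, and `mclapply` forks only on Unix-alikes (on Windows use `parLapply` with a cluster instead).

```r
library(data.table)
library(parallel)

dt <- data.table(grp = sample(50, 1e4, replace = TRUE), x = rnorm(1e4))

# Hypothetical compute-heavy per-group task
slow_fit <- function(d) coef(lm(x ~ 1, data = d))

parts <- split(dt, by = "grp")              # split once, by group
res   <- mclapply(parts, slow_fit, mc.cores = 2L)
rbindlist(lapply(res, as.list), idcol = "grp")
```

Note this reintroduces exactly the copying cost the answer warns about, so it only pays off when per-group compute dominates.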

