Parallelizing heterogeneous tasks in R: foreach, doMC, doParallel


Question

This has been bothering me for a while:

When you use foreach to schedule a series of tasks that are homogeneous in content but heterogeneous in processing time (not known ex ante), in what order exactly does foreach hand these embarrassingly parallel tasks to the workers?

For instance, say I registered 4 workers with registerDoMC(cores=4) and I have 10 tasks, and the 4th and the 5th each turn out to take longer than all the others combined. The first batch is obviously the 1st, 2nd, 3rd and 4th. Once the 1st, 2nd and 3rd are done, how exactly does foreach assign the remaining tasks? Is it random (which is how it looks from my observations)? And what is good practice for speeding things up when some tasks turn out to take far longer than others?

I am sorry for not providing a concrete example; my actual projects and code are much more involved...

Any experience, guidance, or pointers would be very much appreciated!

Recommended answer

The doMC package is a wrapper around mclapply, and by default mclapply preschedules tasks, which means it splits the tasks into groups (chunks), one per worker, before any work starts. The twist is that it fills those chunks round-robin. Thus, if you have 10 tasks and 4 workers, the tasks will be assigned as follows:

  • Worker 1: tasks 1, 5, 9
  • Worker 2: tasks 2, 6, 10
  • Worker 3: tasks 3, 7
  • Worker 4: tasks 4, 8
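
The assignment above can be reproduced with a few lines of base R. This is only a sketch of the round-robin rule described here, not the actual mclapply source; the variable names (n_tasks, n_workers, assignment) are made up for illustration.

```r
# Round-robin prescheduling: task i goes to worker ((i - 1) %% n_workers) + 1.
n_tasks   <- 10
n_workers <- 4

# Worker index for each task: 1, 2, 3, 4, 1, 2, 3, 4, 1, 2
worker_of_task <- rep_len(seq_len(n_workers), n_tasks)

# Group the task numbers by the worker they were assigned to.
assignment <- split(seq_len(n_tasks), worker_of_task)
names(assignment) <- paste("worker", names(assignment))

print(assignment)
```

Because the chunks are fixed before any task runs, a worker that draws several slow tasks cannot pass them on to a worker that has already finished.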

If you're lucky, this gives reasonable performance even when the tasks have very different lengths. If not, you can disable prescheduling in doMC as follows:

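library(doMC)            # also attaches foreach
registerDoMC(cores = 4)

# Pass preschedule = FALSE through to mclapply via the multicore options
opts <- list(preschedule = FALSE)
results <- foreach(i = 1:10, .options.multicore = opts) %dopar% {
    # ...
}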

This causes doMC to call mclapply with the mc.preschedule=FALSE option, so that each task is handed to a worker as soon as that worker finishes its previous task, which balances the load naturally.
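
To get a feel for the difference, here is a small base R simulation comparing the two strategies. It is an idealized model only: the durations are invented, the function name makespans is hypothetical, and fork and communication overhead are ignored, so real timings will differ.

```r
# Compare total wall-clock time (makespan) of the two scheduling strategies.
# durations: time each task takes, in task order; n_workers: number of workers.
makespans <- function(durations, n_workers) {
  # Prescheduled: tasks are dealt out round-robin before anything runs,
  # so each worker's finish time is the sum of its own tasks.
  worker_of_task <- rep_len(seq_len(n_workers), length(durations))
  prescheduled <- max(tapply(durations, worker_of_task, sum))

  # Dynamic: the next task always goes to the worker that becomes free first.
  free_at <- numeric(n_workers)
  for (d in durations) {
    w <- which.min(free_at)
    free_at[w] <- free_at[w] + d
  }

  c(prescheduled = prescheduled, dynamic = max(free_at))
}

# Scenario from the question: tasks 4 and 5 are slow (they land on different workers).
lucky <- makespans(c(1, 1, 1, 10, 10, 1, 1, 1, 1, 1), n_workers = 4)

# Unlucky ordering: slow tasks 1 and 5 both land on worker 1 when prescheduled.
unlucky <- makespans(c(10, 1, 1, 1, 10, 1, 1, 1, 1, 1), n_workers = 4)

print(lucky)
print(unlucky)
```

In this model the two strategies finish in almost the same time for the first scenario, while in the second the prescheduled run takes roughly twice as long, because both slow tasks queue up behind each other on one worker.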
