Execute foreach loop in parallel or sequentially given a condition


Question


I often end up with several nested foreach loops, and sometimes when writing general functions (e.g. for a package) there is no obvious level to parallelize at. Is there any way to accomplish what the mock-up below describes?

foreach(i = 1:I) %if(I < J) `do` else `dopar`% {
    foreach(j = 1:J) %if(I >= J) `do` else `dopar`% {
        # Do stuff
    }
}

Furthermore, is there some way to detect whether a parallel backend is registered, so I can avoid getting unnecessary warning messages? This would be useful both when checking packages prior to CRAN submission and to avoid bothering users running R on single-core computers.

foreach(i=1:I) %if(is.parallel.backend.registered()) `dopar` else `do`% {
    # Do stuff
}
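
For what it's worth, foreach already exports getDoParRegistered() and getDoParWorkers(), and its operators are ordinary functions, so the choice can be made at run time. A minimal sketch of that idiom (the %op% binding is illustrative, not a built-in feature):

library(foreach)

# Bind %op% to %dopar% only when a real parallel backend is registered;
# otherwise fall back to %do% and avoid the "no parallel backend" warning.
`%op%` <- if (getDoParRegistered() && getDoParWorkers() > 1) `%dopar%` else `%do%`

foreach(i = 1:10) %op% {
    sqrt(i)  # stand-in for the real work
}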

Thanks for your time.

Edit: Thank you very much for all the feedback on cores and workers; you're right that the best way to deal with the above example is to rethink the whole setup. I'd prefer something like the below to the triu idea, but it's essentially the same point. And it could of course also be done with a parallel tapply like Joris suggested.

ij <- expand.grid(i = 1:I, j = 1:J)
foreach(i = ij$i, j = ij$j) %dopar% {
    myFunction(i, j)
}

However, in my attempt to simplify the situation that gave rise to this thread I left out some crucial details. Imagine that I have two functions analyse and batch.analyse and the best level to parallelize at might be different depending on the values of n.replicates and n.time.points.

analyse <- function(x, y, n.replicates=1000){
    foreach(r = 1:n.replicates) %do% {
        # Do stuff with x and y
    }
}
batch.analyse <- function(x, y, n.replicates=10, n.time.points=1000){
    foreach(tp = 1:n.time.points) %do% {
        my.y <- my.func(y, tp)
        analyse(x, my.y, n.replicates)
    }
}

If n.time.points > n.replicates it makes sense to parallelize in batch.analyse but otherwise it makes more sense to parallelize in analyse. Any ideas on how to tackle it? Would it somehow be possible to detect in analyse if parallelization has already taken place?
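
One way to tackle the routing, sketched under the assumption that a flag can be threaded through the call chain (the parallel argument and the local operator bindings below are illustrative additions, not part of the original functions):

analyse <- function(x, y, n.replicates = 1000, parallel = FALSE) {
    # Choose the operator once, based on the flag passed by the caller.
    `%inner%` <- if (parallel) `%dopar%` else `%do%`
    foreach(r = 1:n.replicates) %inner% {
        # Do stuff with x and y
    }
}

batch.analyse <- function(x, y, n.replicates = 10, n.time.points = 1000) {
    # Parallelize whichever loop is longer; the other runs sequentially.
    outer.par <- n.time.points >= n.replicates
    `%outer%` <- if (outer.par) `%dopar%` else `%do%`
    foreach(tp = 1:n.time.points) %outer% {
        my.y <- my.func(y, tp)
        analyse(x, my.y, n.replicates, parallel = !outer.par)
    }
}

Depending on the backend, the outer %dopar% may also need .packages = "foreach" and .export for analyse and my.func so the workers can see them.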

Solution

The issue that you raise was the motivation for the foreach nesting operator, '%:%'. If the body of the inner loop takes a substantial amount of compute time, you're pretty safe using:

foreach(i = 1:I) %:%
    foreach(j = 1:J) %dopar% {
        # Do stuff
    }

This "unrolls" the nested loops, resulting in (I * J) tasks that can all be executed in parallel.

If the body of the inner loop doesn't take much time, the solution is more difficult. The standard solution is to parallelize the outer loop, but that could still result in either many small tasks (when I is large and J is small) or a few large tasks (when I is small and J is large).

My favorite solution is to use the nesting operator with task chunking. Here's a complete example using the doMPI backend:

library(doMPI)
cl <- startMPIcluster()
registerDoMPI(cl)
I <- 100; J <- 2
opt <- list(chunkSize=10)
foreach(i = 1:I, .combine='cbind', .options.mpi=opt) %:%
    foreach(j = 1:J, .combine='c') %dopar% {
        (i * j)
    }
closeCluster(cl)

This results in 20 "task chunks", each consisting of 10 computations of the loop body. If you want to have a single task chunk for each worker, you can compute the chunk size as:

cs <- ceiling((I * J) / getDoParWorkers())
opt <- list(chunkSize=cs)

Unfortunately, not all parallel backends support task chunking. Also, doMPI doesn't support Windows.

For more information on this topic, see my vignette "Nesting Foreach Loops" in the foreach package:

library(foreach)
vignette('nesting')
