Combining Multicore with Snow Cluster
Question
Fairly new to Parallel R. Quick question. I have an algorithm that is computationally intensive. Fortunately, it can easily be broken up into pieces to make use of multicore or snow. What I would like to know is if it is considered fine in practice to use multicore in conjunction with snow?
What I would like to do is split up my load to run on multiple machines in a cluster and, for each machine, utilize all of its cores. For this type of processing, is it reasonable to mix snow with multicore?
Answer
I have used the approach suggested above by lockedoff: use the parallel package to distribute an embarrassingly parallel workload over multiple machines with multiple cores. First the workload is distributed over all machines, and then the workload of each machine is distributed over all its cores. The disadvantage of this approach is that there is no load balancing between machines (at least I don't know how to do it).
All loaded R code should be identical and in the same location on all machines (e.g. checked out from svn). Because initializing the clusters takes quite some time, the code below can be improved by reusing the created clusters.
foo <- function(workload, otherArgumentsForFoo) {
  source("/home/user/workspace/mycode.R")
  ...
}

distributedFooOnCores <- function(workload) {
  # Somehow assign a batch number to every record
  workload$ParBatchNumber = NA
  # Split the assigned workload into batches according to ParBatchNumber
  batches = by(workload, workload$ParBatchNumber, function(x) x)
  # Create a cluster with one worker per core on this machine
  library("parallel")
  cluster = makeCluster(detectCores(), outfile = "distributedFooOnCores.log")
  batches = parLapply(cluster, batches, foo, otherArgumentsForFoo)
  stopCluster(cluster)
  # Merge the resulting batches
  results = someEmptyDataframe
  p = 1
  for (i in 1:length(batches)) {
    results[p:(p + nrow(batches[[i]]) - 1), ] = batches[[i]]
    p = p + nrow(batches[[i]])
  }
  # Clean up
  workload$ParBatchNumber = NULL
  return(invisible(results))
}

distributedFooOnMachines <- function(workload) {
  # Somehow assign a batch number to every record
  workload$DistrBatchNumber = NA
  # Split the assigned workload into batches according to DistrBatchNumber
  batches = by(workload, workload$DistrBatchNumber, function(x) x)
  # Create a cluster with workers on all machines
  library("parallel")
  # If makeCluster hangs, make sure passwordless ssh is configured on all machines
  cluster = makeCluster(c("machine1", "etc"), master = "ub2", user = "", outfile = "distributedFooOnMachines.log")
  batches = parLapply(cluster, batches, foo, otherArgumentsForFoo)
  stopCluster(cluster)
  # Merge the resulting batches
  results = someEmptyDataframe
  p = 1
  for (i in 1:length(batches)) {
    results[p:(p + nrow(batches[[i]]) - 1), ] = batches[[i]]
    p = p + nrow(batches[[i]])
  }
  # Clean up
  workload$DistrBatchNumber = NULL
  return(invisible(results))
}
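To actually combine the two levels the question asks about, one option is to start a snow-style PSOCK cluster across machines and call mclapply() inside the worker function. The sketch below is a minimal, self-contained version of that idea: it uses "localhost" workers in place of remote machines so it runs without any ssh setup, and squares numbers in place of a real foo(); the host names, chunking, and worker function are illustrative, not part of the original code.

```r
library(parallel)

# Two local workers stand in for two machines; with real remote hosts you
# would pass their names and configure passwordless ssh, as noted above.
hosts <- c("localhost", "localhost")
cluster <- makePSOCKcluster(hosts)

# Runs on one "machine": spreads its chunk over the local cores via mclapply.
# mclapply uses fork(), so mc.cores must stay 1 on Windows.
parallelSquare <- function(chunk) {
  cores <- if (.Platform$OS.type == "windows") 1L else parallel::detectCores()
  unlist(parallel::mclapply(chunk, function(x) x^2, mc.cores = cores))
}

chunks <- split(1:8, rep(1:2, each = 4))  # one chunk per "machine"
results <- parLapply(cluster, chunks, parallelSquare)
stopCluster(cluster)
unlist(results, use.names = FALSE)  # 1 4 9 16 25 36 49 64
```

With remote hosts, each PSOCK worker handles one chunk over the network, while mclapply fans the chunk out over that machine's cores locally, so only one connection per machine is needed.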
I'm interested in how the approach above can be improved.
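One improvement for the load-balancing drawback mentioned above: the parallel package also provides parLapplyLB(), which hands out elements one at a time so faster workers pull new batches instead of idling. A minimal sketch, using a small local cluster and a dummy task with deliberately uneven runtimes (the task and timings are illustrative only):

```r
library(parallel)

cluster <- makeCluster(2)
slowTask <- function(x) {
  Sys.sleep(runif(1, 0, 0.05))  # simulate batches of uneven cost
  x * 2
}
# The load-balanced variant of parLapply: dispatches one element at a time
results <- parLapplyLB(cluster, 1:16, slowTask)
stopCluster(cluster)
unlist(results)  # 2 4 6 ... 32
```

The trade-off is more communication round-trips, so load balancing pays off when individual batches are expensive relative to the dispatch overhead.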