Combining Multicore with Snow Cluster

Problem Description

Fairly new to parallel R. Quick question. I have an algorithm that is computationally intensive. Fortunately, it can easily be broken up into pieces to make use of multicore or snow. What I would like to know is whether it is considered fine in practice to use multicore in conjunction with snow?

What I would like to do is split up my load to run on multiple machines in a cluster, and on each machine I would like to utilize all of its cores. For this type of processing, is it reasonable to mix snow with multicore?

Answer

I have used the approach suggested above by lockedoff, that is, using the parallel package to distribute an embarrassingly parallel workload over multiple machines, each with multiple cores. First the workload is distributed over all machines, and then each machine's workload is distributed over all of its cores. The disadvantage of this approach is that there is no load balancing between machines (at least I don't know how to do it).
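
For concreteness, here is a minimal sketch of that two-level pattern (the hostnames, doPiece, and listOfChunks below are hypothetical, not taken from the code further down): one snow-style worker is started per machine, and each worker then fans out over its local cores with mclapply, which forks and therefore requires a Unix-like OS.

library(parallel)

# Hypothetical per-piece worker; stands in for the expensive computation
doPiece <- function(piece) {
    piece
}

# listOfChunks is assumed to hold one list of pieces per machine
machineCluster = makeCluster(c("machine1", "machine2"))
clusterExport(machineCluster, "doPiece")
results = parLapply(machineCluster, listOfChunks, function(chunk) {
    # Second level: fork over all local cores of this machine
    parallel::mclapply(chunk, doPiece, mc.cores = parallel::detectCores())
})
stopCluster(machineCluster)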

All loaded R code should be identical and in the same location on all machines (e.g. kept in sync via svn). Because initializing the clusters takes quite some time, the code below could be improved by reusing the created clusters.

foo <- function(workload, otherArgumentsForFoo) {
    source("/home/user/workspace/mycode.R")
    ...
}

distributedFooOnCores <- function(workload) {
    # Somehow assign a batch number to every record
    workload$ParBatchNumber = NA
    # Split the assigned workload into batches according to ParBatchNumber
    batches = by(workload, workload$ParBatchNumber, function(x) x)

    # Create a cluster with workers on all machines 
    library("parallel")
    cluster = makeCluster(detectCores(), outfile="distributedFooOnCores.log")
    batches = parLapply(cluster, batches, foo, otherArgumentsForFoo)
    stopCluster(cluster)

    # Merge the resulting batches
    results = someEmptyDataframe
    p = 1
    for(i in 1:length(batches)){
        results[p:(p + nrow(batches[[i]]) - 1), ] = batches[[i]]
        p = p + nrow(batches[[i]])      
    }

    # Clean up
    workload$ParBatchNumber = NULL
    return(invisible(results))
}

distributedFooOnMachines <- function(workload) {
    # Somehow assign a batch number to every record
    workload$DistrBatchNumber = NA
    # Split the assigned workload into batches according to DistrBatchNumber
    batches = by(workload, workload$DistrBatchNumber, function(x) x)

    # Create a cluster with workers on all machines 
    library("parallel")
    # If makeCluster hangs, please make sure passwordless ssh is configured on all machines
    cluster = makeCluster(c("machine1", "etc"), master="ub2", user="", outfile="distributedFooOnMachines.log")
    batches = parLapply(cluster, batches, foo, otherArgumentsForFoo)
    stopCluster(cluster)

    # Merge the resulting batches
    results = someEmptyDataframe
    p = 1
    for(i in 1:length(batches)){
        results[p:(p + nrow(batches[[i]]) - 1), ] = batches[[i]]
        p = p + nrow(batches[[i]])      
    }

    # Clean up
    workload$DistrBatchNumber = NULL
    return(invisible(results))
}
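
Because cluster start-up dominates, one easy improvement is to create the cluster once and pass it into the distribution function, stopping it only after all calls are done. A minimal sketch of that refactor (the function and workload names here are hypothetical):

distributedFooReusing <- function(cluster, workload) {
    # Same batching as above, but the cluster is created and stopped
    # by the caller, so its start-up cost is paid only once
    workload$ParBatchNumber = NA   # somehow assign a batch number
    batches = by(workload, workload$ParBatchNumber, function(x) x)
    parLapply(cluster, batches, foo, otherArgumentsForFoo)
}

cluster = makeCluster(detectCores(), outfile="distributedFoo.log")
resultsA = distributedFooReusing(cluster, workloadA)
resultsB = distributedFooReusing(cluster, workloadB)   # no second start-up cost
stopCluster(cluster)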

I'm interested in how the approach above can be improved.
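
One concrete answer to the load-balancing question: the parallel package also provides parLapplyLB, a load-balanced variant of parLapply that sends each batch to whichever worker becomes free first. Provided there are more batches than workers, swapping it in for the parLapply call above should even out differences between machines:

# Drop-in replacement for the parLapply call in distributedFooOnMachines;
# batches are handed out one at a time as workers finish
batches = parLapplyLB(cluster, batches, foo, otherArgumentsForFoo)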
