Parallelize an R Script

Problem description

The problem with my R script is that it takes too much time, and the main solution I am considering is to parallelize it. I don't know where to start.

My code is as follows:

n <- nrow(aa)
output <- matrix(0, n, n)

akl <- function(dii) {
    ddi <- as.matrix(dii)
    m   <- rowMeans(ddi)
    M   <- mean(ddi)
    r   <- sweep(ddi, 1, m)
    b   <- sweep(r, 2, m)
    return(b + M)
}

for (i in 1:n) {
    A <- akl(dist(aa[i, ]))
    dVarX <- sqrt(mean(A * A))

    for (j in i:n) {
        B <- akl(dist(aa[j, ]))
        V <- sqrt(dVarX * (sqrt(mean(B * B))))
        output[i, j] <- (sqrt(mean(A * B))) / V
    }
}

I would like to parallelize across different CPUs. How can I do that? I saw the SNOW package; is it suitable for my purpose? Thank you for any suggestions, Gab

Recommended answer

I can think of two ways to make your code run faster:

First: As @Dwin was saying (with a small twist), you could precompute akl (yes, not necessarily just dist, but the whole of akl).

# a random square matrix
aa <- matrix(runif(100), ncol=10)
n <- nrow(aa)
output <- matrix (0, n, n)

akl <- function(dii) {
    ddi <- as.matrix(dii)
    m   <- rowMeans(ddi)
    M   <- mean(m) # mean(ddi) == mean(m)
    r   <- sweep(ddi, 1, m)
    b   <- sweep(r, 2, m)
    return(b + M)
}

# precompute akl here: one call per row instead of one per (i, j) pair
library(plyr)
akl.list <- llply(seq_len(nrow(aa)), function(i) {
    akl(dist(aa[i, ]))
})

# Now, apply your function, but index the list instead of computing every time
for (i in 1:n) {
    A     <- akl.list[[i]]
    dVarX <- sqrt(mean(A * A))

    for (j in i:n) {
        B <- akl.list[[j]]
        V <- sqrt (dVarX * (sqrt(mean(B * B))))
        output[i,j] <- (sqrt(mean(A * B))) / V        
    }
}

On larger matrices, this alone should make your code run faster than before, because the original recomputes akl on every pass through the inner loop.
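To make the saving concrete, here is a small sketch that counts how often akl is called by each version. The `counted.akl` wrapper and the names `calls.orig` and `calls.pre` are made up for this illustration, and base `lapply` stands in for `llply` so that no extra package is needed.

```r
# Illustrative only: count akl() calls in the original and precomputed versions.
set.seed(1)
aa <- matrix(runif(100), ncol = 10)
n  <- nrow(aa)

akl <- function(dii) {
    ddi <- as.matrix(dii)
    m   <- rowMeans(ddi)
    M   <- mean(ddi)
    r   <- sweep(ddi, 1, m)
    b   <- sweep(r, 2, m)
    b + M
}

calls <- 0
counted.akl <- function(dii) {   # hypothetical wrapper, not part of the answer
    calls <<- calls + 1
    akl(dii)
}

# Original version: akl is recomputed inside the inner loop
out.orig <- matrix(0, n, n)
for (i in 1:n) {
    A     <- counted.akl(dist(aa[i, ]))
    dVarX <- sqrt(mean(A * A))
    for (j in i:n) {
        B <- counted.akl(dist(aa[j, ]))
        V <- sqrt(dVarX * sqrt(mean(B * B)))
        out.orig[i, j] <- sqrt(mean(A * B)) / V
    }
}
calls.orig <- calls   # n + n * (n + 1) / 2 calls

# Precomputed version: akl is computed once per row
calls <- 0
akl.list <- lapply(seq_len(n), function(i) counted.akl(dist(aa[i, ])))
out.pre <- matrix(0, n, n)
for (i in 1:n) {
    A     <- akl.list[[i]]
    dVarX <- sqrt(mean(A * A))
    for (j in i:n) {
        B <- akl.list[[j]]
        V <- sqrt(dVarX * sqrt(mean(B * B)))
        out.pre[i, j] <- sqrt(mean(A * B)) / V
    }
}
calls.pre <- calls    # n calls

print(c(original = calls.orig, precomputed = calls.pre))
```

For this 10-row example the original makes 10 + 55 = 65 calls and the precomputed version makes 10, and both fill `output` with the same values. The gap grows quadratically with the number of rows.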

Second: In addition to that, you can make it faster still by parallelising as follows:

# now, the parallelisation you require can be achieved as follows
# with the help of `plyr` and `doMC`.

# First step of parallelisation is to compute akl in parallel
library(plyr)
library(doMC)
registerDoMC(10) # 10 Cores/CPUs
akl.list <- llply(1:nrow(aa), function(i) {
    akl(dist(aa[i, ]))
}, .parallel = TRUE)

# then, you could write your for-loop using plyr again as follows
output <- laply(1:n, function(i) {
    A     <- akl.list[[i]]
    dVarX <- sqrt(mean(A * A))

    # row.vals holds output[i, i:n]; named so that it does not mask base::t
    row.vals <- laply(i:n, function(j) {
        B <- akl.list[[j]]
        V <- sqrt(dVarX * (sqrt(mean(B * B))))
        sqrt(mean(A * B)) / V
    })
    # pad with zeros on the left so every row has length n
    c(rep(0, n - length(row.vals)), row.vals)
}, .parallel = TRUE)

Note that I have added .parallel = TRUE only on the outer loop. This is because you assign 10 processors to the outer loop. If you add it to both the outer and inner loops, each of the 10 outer workers can start 10 more, so the total number of processes can reach 10 * 10 = 100. Please take care of this.
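On the SNOW question: the snow-style cluster functions have shipped with R itself in the `parallel` package since R 2.14, so a similar two-step scheme can be sketched without `plyr` or `doMC` (which, as far as I know, relies on forking and is therefore not an option on Windows). The following is only a sketch under stated assumptions: the cap of 2 workers is an arbitrary choice to keep the example light, `n.workers` and `rows` are illustrative names, and `vapply` replaces the inner `laply`. As in the plyr version, only the outer loop is parallel.

```r
library(parallel)  # ships with R; provides the snow-style cluster functions

set.seed(1)
aa <- matrix(runif(100), ncol = 10)
n  <- nrow(aa)

akl <- function(dii) {
    ddi <- as.matrix(dii)
    m   <- rowMeans(ddi)
    M   <- mean(m)
    r   <- sweep(ddi, 1, m)
    b   <- sweep(r, 2, m)
    b + M
}

# Derive the worker count from the machine instead of hard-coding 10.
# detectCores() can return NA, hence the fallback to 1 (assumption: cap at 2).
n.cores   <- detectCores()
n.workers <- if (is.na(n.cores)) 1L else max(1L, min(2L, n.cores - 1L))

cl <- makeCluster(n.workers)

# Step 1: compute the akl matrices in parallel, one task per row.
# aa and akl are passed as arguments so that each worker receives a copy.
akl.list <- parLapply(cl, seq_len(n), function(i, aa, akl) {
    akl(stats::dist(aa[i, ]))
}, aa = aa, akl = akl)

# Step 2: parallelise the outer loop only; the inner loop stays sequential.
rows <- parLapply(cl, seq_len(n), function(i, akl.list, n) {
    A     <- akl.list[[i]]
    dVarX <- sqrt(mean(A * A))
    vals  <- vapply(i:n, function(j) {
        B <- akl.list[[j]]
        sqrt(mean(A * B)) / sqrt(dVarX * sqrt(mean(B * B)))
    }, numeric(1))
    c(rep(0, n - length(vals)), vals)   # zero-pad the lower triangle
}, akl.list = akl.list, n = n)

stopCluster(cl)

output <- do.call(rbind, rows)   # n x n, upper triangle filled
```

Because the inner loop runs inside each worker without starting further workers, the number of processes stays at `n.workers` and the multiplication described above does not occur.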
