doParallel performance on a tensor in R


Question

I need to perform some operations on a tensor and I would like make this parallel. Consider the following example:

# first part without doParallel

N = 8192
M = 128
F = 64

# centered moving average of length n, using stats::filter
ma <- function(x,n=5){filter(x,rep(1/n,n), sides=2)}


m <- array(rexp(N*M*F), dim=c(N,M,F))

new_m <- array(0, dim=c(N,M,F))

# smooth each fiber m[i,,j] and replace the trailing NA produced by the filter
system.time ( for(i in 1:N) {
        for(j in 1:F) {
            ma_r <- ma(m[i,,j],2)
            ma_r <- c(ma_r[-length(ma_r)], ma_r[(length(ma_r)-1)])
            new_m[i,,j] <- ma_r
        }
    }
)

This takes around 38 seconds in my laptop. The following is with doParallel:

# second part with doParallel

library(doParallel)  
no_cores <- detectCores() - 1  
cl <- makeCluster(no_cores, type="FORK")  
registerDoParallel(cl)


# apply the same per-column smoothing to a single M x F slice
calcMat <- function(x){

    n <- dim(x)[1]
    m <- dim(x)[2]

    new_x <- matrix(0, nrow=n, ncol=m)

    for(j in 1:ncol(x)) {
        ma_r <- ma(x[,j],2)
        ma_r <- c(ma_r[-length(ma_r)], ma_r[(length(ma_r)-1)])
        new_x[,j] <- ma_r       
    }

    return(new_x)

}


system.time ( a_list <- foreach(i=1:N) %dopar% {
    m_m <- m[i,,]
    new_m_m <- calcMat(m_m)
 }
)


# reassemble the list of M x F matrices into an N x M x F array
Y <- array(unlist(a_list), dim = c(nrow(a_list[[1]]), ncol(a_list[[1]]), length(a_list)))
Y <- aperm(Y, c(3,1,2))


stopCluster(cl) 

This second version takes around 36 seconds, so I do not see any improvement in terms of time. Does anyone know the reason for that?

Answer

You need to be aware of certain things when you want to use parallelization. The first one is that there is an overhead due to communication and possibly serialization. As a very crude example, consider the following:

num_cores <- 2L
cl <- makeCluster(num_cores, type="FORK")
registerDoParallel(cl)

exec_time <- system.time({
    a_list <- foreach(i=1L:2L) %dopar% {
        system.time({
            m_m <- m[i,,]
            new_m_m <- calcMat(m_m)
        })
    }
})

In my system, exec_time shows an elapsed time of 1.264 seconds, whereas the elapsed times in a_list each show 0.003 seconds. So, in a very simplified way, we could say that 99.7% of the execution time was overhead. This has to do with task granularity. Different types of tasks benefit from different types of granularity. In your case, you can benefit from chunking your tasks coarsely: you group tasks in a way that reduces communication overhead:

chunks <- splitIndices(N, num_cores)
str(chunks)
List of 2
 $ : int [1:4096] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:4096] 4097 4098 4099 4100 4101 4102 4103 4104 4105 4106 ...

Each chunk has indices for several tasks, so you need to modify your code appropriately:

exec_time_chunking <- system.time({
    a_list <- foreach(chunk=chunks, .combine=c) %dopar% {
        lapply(chunk, function(i) {
            m_m <- m[i,,]
            calcMat(m_m)
        })
    }
})

The above completed in 17.978 seconds in my system, using 2 parallel workers.
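
If you still need the result as an N x M x F array, the chunked result can be rebuilt the same way as in the question, since .combine=c concatenates the per-chunk lists back into one list of N matrices (a minimal sketch reusing the Y/aperm step from above):

# a_list is again a length-N list of M x F matrices, so the original reassembly applies
Y <- array(unlist(a_list), dim = c(nrow(a_list[[1]]), ncol(a_list[[1]]), length(a_list)))
Y <- aperm(Y, c(3,1,2))   # back to N x M x F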

As a side note, I think there's usually no good reason to set the number of parallel workers to detectCores() - 1L, since the main R process has to wait for all parallel workers to finish anyway, but maybe you have other reasons, perhaps maintaining system responsiveness.
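
Also note that FORK clusters are only available on Unix-alikes, so the code above will not run on Windows as written. A minimal sketch of a cross-platform cluster setup (the cl_type name is just illustrative):

# fall back to a PSOCK cluster on Windows, where fork() is not available
cl_type <- if (.Platform$OS.type == "windows") "PSOCK" else "FORK"
cl <- makeCluster(max(detectCores() - 1L, 1L), type = cl_type)
registerDoParallel(cl)
# with a PSOCK cluster, objects such as m, ma and calcMat are not inherited,
# so they have to be exported to the workers (e.g. via foreach's .export argument)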

