Why is the parallel package slower than just using apply?


Question


I am trying to determine when to use the parallel package to speed up the time necessary to run some analysis. One of the things I need to do is create matrices comparing variables in two data frames with differing numbers of rows. I asked a question about an efficient way of doing this on StackOverflow and wrote about the tests on my blog. Since I am comfortable with the best approach, I wanted to speed up the process by running it in parallel. The results below are based on a 2 GHz i7 Mac with 8 GB of RAM. I am surprised that the parallel package, the parSapply function in particular, is worse than just using the apply function. The code to replicate this is below. Note that I am currently only using one of the two columns I create but eventually want to use both.


(来源:
bryer.org )


(source: bryer.org)

require(parallel)
require(ggplot2)
require(reshape2)
set.seed(2112)
results <- list()
sizes <- seq(1000, 30000, by=5000)
pb <- txtProgressBar(min=0, max=length(sizes), style=3)
for(cnt in 1:length(sizes)) {
    i <- sizes[cnt]
    df1 <- data.frame(row.names=1:i, 
                      var1=sample(c(TRUE,FALSE), i, replace=TRUE), 
                      var2=sample(1:10, i, replace=TRUE) )
    df2 <- data.frame(row.names=(i + 1):(i + i), 
                      var1=sample(c(TRUE,FALSE), i, replace=TRUE),
                      var2=sample(1:10, i, replace=TRUE))
    # Sequential: compare each element of df2$var1 against all of df1$var1
    tm1 <- system.time({
        df6 <- sapply(df2$var1, FUN=function(x) { x == df1$var1 })
        dimnames(df6) <- list(row.names(df1), row.names(df2))
    })
    rm(df6)
    # Parallel: tm2 includes cluster startup/shutdown; tm3 times execution only
    tm2 <- system.time({
        cl <- makeCluster(getOption('cl.cores', detectCores()))
        tm3 <- system.time({
            df7 <- parSapply(cl, df1$var1, FUN=function(x, df2) { x == df2$var1 }, df2=df2)
            dimnames(df7) <- list(row.names(df1), row.names(df2))
        })
        stopCluster(cl)
    })
    rm(df7)
    results[[cnt]] <- c(apply=tm1, parallel.total=tm2, parallel.exec=tm3)
    setTxtProgressBar(pb, cnt)
}

toplot <- as.data.frame(results)[,c('apply.user.self','parallel.total.user.self',
                          'parallel.exec.user.self')]
toplot$size <- sizes
toplot <- melt(toplot, id='size')

ggplot(toplot, aes(x=size, y=value, colour=variable)) + geom_line() + 
    xlab('Vector Size') + ylab('Time (seconds)')
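For reference, a sequential baseline worth timing before reaching for parallelism (my addition, not part of the original post): the same comparison matrix can be built in a single vectorized call with `outer`.

```r
set.seed(2112)
i <- 1000
df1 <- data.frame(var1 = sample(c(TRUE, FALSE), i, replace = TRUE))
df2 <- data.frame(var1 = sample(c(TRUE, FALSE), i, replace = TRUE))

# The sapply approach from the benchmark above
m1 <- sapply(df2$var1, function(x) x == df1$var1)
# A single vectorized call building the same i-by-i logical matrix
m2 <- outer(df1$var1, df2$var1, `==`)

identical(m1, m2)  # TRUE
```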

Answer


Running jobs in parallel incurs overhead. Only if the jobs you fire at the worker nodes take a significant amount of time does parallelization improve overall performance. When the individual jobs take only milliseconds, the overhead of constantly firing off jobs degrades overall performance. The trick is to divide the work over the nodes in such a way that each job is sufficiently long, say at least a few seconds. I used this to great effect running six Fortran models simultaneously; those individual model runs took hours, which made the overhead negligible.


Note that I haven't run your example, but the situation I describe above is often the issue when parallelization takes longer than running sequentially.
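One way to make each job substantial, as a minimal sketch (the chunking scheme here is my illustration, not part of the answer, and assumes the goal is the same element-wise comparison matrix): split the vector into one block per worker and compare whole blocks, instead of dispatching one tiny job per element.

```r
library(parallel)

set.seed(2112)
n <- 2000
v1 <- sample(c(TRUE, FALSE), n, replace = TRUE)  # stands in for df1$var1
v2 <- sample(c(TRUE, FALSE), n, replace = TRUE)  # stands in for df2$var1

cl <- makeCluster(2)

# Fine-grained: one tiny job per element of v2 -- dispatch overhead dominates
m_fine <- parSapply(cl, v2, function(x, v1) x == v1, v1 = v1)

# Coarse-grained: one job per worker; each job builds a whole block of columns
chunks <- splitIndices(length(v2), length(cl))
blocks <- parLapply(cl, chunks, function(idx, v1, v2) {
    outer(v1, v2[idx], `==`)
}, v1 = v1, v2 = v2)
m_coarse <- do.call(cbind, blocks)

stopCluster(cl)

identical(m_fine, m_coarse)  # TRUE: same matrix, far fewer job dispatches
```

The total work is identical; only the granularity changes, so the per-job dispatch and serialization costs are paid once per worker rather than once per element.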
