How can I reduce the time foreach takes before and after going over the iterations in parallel?


Question

I use foreach + doParallel to apply a function to each row of a matrix in parallel in R. When the matrix has many rows, foreach takes a long time both before and after the parallel pass over the iterations.

For example, if I run:

library(foreach)
library(doParallel)

doWork <- function(data) {

  # setup parallel backend to use many processors
  cores=detectCores()
  number_of_cores_to_use = cores[1]-1 # not to overload the computer
  cat(paste('number_of_cores_to_use:',number_of_cores_to_use))
  cl <- makeCluster(number_of_cores_to_use) 
  # No clusterExport() here: 'ns' and 'weights' are not defined in this example,
  # and foreach exports 'data' to the workers automatically
  registerDoParallel(cl)
  cat('...Starting foreach initialization')

  output <- foreach(i=seq_len(nrow(data)), .combine=rbind) %dopar% {
    cat(i)
    y = data[i,5]
    a = 100
    for (j in 1:3) { # Useless busy work; j keeps the foreach index i intact
      b=matrix(runif(a*a), nrow = a, ncol=a)
    }
    return(runif(10))

  }
  # stop cluster
  cat('...Stop cluster')
  stopCluster(cl)

  return(output)
}

r = 100000
c = 10
data = matrix(runif(r*c), nrow = r, ncol=c)
output = doWork(data)
output[1:10,]

The CPU usage is as follows (100% means all cores are fully utilized):

[Figure: CPU usage]

With annotations:

[Figure: CPU usage, annotated]

How can I optimize the code so that foreach doesn't take a long time before and after the parallel pass over the iterations? The main time sink is the time spent after. It grows significantly with the number of foreach iterations, sometimes making the code as slow as if a simple for loop were used.

Another example (let's assume lm and poly cannot take matrices as arguments):

library(foreach)
library(doParallel)

doWork <- function(data,weights) {

  # setup parallel backend to use many processors
  cores=detectCores()
  number_of_cores_to_use = cores[1]-1 # not to overload the computer
  cat(paste('number_of_cores_to_use:',number_of_cores_to_use))
  cl <- makeCluster(number_of_cores_to_use) 
  clusterExport(cl=cl, varlist=c('weights'))
  registerDoParallel(cl)
  cat('...Starting foreach initialization')

  output <- foreach(i=1:nrow(data), .combine=rbind) %dopar% {
    x = sort(data[i,])
    fit = lm(x[1:(length(x)-1)] ~ poly(x[-1], degree = 2,raw=TRUE), na.action=na.omit, weights=weights)
    return(fit$coef)
  }
  # stop cluster
  cat('...Stop cluster')
  stopCluster(cl)

  return(output)
}

r = 10000 
c = 10
weights=runif(c-1)
data = matrix(runif(r*c), nrow = r, ncol=c)
output = doWork(data,weights)
output[1:10,]

Recommended answer

Try this:

devtools::install_github("privefl/bigstatsr")
library(bigstatsr)
options(bigstatsr.ncores.max = parallel::detectCores())

doWork2 <- function(data, weights, ncores = parallel::detectCores() - 1) {

  big_parallelize(data, p.FUN = function(X.desc, ind, weights) {

    X <- bigstatsr::attach.BM(X.desc)

    output.part <- matrix(0, 3, length(ind))
    for (i in seq_along(ind)) {
      x <- sort(X[, ind[i]])
      fit <- lm(x[1:(length(x)-1)] ~ poly(x[-1], degree = 2, raw = TRUE), 
               na.action = na.omit, weights = weights)
      output.part[, i] <- fit$coef
    }

    t(output.part)
  }, p.combine = "rbind", ncores = ncores, weights = weights)
}

system.time({
  data.bm <- as.big.matrix(t(data))
  output2 <- doWork2(data.bm, weights)
})

all.equal(output, output2, check.attributes = FALSE)

This is twice as fast on my computer (which has only 4 cores). Remarks:

  • Using more than half of the cores is often useless.
  • Your data is not very large, so using a big.matrix may not be useful here.
  • big_parallelize splits the matrix into ncores blocks of columns, applies your function to each block, and then combines the results.
  • Inside the function, it is better to allocate the output before the loop and then fill it than to use a foreach that rbinds all the results.
  • I'm accessing only columns, not rows.
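
The chunking and preallocation ideas in these remarks can also be tried without bigstatsr. The following is a minimal sketch using only the base parallel package, not the answer's method: fit_block and do_work_chunked are hypothetical names, and the small data size and ncores = 2 are assumptions chosen to keep the example quick. Each worker receives one block of rows, fills a preallocated matrix, and the pieces are combined once at the end.

```r
library(parallel)

# Hypothetical helper: fits every row of one block and fills a preallocated matrix.
fit_block <- function(block, weights) {
  out <- matrix(0, nrow = nrow(block), ncol = 3)
  for (k in seq_len(nrow(block))) {
    x <- sort(block[k, ])
    fit <- lm(x[-length(x)] ~ poly(x[-1], degree = 2, raw = TRUE),
              weights = weights)
    out[k, ] <- coef(fit)
  }
  out
}

# Hypothetical driver: one task per worker instead of one task per row.
do_work_chunked <- function(data, weights, ncores = 2) {
  cl <- makeCluster(ncores)
  on.exit(stopCluster(cl))
  # Split the row indices into ncores contiguous chunks and ship only those rows.
  chunks <- splitIndices(nrow(data), ncores)
  blocks <- lapply(chunks, function(ind) data[ind, , drop = FALSE])
  parts <- parLapply(cl, blocks, fit_block, weights = weights)
  do.call(rbind, parts)  # a single combine of ncores pieces
}

# Small example data (sizes are an assumption, much smaller than in the question).
set.seed(42)
r <- 200
c <- 10
weights <- runif(c - 1)
data <- matrix(runif(r * c), nrow = r, ncol = c)

par_out <- do_work_chunked(data, weights)
seq_out <- fit_block(data, weights)  # sequential reference
```

With one task per worker, the scheduling and combining cost no longer grows with the number of rows, which is the part of the run the question identifies as the main time sink.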

So all of these are good practices, yet they are not really relevant for your data. The gain should be higher when using more cores and for larger datasets.

Basically, if you want to be super fast, reimplementing the lm part in Rcpp would be a good solution.
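
Short of writing Rcpp, a lighter step in the same direction may be to skip lm's formula and model-frame machinery and call stats::lm.wfit on an explicit design matrix. This is a swapped-in technique, not the Rcpp reimplementation suggested above, and the sketch assumes the data contain no missing values (lm.wfit does not apply na.action). The variable names are illustrative only.

```r
set.seed(1)
x <- sort(runif(10))
w <- runif(9)
y  <- x[-length(x)]  # response: all but the last sorted value
x1 <- x[-1]          # predictor: all but the first sorted value

# Reference fit, as in the question.
fit_lm <- lm(y ~ poly(x1, degree = 2, raw = TRUE), weights = w)

# Same model with an explicit design matrix: intercept, x1, x1^2.
X <- cbind(1, x1, x1^2)
fit_fast <- lm.wfit(X, y, w)

coef_lm   <- unname(coef(fit_lm))
coef_fast <- unname(fit_fast$coefficients)
```

Both calls solve the same weighted least-squares problem, so the coefficients should agree up to numerical tolerance. Whether the saving matters depends on how much of each iteration is spent in formula handling rather than in the fit itself.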
