Why does foreach %dopar% get slower with each additional node?


I wrote a simple matrix multiplication to test out the multithreading/parallelization capabilities of my network, and I noticed that the computation was much slower than expected.

The test is simple: multiply 2 matrices (4096x4096) and return the computation time. Neither the matrices nor the results are stored. The computation time is not trivial (50-90 seconds, depending on your processor).

The conditions: I repeated this computation 10 times using 1 processor, split these 10 computations across 2 processors (5 each), then 3 processors, ... up to 10 processors (1 computation per processor). I expected the total computation time to decrease in stages, with 10 processors completing the computations 10 times as fast as one processor.

The results: Instead, what I got was only a 2-fold reduction in computation time, which is 5 times SLOWER than expected.

When I computed the average computation time per node, I expected each processor to compute the test in the same amount of time (on average) regardless of the number of processors assigned. I was surprised to see that merely sending the same operation to multiple processors slowed down the average computation time of each processor.

Can anyone explain why this is happening?

Note: this question is NOT a duplicate of these questions:

foreach %dopar% slower than for loop

or

Why is the parallel package slower than just using apply?

Because the test computation is not trivial (i.e. 50-90 seconds, not 1-2 seconds), and because there is no communication between processors that I can see (i.e. no results are returned or stored other than the computation time).

I have attached the script and functions below for replication.

library(foreach); library(doParallel);library(data.table)
# functions adapted from
# http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/BLAS_Testing.html

Matrix.Multiplier <- function(Dimensions=2^12){
  # Creates two Dimensions x Dimensions matrices, multiplies them, and
  # returns only the elapsed multiplication time; nothing else is kept
  m1 <- Dimensions; m2 <- Dimensions; n <- Dimensions
  z1 <- runif(m1*n); dim(z1) <- c(m1,n)
  z2 <- runif(m2*n); dim(z2) <- c(m2,n)
  a <- proc.time()[3]
  z3 <- z1 %*% t(z2)
  b <- proc.time()[3]
  elapsed <- b - a            # named 'elapsed' to avoid masking base::c
  names(elapsed) <- NULL
  rm(z1,z2,z3,m1,m2,n,a,b); gc()
  return(elapsed)
}

Nodes <- 10
Results <- NULL
for(i in 1:Nodes){
  cl <- makeCluster(i)
  registerDoParallel(cl)
  ptm <- proc.time()[3]
  i.Node.times <- foreach(z=1:Nodes, .combine="c", .multicombine=TRUE,
                          .inorder=FALSE) %dopar% {
                            Matrix.Multiplier(Dimensions=2^12)  # value is collected by foreach
                          }
  etm <- proc.time()[3]
  i.TotalTime <- etm-ptm
  i.Times <- cbind(Operations=Nodes,Node.No=i,Avr.Node.Time=mean(i.Node.times),
                   sd.Node.Time=sd(i.Node.times),
                   Total.Time=i.TotalTime)
  Results <- rbind(Results,i.Times)
  rm(ptm,etm,i.Node.times,i.TotalTime,i.Times)
  stopCluster(cl)
}
library(data.table)
Results <- data.table(Results)
Results[,lower:=Avr.Node.Time-1.96*sd.Node.Time]
Results[,upper:=Avr.Node.Time+1.96*sd.Node.Time]
# Expected totals under perfect scaling: with i workers, the 10 tasks take
# ceiling(10/i) sequential rounds of the single-worker average time
# (10, 5, 4, 3, 2, 2, 2, 2, 2, 1 rounds for 1..10 workers).
Exp.Total <- Results[Node.No==1, Avr.Node.Time] * ceiling(10/(1:10))
Results[,Exp.Total.Time:=Exp.Total]

jpeg("Multithread_Test_TotalTime_Results.jpeg")
par(oma=c(0,0,0,0)) # set outer margin to zero
par(mar=c(3.5,3.5,2.5,1.5)) # number of lines per margin (bottom,left,top,right)
plot(x=Results[,Node.No],y=Results[,Total.Time],  type="o", xlab="", ylab="",ylim=c(80,900),
     col="blue",xaxt="n", yaxt="n", bty="l")
title(main="Time to Complete 10 Multiplications", line=0,cex.lab=3)
title(xlab="Nodes",line=2,cex.lab=1.2,
      ylab="Total Computation Time (secs)")
axis(2, at=seq(80, 900, by=100), tick=TRUE, labels=FALSE)
axis(2, at=seq(80, 900, by=100), tick=FALSE, labels=TRUE, line=-0.5)
axis(1, at=Results[,Node.No], tick=TRUE, labels=FALSE)
axis(1, at=Results[,Node.No], tick=FALSE, labels=TRUE, line=-0.5)
lines(x=Results[,Node.No],y=Results[,Exp.Total.Time], type="o",col="red")
legend('topright','groups',
       legend=c("Measured", "Expected"), bty="n",lty=c(1,1),
       col=c("blue","red"))
dev.off()

jpeg("Multithread_Test_PerNode_Results.jpeg")
par(oma=c(0,0,0,0)) # set outer margin to zero
par(mar=c(3.5,3.5,2.5,1.5)) # number of lines per margin (bottom,left,top,right)
plot(x=Results[,Node.No],y=Results[,Avr.Node.Time],  type="o", xlab="", ylab="",
     ylim=c(50,500),col="blue",xaxt="n", yaxt="n", bty="l")
title(main="Per Node Multiplication Time", line=0,cex.lab=3)
title(xlab="Nodes",line=2,cex.lab=1.2,
      ylab="Computation Time (secs) per Node")
axis(2, at=seq(50,500, by=50), tick=TRUE, labels=FALSE)
axis(2, at=seq(50,500, by=50), tick=FALSE, labels=TRUE, line=-0.5)
axis(1, at=Results[,Node.No], tick=TRUE, labels=FALSE)
axis(1, at=Results[,Node.No], tick=FALSE, labels=TRUE, line=-0.5)
abline(h=Results[Node.No==1][,Avr.Node.Time], col="red")
epsilon = 0.2
segments(Results[,Node.No],Results[,lower],Results[,Node.No],Results[,upper])
segments(Results[,Node.No]-epsilon,Results[,upper],
         Results[,Node.No]+epsilon,Results[,upper])
segments(Results[,Node.No]-epsilon, Results[,lower],
         Results[,Node.No]+epsilon,Results[,lower])
legend('topleft','groups',
       legend=c("Measured", "Expected"), bty="n",lty=c(1,1),
       col=c("blue","red"))
dev.off()

EDIT: Response to @Hong Ooi's comment

I used lscpu in UNIX to get:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                30
On-line CPU(s) list:   0-29
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             30
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:              2
CPU MHz:               2394.455
BogoMIPS:              4788.91
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-29

EDIT: Response to @Steve Weston's comment.

I am using a virtual machine network (but I'm not the admin) with access to up to 30 cores. I ran the test you suggested: opened up 5 R sessions and ran the matrix multiplication in 1, 2, ... 5 sessions simultaneously (or as quickly as I could tab over and execute). I got very similar results to before (i.e. each additional process slows down all the individual sessions). Note that I checked memory usage using top and htop, and usage never exceeded 5% of capacity (~2.5/64 GB).

CONCLUSIONS:

The problem seems to be R-specific. When I run other multi-threaded commands with other software (e.g. PLINK), I don't run into this problem and parallel processes run as expected. I have also tried running the above with Rmpi and doMPI, with the same (slower) results. The problem appears to be related to R sessions/parallelized commands on the virtual machine network. What I really need help on is how to pinpoint the problem. A similar problem seems to be pointed out here

Solution

I find the per-node multiplication time very interesting because the timings don't include any of the overhead associated with the parallel loop, but only the time to perform the matrix multiplication, and they show that the time increases with the number of matrix multiplications executing in parallel on the same machine.

I can think of two reasons why that might happen:

  1. The memory bandwidth of the machine is saturated by the matrix multiplications before you run out of cores;
  2. The matrix multiplication is multi-threaded.

You can test for the first situation by starting multiple R sessions (I did this in multiple terminals), creating two matrices in each session:

> x <- matrix(rnorm(4096*4096), 4096)
> y <- matrix(rnorm(4096*4096), 4096)

and then executing a matrix multiplication in each of those sessions at about the same time:

> system.time(z <- x %*% t(y))

Ideally, this time will be the same regardless of the number of R sessions you use (up to the number of cores), but since matrix multiplication is a rather memory intensive operation, many machines will run out of memory bandwidth before they run out of cores, causing the times to increase.

If your R installation was built with a multi-threaded math library, such as MKL or ATLAS, then you could be using all of your cores with a single matrix multiplication, so you can't expect better performance by using multiple processes unless you use multiple computers.

You can use a tool such as "top" to see if you're using a multi-threaded math library.
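
If you'd rather check from inside R: on reasonably recent versions of R, sessionInfo() reports which BLAS/LAPACK libraries are loaded, and the third-party RhpcBLASctl package can query and pin the BLAS thread count. Here is a minimal sketch under those assumptions (neither tool appears in the original post):

# Minimal sketch: inspect and control the BLAS from within R.
# Assumes a recent R (sessionInfo() prints BLAS/LAPACK paths) and that
# the optional RhpcBLASctl package is installed.
sessionInfo()                    # look for the BLAS/LAPACK lines

# install.packages("RhpcBLASctl")
library(RhpcBLASctl)
blas_get_num_procs()             # > 1 suggests a multi-threaded BLAS

# Pin the BLAS to one thread so it doesn't compete with foreach workers:
blas_set_num_threads(1)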

Finally, the output from lscpu suggests that you're using a virtual machine. I've never done any performance testing on multi-core virtual machines, but that could also be a source of problems.


Update

I believe the reason that your parallel matrix multiplications run more slowly than a single matrix multiplication is that your CPU isn't able to read memory fast enough to feed more than about two cores at full speed, which I referred to as saturating your memory bandwidth. If your CPU had large enough caches, you might be able to avoid this problem, but it doesn't really have anything to do with the amount of memory that you have on your motherboard.
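
For a rough sense of scale (my arithmetic, not from the original exchange), here is the working set of one of these multiplications next to the L3 cache size from the lscpu output above:

# Back-of-envelope working-set arithmetic for one 4096x4096 double multiply.
matrix_mb <- 4096 * 4096 * 8 / 2^20   # 8 bytes per double: 128 MiB per matrix
3 * matrix_mb                         # x, t(y) and the result z: ~384 MiB
20480 / 2^10                          # L3 cache from lscpu: only 20 MiB
# Each concurrent multiplication streams hundreds of MiB through DRAM,
# so the shared memory bus saturates well before the cores do.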

I think this is just a limitation of using a single computer for parallel computations. One of the advantages of using a cluster is that your memory bandwidth goes up as well as your total aggregate memory. So if you ran one or two matrix multiplications on each node of a multi-node parallel program, you wouldn't run into this particular problem.
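
If you ever do get access to more than one machine, a PSOCK cluster spanning hosts is one way to do that. The sketch below is hypothetical: "host1" and "host2" are placeholder hostnames, and it assumes passwordless SSH plus R installed on each host:

# Hypothetical multi-machine sketch: each host contributes its own memory
# bandwidth. "host1"/"host2" are placeholders; SSH access is assumed.
library(foreach); library(doParallel)
cl <- makeCluster(c("host1", "host1", "host2", "host2"))
registerDoParallel(cl)
times <- foreach(z = 1:10, .combine = "c", .inorder = FALSE,
                 .export = "Matrix.Multiplier") %dopar% {
  Matrix.Multiplier(Dimensions = 2^12)   # function from the question
}
stopCluster(cl)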

Assuming you don't have access to a cluster, you could try benchmarking a multi-threaded math library such as MKL or ATLAS on your computer. It's very possible that you could get better performance running one multi-threaded matrix multiply than running them in parallel in multiple processes. But be careful when using both a multi-threaded math library and a parallel programming package.
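
As a sketch of that comparison (assuming your R is linked to a multi-threaded BLAS and the RhpcBLASctl package mentioned above is available to set the thread count):

# Sketch: one multi-threaded multiply vs. a single-threaded baseline.
# Assumes a multi-threaded BLAS (e.g. MKL, ATLAS or OpenBLAS) plus RhpcBLASctl.
library(RhpcBLASctl)
x <- matrix(rnorm(4096*4096), 4096)
y <- matrix(rnorm(4096*4096), 4096)

blas_set_num_threads(1)          # baseline: BLAS restricted to one core
system.time(z <- x %*% t(y))

blas_set_num_threads(4)          # let the BLAS itself use four cores
system.time(z <- x %*% t(y))

If the four-thread run is nowhere near four times faster, that again points at memory bandwidth; and pinning the BLAS to one thread, as in the first call, is the usual way to avoid the oversubscription warned about here when a threaded BLAS is combined with a parallel programming package.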

You could also try using a GPU. They're obviously good at performing matrix multiplications.
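
As one hypothetical route (not something from the original answer), the gpuR package exposes OpenCL-backed matrices whose %*% is dispatched to the GPU; it requires OpenCL drivers, and double precision only works on devices that support it:

# Hypothetical GPU sketch using the gpuR package; requires OpenCL drivers,
# and type="double" needs a device with double-precision support.
library(gpuR)
gx <- gpuMatrix(rnorm(4096*4096), nrow = 4096, ncol = 4096, type = "double")
gy <- gpuMatrix(rnorm(4096*4096), nrow = 4096, ncol = 4096, type = "double")
system.time(gz <- gx %*% gy)     # multiplication runs on the GPU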


Update 2

To see if the problem is R specific, I suggest that you benchmark the dgemm function, which is the BLAS function used by R to implement matrix multiplication.

Here's a simple Fortran program to benchmark dgemm. I suggest executing it from multiple terminals in the same way that I described for benchmarking %*% in R:

      program main
      implicit none
      integer n, i, j
      integer*8 stime, etime
      parameter (n=4096)
      double precision a(n,n), b(n,n), c(n,n)
      do i = 1, n
        do j = 1, n
          a(i,j) = (i-1) * n + j
          b(i,j) = -((i-1) * n + j)
          c(i,j) = 0.0d0
        end do
      end do
      stime = time8()
      call dgemm('N','N',n,n,n,1.0d0,a,n,b,n,0.0d0,c,n)
      etime = time8()
      print *, etime - stime
      end

On my Linux machine, one instance runs in 82 seconds, while four instances run in 116 seconds. This is consistent with the results that I see in R and with my guess that this is a memory bandwidth problem.

You can also link this against different BLAS libraries to see which implementation works better on your machine.

You might also get some useful information about the memory bandwidth of your virtual machine network using pmbw - Parallel Memory Bandwidth Benchmark, although I've never used it.
