Dramatic slow-down when executing multiple processes at the same time

Problem Description

I wrote a very simple piece of code, in both Fortran and Python, that just sums arrays. When I submit multiple (independent) jobs from the shell, there is a dramatic slow-down as soon as more than one job runs at the same time.

The Fortran version of my code is as follows:

program main
implicit none
real*8 begin, end, Ht(2, 2), ls(4)
integer i, j, k, ii, jj, kk
integer,parameter::N_tiles = 20
integer,parameter::N_tilings = 100
integer,parameter::max_t_steps = 50
real*8,dimension(N_tiles*N_tilings,max_t_steps,5)::test_e, test_theta
real*8 rand_val

call random_seed()
do i = 1, N_tiles*N_tilings
  do j = 1, max_t_steps
    do k = 1, 5
      call random_number(rand_val)
      test_e(i, j, k) = rand_val
      call random_number(rand_val)
      test_theta(i, j, k) = rand_val
    end do
  end do
end do

call CPU_TIME(begin)
do i = 1, 1001
  do j = 1, 50
    test_theta = test_theta+0.5d0*test_e
  end do
end do
call CPU_TIME(end)

write(*, *) 'total time cost is : ', end-begin

end program main

The shell script is as follows:

#!/bin/bash
gfortran -o result test.f90

nohup ./result &
nohup ./result &
nohup ./result &

As we can see, the main operation is the summation of the arrays test_theta and test_e. These arrays are not large (approximately 3 MB) and my computer has plenty of memory for this job. My workstation has 6 cores with 12 threads. I tried submitting 1, 2, 3, 4 and 5 jobs at once from the shell, and the measured times are as follows:

| #jobs   |  1   |   2   |   3    |  4    |  5   |
| time(s) |  21  |   31  |   161  |  237  |  357 | 
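
For reference, this measurement can also be automated. The following is a minimal Python sketch, assuming the benchmark has been compiled to ./result as in the shell script above (the script name and the default job count are just illustrative); it launches k copies at the same time and records the wall-clock time until the slowest one finishes.

import subprocess
import sys
import time

# Number of concurrent copies to launch (default 3), e.g. "python run_jobs.py 4".
n_jobs = int(sys.argv[1]) if len(sys.argv) > 1 else 3

start = time.perf_counter()
# Launch all copies at (almost) the same moment, like the nohup lines above.
procs = [subprocess.Popen(["./result"]) for _ in range(n_jobs)]
# Wait for every copy to finish before stopping the clock.
for p in procs:
    p.wait()
elapsed = time.perf_counter() - start

print("{} concurrent jobs finished in {:.1f} s".format(n_jobs, elapsed))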

I expected the time for n simultaneous jobs to be the same as for a single job as long as n is smaller than the number of cores, which is 6 on my machine. However, there is a dramatic slow-down.

The problem persists when I use Python to implement the same task:

import numpy as np 
import time

N_tiles = 20
N_tilings = 100
max_t_steps = 50
theta = np.ones((N_tiles*N_tilings, max_t_steps, 5), dtype=np.float64)
e = np.ones((N_tiles*N_tilings, max_t_steps, 5), dtype=np.float64)

begin = time.perf_counter()  # time.clock() is deprecated and was removed in Python 3.8

for i in range(1001):
    for j in range(50):
        theta += 0.5*e

end = time.perf_counter()
print('total time cost is {} s'.format(end-begin))

I don't know the reason, and I wonder whether it is related to the size of the CPU's L3 cache, i.e. whether the cache is too small for several such jobs at once. Maybe it is also related to the so-called "false sharing" problem. How can I fix this?

This question is related to an earlier one, dramatic slow down using multiprocess and numpy in python; here I just post a simple and typical example.

Answer

The code is likely slow when several copies run at once because more and more data must flow through the memory bus, which has limited bandwidth.
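
A rough back-of-the-envelope estimate, using only the constants from the question's code (the actual cache size of the machine is not known), illustrates the scale of the traffic:

# Back-of-the-envelope traffic estimate, based on the constants in the benchmark above.
N_tiles, N_tilings, max_t_steps = 20, 100, 50

elements_per_array = N_tiles * N_tilings * max_t_steps * 5
bytes_per_array = elements_per_array * 8                     # float64
# Each pass reads test_theta and test_e and writes test_theta back: ~3 arrays of traffic.
bytes_per_pass = 3 * bytes_per_array
passes = 1001 * 50
total_traffic = passes * bytes_per_pass

print("one array          : {:.1f} MB".format(bytes_per_array / 1e6))   # ~4 MB
print("traffic per pass   : {:.1f} MB".format(bytes_per_pass / 1e6))    # ~12 MB
print("traffic per process: {:.0f} GB".format(total_traffic / 1e9))     # ~600 GB

Each process therefore streams on the order of 600 GB through the memory bus, and its roughly 8 MB working set (two arrays) is already comparable to the L3 cache of a typical 6-core desktop CPU; several such processes running at once compete for both the cache and the available bandwidth.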

If instead you run just one process, which works on only one pair of arrays, but enable OpenMP threading, it can be made faster:

integer*8 :: begin, end, rate
...

call system_clock(count_rate=rate)
call system_clock(count=begin)

!$omp parallel do
do i = 1, 1001
  do j = 1, 50
    test_theta = test_theta+0.5d0*test_e
  end do
end do
!$omp end parallel do

call system_clock(count=end)
write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate

On a quad-core CPU:

> gfortran -O3 testperformance.f90 -o result
> ./result 
 total time cost is :    15.135917384000001
> gfortran -O3 testperformance.f90 -fopenmp -o result
> ./result 
 total time cost is :    3.9464441830000001
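
Note that this version divides the iterations of the outer loop among the OpenMP threads of a single process, so all threads share the same pair of arrays: roughly the same amount of data as in the single-job case moves over the memory bus, but more cores contribute to the work. Launching several independent processes, by contrast, makes each of them push its own private copies of the arrays through the same bus at the same time.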
