Plain vs. allocatable/pointer arrays, Fortran advice?

Question

I wrote the following contrived matrix-multiplication example just to examine how declaring different types of arrays affects performance. To my surprise, plain arrays with sizes known at declaration performed worse than allocatable/pointer arrays. I thought allocatable was only needed for large arrays that don't fit on the stack. Here are the code and timings for both the gfortran and Intel Fortran compilers, run on Windows 10 with the compiler flags -Ofast and -fast, respectively.

program matrix_multiply
   implicit none
   integer, parameter :: n = 1500
   real(8) :: a(n,n), b(n,n), c(n,n), aT(n,n)                 ! plain arrays 
   integer :: i, j, k, ts, te, count_rate, count_max
   real(8) :: tmp

   ! real(8), allocatable :: A(:,:), B(:,:), C(:,:), aT(:,:)  ! allocatable arrays
   ! allocate ( a(n,n), b(n,n), c(n,n), aT(n,n) )

   do i = 1,n
      do j = 1,n
         a(i,j) = 1.d0/n/n * (i-j) * (i+j)
         b(i,j) = 1.d0/n/n * (i-j) * (i+j)
      end do 
   end do 

   ! transpose for cache-friendliness   
   do i = 1,n
      do j = 1,n
         aT(j,i) = a(i,j)
      end do 
   end do 

   call system_clock(ts, count_rate, count_max)
   do i = 1,n
      do j = 1,n
         tmp = 0 
         do k = 1,n
            tmp = tmp + aT(k,i) * b(k,j)
         end do
         c(i,j) = tmp
      end do
   end do
   call system_clock(te)
   print '(4G0)', "Elapsed time: ", real(te-ts)/count_rate,', c_(n/2+1) = ', c(n/2+1,n/2+1)    
end program matrix_multiply

The timings are as follows:

! Intel Fortran
! -------------
Elapsed time: 1.546000, c_(n/2+1) = -143.8334 ! Plain Arrays
Elapsed time: 1.417000, c_(n/2+1) = -143.8334 ! Allocatable Arrays  

! gfortran:
! -------------
Elapsed time: 1.827999, c_(n/2+1) = -143.8334 ! Plain Arrays 
Elapsed time: 1.702999, c_(n/2+1) = -143.8334 ! Allocatable Arrays

My question is: why does this happen? Do allocatable arrays give the compiler more guarantees, so that it can optimize better? What is the best general advice for dealing with fixed-size arrays in Fortran?

At the risk of lengthening the question, here is another example where the Intel Fortran compiler exhibits the same behavior:

program testArrays
  implicit none
  integer, parameter :: m = 1223, n = 2015 
  real(8), parameter :: pi = acos(-1.d0)
  real(8) :: a(m,n)
  real(8), allocatable :: b(:,:)
  real(8), pointer :: c(:,:)
  integer :: i, sz = min(m, n), t0, t1, count_rate, count_max

  allocate( b(m,n), c(m,n) )
  call random_seed()
  call random_number(a)
  call random_number(b)
  call random_number(c)

  call system_clock(t0, count_rate, count_max)
    do i=1,1000
      call doit(a,sz)
    end do 
  call system_clock(t1)
  print '(4g0)', 'Time plain: ', real(t1-t0)/count_rate, ',  sum 3x3 = ', sum( a(1:3,1:3) )

  call system_clock(t0)
    do i=1,1000
      call doit(b,sz)
    end do 
  call system_clock(t1)
  print '(4g0)', 'Time alloc: ', real(t1-t0)/count_rate, ',  sum 3x3 = ', sum( b(1:3,1:3) )

  call system_clock(t0)
    do i=1,1000 
      call doitp(c,sz)
    end do 
  call system_clock(t1)
  print '(4g0)', 'Time p.ptr: ', real(t1-t0)/count_rate, ',  sum 3x3 = ', sum( c(1:3,1:3) )

  contains 
  subroutine doit(a,sz)
    real(8) :: a(:,:)
    integer :: sz 
    a(1:sz,1:sz) = sin(2*pi*a(1:sz,1:sz))/(a(1:sz,1:sz)+1)
  end

  subroutine doitp(a,sz)
    real(8), pointer :: a(:,:)
    integer :: sz
    a(1:sz,1:sz) = sin(2*pi*a(1:sz,1:sz))/(a(1:sz,1:sz)+1)
  end    
end program testArrays 

ifort timings:

Time plain: 2.857000,  sum 3x3 = -.9913536
Time alloc: 2.750000,  sum 3x3 = .4471794
Time p.ptr: 2.786000,  sum 3x3 = 2.036269  

gfortran timings, however, are much longer but follow my expectation:

Time plain: 51.5600014,  sum 3x3 = 6.2749456118192093
Time alloc: 54.0300007,  sum 3x3 = 6.4144775892064283
Time p.ptr: 54.1900034,  sum 3x3 = -2.1546109819149963


Recommended answer

This is not an answer to why you get what you observe, but rather a report of disagreement with your observations. Your code,

program matrix_multiply
   implicit none
   integer, parameter :: n = 1500
  !real(8) :: a(n,n), b(n,n), c(n,n), aT(n,n)                 ! plain arrays 
   integer :: i, j, k, ts, te, count_rate, count_max
   real(8) :: tmp

   real(8), allocatable :: A(:,:), B(:,:), C(:,:), aT(:,:)  ! allocatable arrays
   allocate ( a(n,n), b(n,n), c(n,n), aT(n,n) )

   do i = 1,n
      do j = 1,n
         a(i,j) = 1.d0/n/n * (i-j) * (i+j)
         b(i,j) = 1.d0/n/n * (i-j) * (i+j)
      end do 
   end do 

   ! transpose for cache-friendliness   
   do i = 1,n
      do j = 1,n
         aT(j,i) = a(i,j)
      end do 
   end do 

   call system_clock(ts, count_rate, count_max)
   do i = 1,n
      do j = 1,n
         tmp = 0 
         do k = 1,n
            tmp = tmp + aT(k,i) * b(k,j)
         end do
         c(i,j) = tmp
      end do
   end do
   call system_clock(te)
   print '(4G0)', "Elapsed time: ", real(te-ts)/count_rate,', c_(n/2+1) = ', c(n/2+1,n/2+1)    
end program matrix_multiply

compiled with Intel Fortran compiler 18.0.2 on Windows and optimization flags turned on,

ifort /standard-semantics /F0x1000000000 /O3 /Qip /Qipo /Qunroll /Qunroll-aggressive /inline:all /Ob2 main.f90 -o run.exe

gives, in fact, the opposite of what you observe:

Elapsed time: 1.580000, c_(n/2+1) = -143.8334   ! plain arrays
Elapsed time: 1.560000, c_(n/2+1) = -143.8334   ! plain arrays
Elapsed time: 1.555000, c_(n/2+1) = -143.8334   ! plain arrays
Elapsed time: 1.588000, c_(n/2+1) = -143.8334   ! plain arrays
Elapsed time: 1.551000, c_(n/2+1) = -143.8334   ! plain arrays
Elapsed time: 1.566000, c_(n/2+1) = -143.8334   ! plain arrays
Elapsed time: 1.555000, c_(n/2+1) = -143.8334   ! plain arrays

Elapsed time: 1.634000, c_(n/2+1) = -143.8334   ! allocatable arrays
Elapsed time: 1.634000, c_(n/2+1) = -143.8334   ! allocatable arrays
Elapsed time: 1.602000, c_(n/2+1) = -143.8334   ! allocatable arrays
Elapsed time: 1.623000, c_(n/2+1) = -143.8334   ! allocatable arrays
Elapsed time: 1.597000, c_(n/2+1) = -143.8334   ! allocatable arrays
Elapsed time: 1.607000, c_(n/2+1) = -143.8334   ! allocatable arrays
Elapsed time: 1.617000, c_(n/2+1) = -143.8334   ! allocatable arrays
Elapsed time: 1.606000, c_(n/2+1) = -143.8334   ! allocatable arrays
Elapsed time: 1.626000, c_(n/2+1) = -143.8334   ! allocatable arrays
Elapsed time: 1.614000, c_(n/2+1) = -143.8334   ! allocatable arrays

As you can see, the allocatable arrays are in fact slightly slower on average (roughly 1.62 s versus 1.57 s), which is what I expected to see and the opposite of what you observed. The only source of difference that I can see is the optimization flags used, though I am not sure how that could make a difference. Perhaps you could run your tests with no optimization and at several different optimization levels, and see whether the performance difference stays consistent across all of them. For more information about the meaning of the optimization flags used, see Intel's reference page.
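Because the differences in question are only a few percent, it helps to repeat the timed region and compare minimum and mean times instead of a single run. The following is a minimal sketch of that idea, not the setup used for the timings above. The matrix size n = 300, the repetition count nrep, and the final comparison against matmul are assumptions added for illustration. It reuses the allocatable variant of the kernel from the question.

```fortran
program timing_harness
   use, intrinsic :: iso_fortran_env, only: real64, int64
   implicit none
   integer, parameter :: n = 300     ! assumption: smaller than the question's 1500 so the sketch runs quickly
   integer, parameter :: nrep = 5    ! assumption: number of repetitions of the timed region
   real(real64), allocatable :: a(:,:), b(:,:), c(:,:), aT(:,:)
   real(real64) :: tmp, t, tmin, tsum
   integer(int64) :: ts, te, count_rate
   integer :: i, j, k, rep

   allocate (a(n,n), b(n,n), c(n,n), aT(n,n))

   ! same initial values as in the question
   do j = 1, n
      do i = 1, n
         a(i,j) = 1.0_real64/n/n * (i-j) * (i+j)
      end do
   end do
   b = a
   aT = transpose(a)

   tmin = huge(tmin)
   tsum = 0.0_real64
   do rep = 1, nrep
      call system_clock(ts, count_rate)
      do i = 1, n
         do j = 1, n
            tmp = 0.0_real64
            do k = 1, n
               tmp = tmp + aT(k,i) * b(k,j)
            end do
            c(i,j) = tmp
         end do
      end do
      call system_clock(te)
      t = real(te - ts, real64) / real(count_rate, real64)
      tmin = min(tmin, t)
      tsum = tsum + t
      ! printing an element of c makes the result of each repetition observable
      print '(a,i0,a,f0.6,a,g0)', 'run ', rep, ': ', t, ' s, c(n/2+1,n/2+1) = ', c(n/2+1,n/2+1)
   end do

   print '(a,f0.6,a,f0.6,a)', 'min = ', tmin, ' s, mean = ', tsum/nrep, ' s'

   ! sanity check of the hand-written kernel against the intrinsic matmul
   if (maxval(abs(c - matmul(a, b))) > 1.0e-9_real64) error stop 'result check failed'
   print '(a)', 'result check passed'
end program timing_harness
```

To compare array kinds, one would swap the allocatable declaration for the plain one, rebuild with the same flags, and compare the min and mean values of the two builds.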

Also, avoid real(8) in variable declarations. The kind value 8 is processor-dependent rather than fixed by the standard, so such declarations are not portable and are potentially problematic. A more consistent way, according to the Fortran standard, is to take the kind from the iso_fortran_env intrinsic module, like:

!...
use, intrinsic :: iso_fortran_env, only: real64, int32
integer(int32), parameter :: n=100
real(real64) :: a(n)
!...
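As a small illustration of why the named constants are preferable, the sketch below prints the numeric kind value next to the storage size in bits. The kind numbers can differ between compilers, while real64 and int64 name kinds whose storage size is 64 bits. The program name, variable names, and the selected_real_kind alternative are assumptions added for illustration.

```fortran
program kind_check
   use, intrinsic :: iso_fortran_env, only: real32, real64, int32, int64
   implicit none
   ! alternative: request a kind by decimal precision and exponent range
   integer, parameter :: dp = selected_real_kind(15, 307)
   real(real32)   :: x32
   real(real64)   :: x64
   integer(int32) :: i32
   integer(int64) :: i64

   ! the kind numbers printed here are processor-dependent; the bit sizes are not
   print '(a,i0,a,i0)', 'real32: kind = ', real32, ', bits = ', storage_size(x32)
   print '(a,i0,a,i0)', 'real64: kind = ', real64, ', bits = ', storage_size(x64)
   print '(a,i0,a,i0)', 'int32:  kind = ', int32,  ', bits = ', storage_size(i32)
   print '(a,i0,a,i0)', 'int64:  kind = ', int64,  ', bits = ', storage_size(i64)
   print '(a,i0,a,i0)', 'dp:     kind = ', dp, ', decimal digits = ', precision(1.0_dp)

   if (storage_size(x32) /= 32 .or. storage_size(x64) /= 64) error stop 'unexpected real size'
   if (storage_size(i32) /= 32 .or. storage_size(i64) /= 64) error stop 'unexpected integer size'
   if (precision(1.0_dp) < 15) error stop 'unexpected precision'
   print '(a)', 'kind checks passed'
end program kind_check
```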

This intrinsic module provides, among others, the following kind constants:

   int8 ! 8-bit integer
  int16 ! 16-bit integer
  int32 ! 32-bit integer
  int64 ! 64-bit integer
 real32 ! 32-bit real
 real64 ! 64-bit real
real128 ! 128-bit real 

So, for example, if you wanted to declare a complex variable with components of 64-bit kind, you could write:

program complex
    use, intrinsic :: iso_fortran_env, only: RK => real64, output_unit
    ! the intrinsic attribute above is not essential, but recommended, so this would be also valid:
    ! use iso_fortran_env, only: RK => real64, output_unit
    complex(RK) :: z = (1._RK, 2._RK)
    write(output_unit,"(*(g0,:,' '))") "Hello World! This is a complex variable:", z
end program complex

which gives:

$gfortran -std=f2008 *.f95 -o main
$main
Hello World! This is a complex variable: 1.0000000000000000 2.0000000000000000

Note that this requires a Fortran 2008 compliant compiler. There are also other functions and entities in iso_fortran_env, such as output_unit, which is the unit number of the preconnected standard output unit (the same one used by print, or by write with a unit specifier of *), as well as several others such as compiler_version() and compiler_options().
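Since the discussion above hinges on which compiler and flags produced each timing, these two inquiry functions offer a convenient way to record that information next to the results. A minimal sketch (the program name is an assumption; the printed text depends entirely on the compiler and the command line used):

```fortran
program build_info
   use, intrinsic :: iso_fortran_env, only: compiler_version, compiler_options, output_unit
   implicit none
   ! both functions return character strings whose content is compiler-specific
   write (output_unit, '(2a)') 'Compiler: ', compiler_version()
   write (output_unit, '(2a)') 'Options:  ', compiler_options()
end program build_info
```

Printing these two lines at the start of a benchmark makes it easier to compare timings reported by different people.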
