Plain vs. allocatable/pointer arrays, Fortran advice?
Question
I wrote the following contrived matrix multiplication example just to examine how declaring different types of arrays affects performance. To my surprise, I found that plain arrays with sizes known at declaration perform worse than both allocatable and pointer arrays. I thought allocatable was only needed for large arrays that don't fit on the stack. Here are the code and timings for both the gfortran and Intel Fortran compilers. Both were run on Windows 10 with the compiler flags -Ofast and -fast, respectively.
program matrix_multiply
    implicit none
    integer, parameter :: n = 1500
    real(8) :: a(n,n), b(n,n), c(n,n), aT(n,n) ! plain arrays
    integer :: i, j, k, ts, te, count_rate, count_max
    real(8) :: tmp
    ! real(8), allocatable :: A(:,:), B(:,:), C(:,:), aT(:,:) ! allocatable arrays
    ! allocate ( a(n,n), b(n,n), c(n,n), aT(n,n) )
    do i = 1,n
        do j = 1,n
            a(i,j) = 1.d0/n/n * (i-j) * (i+j)
            b(i,j) = 1.d0/n/n * (i-j) * (i+j)
        end do
    end do
    ! transpose for cache-friendliness
    do i = 1,n
        do j = 1,n
            aT(j,i) = a(i,j)
        end do
    end do
    call system_clock(ts, count_rate, count_max)
    do i = 1,n
        do j = 1,n
            tmp = 0
            do k = 1,n
                tmp = tmp + aT(k,i) * b(k,j)
            end do
            c(i,j) = tmp
        end do
    end do
    call system_clock(te)
    print '(4G0)', "Elapsed time: ", real(te-ts)/count_rate,', c_(n/2+1) = ', c(n/2+1,n/2+1)
end program matrix_multiply
The timings are as follows:
! Intel Fortran
! -------------
Elapsed time: 1.546000, c_(n/2+1) = -143.8334 ! Plain Arrays
Elapsed time: 1.417000, c_(n/2+1) = -143.8334 ! Allocatable Arrays
! gfortran:
! -------------
Elapsed time: 1.827999, c_(n/2+1) = -143.8334 ! Plain Arrays
Elapsed time: 1.702999, c_(n/2+1) = -143.8334 ! Allocatable Arrays
My question is: why does this happen? Do allocatable arrays give the compiler more guarantees, so that it can optimize better? What is the best general advice when dealing with fixed-size arrays in Fortran?
At the risk of lengthening the question, here is another example where the Intel Fortran compiler exhibits the same behavior:
program testArrays
    implicit none
    integer, parameter :: m = 1223, n = 2015
    real(8), parameter :: pi = acos(-1.d0)
    real(8) :: a(m,n)
    real(8), allocatable :: b(:,:)
    real(8), pointer :: c(:,:)
    integer :: i, sz = min(m, n), t0, t1, count_rate, count_max
    allocate( b(m,n), c(m,n) )
    call random_seed()
    call random_number(a)
    call random_number(b)
    call random_number(c)
    call system_clock(t0, count_rate, count_max)
    do i=1,1000
        call doit(a,sz)
    end do
    call system_clock(t1)
    print '(4g0)', 'Time plain: ', real(t1-t0)/count_rate, ', sum 3x3 = ', sum( a(1:3,1:3) )
    call system_clock(t0)
    do i=1,1000
        call doit(b,sz)
    end do
    call system_clock(t1)
    print '(4g0)', 'Time alloc: ', real(t1-t0)/count_rate, ', sum 3x3 = ', sum( b(1:3,1:3) )
    call system_clock(t0)
    do i=1,1000
        call doitp(c,sz)
    end do
    call system_clock(t1)
    print '(4g0)', 'Time p.ptr: ', real(t1-t0)/count_rate, ', sum 3x3 = ', sum( c(1:3,1:3) )
contains
    ! assumed-shape dummy argument: accepts both the plain and the allocatable array
    subroutine doit(a,sz)
        real(8) :: a(:,:)
        integer :: sz
        a(1:sz,1:sz) = sin(2*pi*a(1:sz,1:sz))/(a(1:sz,1:sz)+1)
    end subroutine doit
    ! pointer dummy argument: same operation on the pointer array
    subroutine doitp(a,sz)
        real(8), pointer :: a(:,:)
        integer :: sz
        a(1:sz,1:sz) = sin(2*pi*a(1:sz,1:sz))/(a(1:sz,1:sz)+1)
    end subroutine doitp
end program testArrays
ifort timings:
Time plain: 2.857000, sum 3x3 = -.9913536
Time alloc: 2.750000, sum 3x3 = .4471794
Time p.ptr: 2.786000, sum 3x3 = 2.036269
gfortran timings, however, are much longer but follow my expectation:
Time plain: 51.5600014, sum 3x3 = 6.2749456118192093
Time alloc: 54.0300007, sum 3x3 = 6.4144775892064283
Time p.ptr: 54.1900034, sum 3x3 = -2.1546109819149963
Recommended answer
This is not an answer to why you get what you observe, but rather a report of disagreement with your observations. Your code,
program matrix_multiply
    implicit none
    integer, parameter :: n = 1500
    !real(8) :: a(n,n), b(n,n), c(n,n), aT(n,n) ! plain arrays
    integer :: i, j, k, ts, te, count_rate, count_max
    real(8) :: tmp
    real(8), allocatable :: A(:,:), B(:,:), C(:,:), aT(:,:) ! allocatable arrays
    allocate ( a(n,n), b(n,n), c(n,n), aT(n,n) )
    do i = 1,n
        do j = 1,n
            a(i,j) = 1.d0/n/n * (i-j) * (i+j)
            b(i,j) = 1.d0/n/n * (i-j) * (i+j)
        end do
    end do
    ! transpose for cache-friendliness
    do i = 1,n
        do j = 1,n
            aT(j,i) = a(i,j)
        end do
    end do
    call system_clock(ts, count_rate, count_max)
    do i = 1,n
        do j = 1,n
            tmp = 0
            do k = 1,n
                tmp = tmp + aT(k,i) * b(k,j)
            end do
            c(i,j) = tmp
        end do
    end do
    call system_clock(te)
    print '(4G0)', "Elapsed time: ", real(te-ts)/count_rate,', c_(n/2+1) = ', c(n/2+1,n/2+1)
end program matrix_multiply
compiled with Intel Fortran compiler 18.0.2 on Windows with optimization flags turned on,
ifort /standard-semantics /F0x1000000000 /O3 /Qip /Qipo /Qunroll /Qunroll-aggressive /inline:all /Ob2 main.f90 -o run.exe
gives, in fact, the opposite of what you observe:
Elapsed time: 1.580000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.560000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.555000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.588000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.551000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.566000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.555000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.634000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.634000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.602000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.623000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.597000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.607000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.617000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.606000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.626000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.614000, c_(n/2+1) = -143.8334 ! allocatable arrays
As you can see, the allocatable arrays are in fact slightly slower on average, which is what I expected to see and which contradicts your observations. The only source of difference that I can see is the optimization flags used, though I am not sure how that could make a difference. Perhaps you could run your tests with no optimization and with several different optimization levels, and see whether the performance difference is consistent across all of them. For more information about the meaning of the optimization flags used, see Intel's reference page.
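Since the two sets of timings above differ by only a few percent, a single run per variant may not be enough to separate them from run-to-run noise. The following is a minimal sketch, not the code used for the numbers above, that repeats the timed kernel several times inside one program and reports the minimum, mean, and maximum. The program name, the repetition count nrep, and the reduced size n = 500 are assumptions chosen only to keep the run short.

```fortran
program time_repeats
    use, intrinsic :: iso_fortran_env, only: int64, real64
    implicit none
    integer, parameter :: n = 500     ! assumption: smaller than the original 1500
    integer, parameter :: nrep = 5    ! assumption: hypothetical number of repetitions
    real(real64), allocatable :: at(:,:), b(:,:), c(:,:)
    real(real64) :: tmp, t(nrep)
    integer(int64) :: ts, te, count_rate
    integer :: i, j, k, r

    allocate ( at(n,n), b(n,n), c(n,n) )
    call random_number(at)
    call random_number(b)

    do r = 1, nrep
        call system_clock(ts, count_rate)
        ! same kernel as in the question: c = transpose(at) * b
        do i = 1, n
            do j = 1, n
                tmp = 0
                do k = 1, n
                    tmp = tmp + at(k,i) * b(k,j)
                end do
                c(i,j) = tmp
            end do
        end do
        call system_clock(te)
        t(r) = real(te - ts, real64) / real(count_rate, real64)
    end do

    print '(6g0)', 'min = ', minval(t), ', mean = ', sum(t)/nrep, ', max = ', maxval(t)
    ! print one element so the compiler cannot discard the computation
    print '(2g0)', 'c(n/2+1,n/2+1) = ', c(n/2+1,n/2+1)
end program time_repeats
```

Swapping the allocatable declaration for a plain one, as in the question, and comparing the reported ranges rather than single values gives a better idea of whether a difference is real.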
Also, do not use real(8) for variable declarations. The kind value 8 is not defined by the Fortran standard; its meaning is compiler-dependent, so the declaration is non-portable and potentially problematic. A more consistent way, according to the Fortran standard, is to use the iso_fortran_env intrinsic module, like:
!...
use, intrinsic :: iso_fortran_env, only: real64, int32
integer(int32), parameter :: n=100
real(real64) :: a(n)
!...
This intrinsic module provides the following kind constants:
int8 ! 8-bit integer
int16 ! 16-bit integer
int32 ! 32-bit integer
int64 ! 64-bit integer
real32 ! 32-bit real
real64 ! 64-bit real
real128 ! 128-bit real
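As a small illustration of the kind constants listed above, the sketch below prints the storage size in bits of variables declared with them, together with the numeric kind values themselves. The program and variable names are made up for this example. The numeric kind values are compiler-dependent, which is exactly why a literal such as 8 is not portable.

```fortran
program show_kinds
    use, intrinsic :: iso_fortran_env, only: int32, int64, real32, real64
    implicit none
    integer(int32) :: i32
    integer(int64) :: i64
    real(real32)   :: x32
    real(real64)   :: x64

    ! storage_size (Fortran 2008) returns the size in bits
    print '(a,i0)', 'bits in integer(int32): ', storage_size(i32)
    print '(a,i0)', 'bits in integer(int64): ', storage_size(i64)
    print '(a,i0)', 'bits in real(real32):   ', storage_size(x32)
    print '(a,i0)', 'bits in real(real64):   ', storage_size(x64)

    ! the kind numbers behind the named constants vary between compilers
    print '(a,i0,a,i0)', 'kind values: real32 = ', real32, ', real64 = ', real64
    print '(a,i0)', 'decimal precision of real64: ', precision(x64)
end program show_kinds
```

If a compiler does not support one of these kinds, the corresponding constant is negative and the declaration fails at compile time instead of silently selecting a different precision.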
So, for example, if you wanted to declare a complex variable with components of 64-bit kind, you could write:
program complex
    use, intrinsic :: iso_fortran_env, only: RK => real64, output_unit
    ! the intrinsic attribute above is not essential, but recommended, so this would be also valid:
    ! use iso_fortran_env, only: RK => real64, output_unit
    complex(RK) :: z = (1._RK, 2._RK)
    write(output_unit,"(*(g0,:,' '))") "Hello World! This is a complex variable:", z
end program complex
which gives:
$gfortran -std=f2008 *.f95 -o main
$main
Hello World! This is a complex variable: 1.0000000000000000 2.0000000000000000
Note that this requires a Fortran 2008 compliant compiler. There are also other functions and entities in iso_fortran_env, like output_unit, which is the unit number of the preconnected standard output unit (the same one that is used by print, or by write with a unit specifier of *), as well as several others like compiler_version(), compiler_options(), and more.
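A short sketch of the two inquiry functions mentioned above may be useful when comparing timings across compilers and flags, since it records exactly what was used to build the executable. The program name is made up for this example, and the strings printed depend entirely on the compiler and the command line.

```fortran
program show_compiler
    use, intrinsic :: iso_fortran_env, only: compiler_version, compiler_options, output_unit
    implicit none

    ! both functions return a character string describing the build
    write(output_unit, '(a)') 'Compiler: ' // compiler_version()
    write(output_unit, '(a)') 'Options:  ' // compiler_options()
end program show_compiler
```

Printing these two lines next to each timing result makes it easier to tell afterwards which run used which optimization flags.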