FORTRAN: Access array via pointer matrix, performance
I am considering using pointers here, but before I do, I have a performance concern. Suppose there is a 2D matrix like this:
0.0 0.0 0.0.....
0.0 0.7 0.5.....
0.0 0.5 0.8.....
0.0 0.3 0.8.....
.....
I need to calculate the gradient of this matrix. For each number, I therefore need the number itself as well as its 4 nearest neighbors in the 2D matrix. In addition, the first and last rows and columns are all 0.
I have two methods in mind:
1. Make such an NxN matrix directly and calculate the gradient, exactly as described. The memory use here is NxNxreal*8. The loop starts by calculating the (2,2) element, then (2,3), ...
2. Make an array of (N-2)x(N-2)+1 elements and an NxN pointer matrix (using a derived type at the moment). The (N-2)x(N-2) elements of the array store the numbers, excluding the 0.0s on the border. The last element of the array is 0.0. In the pointer matrix, all elements on the border point to the last element of the array, 0.0. The other pointers point to the elements they correspond to.
This raises a performance issue, since the matrix I am handling can be really huge, or maybe 3D.
For method 1, there is nothing to say, since it is just the straightforward method.
For method 2, I am wondering whether the compiler can handle this properly. As far as I understand, each FORTRAN pointer is like a structure. If that is the case, is a FORTRAN pointer slower than a C pointer, since it is not just a simple dereference? I am also wondering whether wrapping the pointer in a type decreases performance (the wrapper is needed to make a pointer matrix). Is there a particular reason I should give up method 2 because it is bound to be slower?
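To put a number on the memory side of method 2, here is a minimal sketch that asks the compiler how much storage one wrapped pointer takes compared with one data value. The program name, the size `n`, and the variable names are illustrative assumptions. The printed sizes are processor dependent, so none are quoted here.

```fortran
program pointer_storage
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000          ! illustrative matrix size (assumption)

  ! Same kind of wrapper as in method 2: one scalar pointer per matrix element.
  type pp
    real(dp), pointer :: ptr => null()
  end type pp

  type(pp) :: wrapper
  real(dp) :: value
  integer  :: bits_wrapper, bits_value

  value = 0.0_dp
  bits_wrapper = storage_size(wrapper)    ! bits used by one pointer-matrix element
  bits_value   = storage_size(value)      ! bits used by one data element

  print *, 'bits per wrapped pointer:', bits_wrapper
  print *, 'bits per real value     :', bits_value
  print *, 'n x n pointer matrix, MB:', real(n, dp)**2 * bits_wrapper / 8.0e6_dp
  print *, 'n x n data array, MB    :', real(n, dp)**2 * bits_value / 8.0e6_dp
end program pointer_storage
```

If the wrapped pointer turns out to be as large as the value it points to, the pointer matrix of method 2 costs at least as much memory as the padded array of method 1, on top of the data itself.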
Since this can be compiler dependent, let's take IVF on Windows, and gfortran and ifort on Linux, as examples.
UPDATE: I appreciate Stefan's code. I wrote one myself as well.
program stencil
  implicit none
  type pp
    real*8, pointer :: ptr
  endtype pp
  type(pp), allocatable :: parray(:,:)
  real*8, allocatable, target :: array(:)
  real*8, allocatable :: grad(:,:,:), direct(:,:)
  integer, parameter :: n = 5000
  integer :: i, j
  integer :: clock_rate, clock_start, clock_stop
  allocate(array(n**2+1))
  allocate(parray(0:n+1, 0:n+1))
  allocate(grad(2, n, n))
  call random_number(array)
  array(n**2+1) = 0
  do i = 0, n + 1
    parray(0,i)%ptr => array(n**2+1)
    parray(n+1,i)%ptr => array(n**2+1)
    parray(i,0)%ptr => array(n**2+1)
    parray(i,n+1)%ptr => array(n**2+1)
  enddo
  do i = 1, n
    do j = 1, n
      parray(i,j)%ptr => array((i-1) * n + j)
    enddo
  enddo
  ! now stencil
  call system_clock(count_rate=clock_rate)
  call system_clock(count=clock_start)
  do j = 1, n
    do i = 1, n
      grad(1, i, j) = (parray(i + 1,j)%ptr - parray(i - 1,j)%ptr)/2.D0
      grad(2, i, j) = (parray(i,j + 1)%ptr - parray(i,j - 1)%ptr)/2.D0
    enddo
  enddo
  call system_clock(count=clock_stop)
  print *, "pointer, time cost= ", real(clock_stop-clock_start)/real(clock_rate)
  deallocate(array)
  deallocate(parray)
  allocate(direct(0:n+1, 0:n+1))
  call random_number(direct)
  do i = 0, n + 1
    direct(0,i) = 0
    direct(n+1,i) = 0
    direct(i,0) = 0
    direct(i,n+1) = 0
  enddo
  ! now stencil directly
  call system_clock(count_rate=clock_rate)
  call system_clock(count=clock_start)
  do j = 1, n
    do i = 1, n
      grad(1, i, j) = (direct(i + 1,j) - direct(i - 1,j))/2.D0
      grad(2, i, j) = (direct(i,j + 1) - direct(i,j - 1))/2.D0
    enddo
  enddo
  call system_clock(count=clock_stop)
  print *, "direct, time cost= ", real(clock_stop-clock_start)/real(clock_rate)
endprogram stencil
Result (-O0):
  pointer, time cost= 2.170000
  direct, time cost= 1.127000
Result (-O2):
  pointer, time cost= 0.5110000
  direct, time cost= 9.4999999E-02
So the FORTRAN pointer version is much slower, as Stefan pointed out earlier. Now I am wondering whether there is room for improvement. As far as I know, if I did this in C, the difference should not be this large.
First, I have to apologize, because I misunderstood the way pointers work in Fortran...
In the end, I was so intrigued by the topic that I created a test of my own. It is based on an array that is surrounded by a layer of zeros.
Declaration:
real, dimension(:,:), allocatable, target :: array
real, dimension(:,:,:), allocatable :: res
real, dimension(:,:), pointer :: p1, p2, p3, p4
allocate(array(0:n+1, 0:n+1), source=0.)
allocate(res(n,n,2), source=0.)
Now the methods:
Loops:
do j = 1, n
  do i = 1, n
    res(i,j,1) = array(i+1,j) - array(i-1,j)
    res(i,j,2) = array(i,j+1) - array(i,j-1)
  end do
end do
Array assignment:
res(:,:,1) = array(2:n+1,1:n) - array(0:n-1,1:n)
res(:,:,2) = array(1:n,2:n+1) - array(1:n,0:n-1)
Pointers:
p1 => array(0:n-1,1:n)
p2 => array(1:n,2:n+1)
p3 => array(2:n+1,1:n)
p4 => array(1:n,0:n-1)
res(:,:,1) = p3 - p1
res(:,:,2) = p2 - p4
While the last two methods rely on the extra layer of zeros, the loops could instead use some conditionals to take care of the borders.
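A minimal sketch of what such conditionals could look like, assuming an unpadded `n` x `n` array (called `bare` here; the name, the size `n`, and the neighbor variables are made up for illustration). It treats any neighbor outside the array as zero and checks the result against the padded loop above.

```fortran
program stencil_no_halo
  implicit none
  integer, parameter :: n = 64            ! illustrative size (assumption)
  real, allocatable :: padded(:,:), bare(:,:), ref(:,:,:), res(:,:,:)
  real    :: left, right, down, up
  integer :: i, j

  allocate(padded(0:n+1, 0:n+1), source=0.)
  allocate(bare(n, n), ref(n, n, 2), res(n, n, 2))
  call random_number(bare)
  padded(1:n, 1:n) = bare                 ! same data, with a layer of zeros around it

  ! Reference: the padded loop, no conditionals needed.
  do j = 1, n
    do i = 1, n
      ref(i,j,1) = padded(i+1,j) - padded(i-1,j)
      ref(i,j,2) = padded(i,j+1) - padded(i,j-1)
    end do
  end do

  ! Unpadded array: a neighbor outside the array counts as zero.
  do j = 1, n
    do i = 1, n
      left = 0.; right = 0.; down = 0.; up = 0.
      if (i > 1) left  = bare(i-1,j)
      if (i < n) right = bare(i+1,j)
      if (j > 1) down  = bare(i,j-1)
      if (j < n) up    = bare(i,j+1)
      res(i,j,1) = right - left
      res(i,j,2) = up - down
    end do
  end do

  if (maxval(abs(res - ref)) > epsilon(1.0)) error stop 'unpadded loop disagrees'
  print *, 'max abs difference:', maxval(abs(res - ref))
end program stencil_no_halo
```

The branches sit in the innermost loop, so this trades the memory of the zero layer for extra work per element. Whether that is a good trade was not measured here.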
The timings are interesting:
loops: 0.17528710301849060
array: 0.21127231500577182
pointers: 0.21367537401965819
While the array and pointer assignments yield approximately the same timings, the loop construct is the fastest method. Mind the loop order, though: it made a factor of 5!
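The loop-order remark can be tried with a sketch like the following, which times the same stencil with `i` innermost (contiguous in Fortran's column-major storage) and with `j` innermost. The size `n` and all names are illustrative assumptions. No timings are quoted because they depend on the machine and compiler, and an optimizer that interchanges loops may make the difference disappear.

```fortran
program loop_order
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 2000          ! illustrative size (assumption)
  real, allocatable :: array(:,:), res_a(:,:,:), res_b(:,:,:)
  integer(int64) :: rate, t0, t1, t2
  integer :: i, j

  allocate(array(0:n+1, 0:n+1), source=0.)
  allocate(res_a(n, n, 2), source=0.)     ! source= also touches the memory up front
  allocate(res_b(n, n, 2), source=0.)
  call random_number(array(1:n, 1:n))     ! interior only; the border stays zero

  call system_clock(count_rate=rate)

  ! i innermost: consecutive iterations access neighboring memory locations.
  call system_clock(count=t0)
  do j = 1, n
    do i = 1, n
      res_a(i,j,1) = array(i+1,j) - array(i-1,j)
      res_a(i,j,2) = array(i,j+1) - array(i,j-1)
    end do
  end do
  call system_clock(count=t1)

  ! j innermost: consecutive iterations jump by a whole column.
  do i = 1, n
    do j = 1, n
      res_b(i,j,1) = array(i+1,j) - array(i-1,j)
      res_b(i,j,2) = array(i,j+1) - array(i,j-1)
    end do
  end do
  call system_clock(count=t2)

  if (any(res_a /= res_b)) error stop 'loop orders disagree'
  print *, 'i innermost:', real(t1 - t0) / real(rate), 's'
  print *, 'j innermost:', real(t2 - t1) / real(rate), 's'
end program loop_order
```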
UPDATE: I tried to squeeze a bit of performance out of your code and found one small thing. With -O2, your code runs in 0.95s and 0.30s (with n = 10000).
Transposing your matrix to get a more linear memory access yields a runtime of 0.50s for the pointer part:
parray(i,j)%ptr => array((j-1) * n + i)
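To see why this mapping gives more linear access, note that `(j-1)*n + i` is the column-major position of element `(i,j)`, so `parray(i,j)` and `parray(i+1,j)` point to neighboring elements of `array`. The sketch below, with an illustrative small `n` and made-up names such as `grad_ptr` and `grad_dir`, builds the pointer matrix with this mapping and checks that it produces the same gradient as a plain padded 2D array holding the same data.

```fortran
program transposed_mapping
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 200           ! illustrative size (assumption)
  type pp
    real(dp), pointer :: ptr => null()
  end type pp
  type(pp), allocatable :: parray(:,:)
  real(dp), allocatable, target :: array(:)
  real(dp), allocatable :: direct(:,:), grad_ptr(:,:,:), grad_dir(:,:,:)
  integer :: i, j

  allocate(array(n**2 + 1), parray(0:n+1, 0:n+1))
  allocate(direct(0:n+1, 0:n+1), source=0.0_dp)
  allocate(grad_ptr(2, n, n), grad_dir(2, n, n))
  call random_number(array)
  array(n**2 + 1) = 0.0_dp                ! shared zero for the border

  ! Border pointers all target the shared zero element.
  do i = 0, n + 1
    parray(0, i)%ptr   => array(n**2 + 1)
    parray(n+1, i)%ptr => array(n**2 + 1)
    parray(i, 0)%ptr   => array(n**2 + 1)
    parray(i, n+1)%ptr => array(n**2 + 1)
  end do

  ! Interior pointers use the column-major mapping from the answer.
  do j = 1, n
    do i = 1, n
      parray(i, j)%ptr => array((j-1) * n + i)
    end do
  end do

  ! The same data as a padded 2D array; reshape fills in column-major order.
  direct(1:n, 1:n) = reshape(array(1:n**2), [n, n])

  do j = 1, n
    do i = 1, n
      grad_ptr(1,i,j) = (parray(i+1,j)%ptr - parray(i-1,j)%ptr) / 2.0_dp
      grad_ptr(2,i,j) = (parray(i,j+1)%ptr - parray(i,j-1)%ptr) / 2.0_dp
      grad_dir(1,i,j) = (direct(i+1,j) - direct(i-1,j)) / 2.0_dp
      grad_dir(2,i,j) = (direct(i,j+1) - direct(i,j-1)) / 2.0_dp
    end do
  end do

  if (maxval(abs(grad_ptr - grad_dir)) > epsilon(1.0_dp)) error stop 'mapping mismatch'
  print *, 'pointer matrix and padded array agree'
end program transposed_mapping
```

The check only confirms that the two layouts describe the same matrix. It says nothing about speed, since every access through `parray` still goes through a pointer.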
IMHO, the problem is the missing information about the pointers, which forbids additional optimization. Using -O3 -fopt-info-missed, you get complaints about unknown alignment and non-consecutive accesses. The additional factor of 2 compared to my results should stem from the fact that you are using double precision, while my code is written in single precision.