FORTRAN: Access array via pointer matrix, performance


Problem description

I am having an issue with using pointers. Before I go ahead, I have a performance concern. Suppose there is a 2D matrix like this:

0.0  0.0  0.0.....
0.0  0.7  0.5.....
0.0  0.5  0.8.....
0.0  0.3  0.8.....

.....

I need to calculate the gradient of this matrix. Therefore, for each element, I need its 4 nearest neighbors in the 2D matrix. In addition, the first and last rows and columns are all 0.

Now I have two methods:

  1. Make such an NxN matrix directly and calculate the gradient, exactly following the description. Here the memory use is NxNxreal*8. The loop starts by calculating the (2,2) element, then (2,3), ...

  2. Make an (N-2)x(N-2)+1 array and an NxN pointer matrix (using a derived type at the moment). The (N-2)x(N-2) elements of the array store the numbers except the 0.0s on the border. The last element of the array is 0.0. In the pointer matrix, all the elements on the border point to the last element of the array, 0.0. The other pointers point to the elements they are supposed to point to.

Performance becomes an issue here, since the matrix that I am handling can be really huge, or maybe even 3D.

For method 1, there is nothing to say, since it is just the straightforward method.

For method 2, I am wondering whether the compiler can handle this properly. As far as I understand, each FORTRAN pointer is like a structure. If that is the case, is a FORTRAN pointer slower than a C pointer, since it is not just a simple dereference? I am also wondering whether wrapping the pointer in a type decreases performance (the wrapper is needed to make a pointer matrix). Is there a particular reason that I should give up method 2 because it should be slower?

Let's take IVF on Windows, and gfortran and ifort on Linux, as examples, since the answer may be compiler dependent.

UPDATE: I appreciate Stefan's code. I wrote one myself as well.

program stencil
    implicit none
    type pp
        real*8, pointer :: ptr
    endtype pp
    type(pp), allocatable :: parray(:,:)
    real*8, allocatable, target :: array(:)
    real*8, allocatable :: grad(:,:,:), direct(:,:)
    integer, parameter :: n = 5000
    integer :: i, j
    integer :: clock_rate, clock_start, clock_stop

    allocate(array(n**2+1))
    allocate(parray(0:n+1, 0:n+1))
    allocate(grad(2, n, n))
    call random_number(array)
    array(n**2+1) = 0
    do i = 0, n + 1
        parray(0,i)%ptr => array(n**2+1)
        parray(n+1,i)%ptr => array(n**2+1)
        parray(i,0)%ptr => array(n**2+1)
        parray(i,n+1)%ptr => array(n**2+1)
    enddo
    do i = 1, n
        do j = 1, n
            parray(i,j)%ptr => array((i-1) * n + j)
        enddo
    enddo
    ! now the stencil through the pointer matrix (timed)
    call system_clock(count_rate=clock_rate)
    call system_clock(count=clock_start)
    do j = 1, n
        do i = 1, n
            grad(1, i, j) = (parray(i + 1,j)%ptr - parray(i - 1,j)%ptr)/2.D0
            grad(2, i, j) = (parray(i,j + 1)%ptr - parray(i,j - 1)%ptr)/2.D0
        enddo
    enddo
    call system_clock(count=clock_stop)
    print *, "pointer, time cost= ", real(clock_stop-clock_start)/real(clock_rate)
    deallocate(array)
    deallocate(parray)
    allocate(direct(0:n+1, 0:n+1))
    call random_number(direct)
    do i = 0, n + 1
        direct(0,i) = 0
        direct(n+1,i) = 0
        direct(i,0) = 0
        direct(i,n+1) = 0
    enddo
    ! now the stencil directly on the padded array (timed)
    call system_clock(count_rate=clock_rate)
    call system_clock(count=clock_start)
    do j = 1, n
        do i = 1, n
            grad(1, i, j) = (direct(i + 1,j) - direct(i - 1,j))/2.D0
            grad(2, i, j) = (direct(i,j + 1) - direct(i,j - 1))/2.D0
        enddo
    enddo
    call system_clock(count=clock_stop)
    print *, "direct, time cost= ", real(clock_stop-clock_start)/real(clock_rate)
endprogram stencil

Result (O0):

pointer, time cost= 2.170000
direct, time cost= 1.127000

Result (O2):

pointer, time cost= 0.5110000
direct, time cost= 9.4999999E-02

So the FORTRAN pointer version is much slower, as Stefan pointed out earlier. Now I am wondering whether there is room for improvement. As far as I know, if I did this in C, the difference should not be this large.

Solution

First, I have to apologize, because I misunderstood the way pointers work in Fortran...


In the end, I was so intrigued by the topic that I created a test of my own. It is based on an array that is surrounded by a layer of zeros.

Declaration:

real, dimension(:,:), allocatable, target :: array
real, dimension(:,:,:), allocatable :: res
real, dimension(:,:), pointer :: p1, p2, p3, p4
allocate(array(0:n+1, 0:n+1), source=0.)
allocate(res(n,n,2), source=0.)

Now the methods:

Loops:

do j = 1, n
    do i = 1, n
        res(i,j,1) = array(i+1,j) - array(i-1,j)
        res(i,j,2) = array(i,j+1) - array(i,j-1)
    end do
end do

Array assignment:

res(:,:,1) = array(2:n+1,1:n) - array(0:n-1,1:n)
res(:,:,2) = array(1:n,2:n+1) - array(1:n,0:n-1)

Pointers:

p1 => array(0:n-1,1:n)
p2 => array(1:n,2:n+1)
p3 => array(2:n+1,1:n)
p4 => array(1:n,0:n-1)
res(:,:,1) = p3 - p1
res(:,:,2) = p2 - p4

While the last two methods rely on the extra layer of zeros, the loop version could instead use conditionals to take care of the borders.
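As a rough illustration of that remark, here is a minimal sketch of the loop variant working on an array without the zero layer, treating any neighbor outside the array as 0. The program name, the variable names, and the small size n = 6 are assumptions made for this example; the padded loop from above serves as a reference to check against.

```fortran
program stencil_conditionals
    implicit none
    integer, parameter :: n = 6          ! small illustrative size (assumption)
    real, allocatable :: halo(:,:)       ! data with an extra layer of zeros
    real, allocatable :: bare(:,:)       ! same data without the zero layer
    real, allocatable :: res_halo(:,:,:), res_bare(:,:,:)
    real :: left, right, down, up
    integer :: i, j

    allocate(halo(0:n+1, 0:n+1), source=0.)
    allocate(bare(n, n))
    allocate(res_halo(n, n, 2), res_bare(n, n, 2))
    call random_number(bare)
    halo(1:n, 1:n) = bare

    ! Reference: loops over the padded array, as in the answer.
    do j = 1, n
        do i = 1, n
            res_halo(i,j,1) = halo(i+1,j) - halo(i-1,j)
            res_halo(i,j,2) = halo(i,j+1) - halo(i,j-1)
        end do
    end do

    ! Same stencil on the unpadded array: a neighbor outside the array counts as 0.
    do j = 1, n
        do i = 1, n
            left = 0.; right = 0.; down = 0.; up = 0.
            if (i > 1) left  = bare(i-1,j)
            if (i < n) right = bare(i+1,j)
            if (j > 1) down  = bare(i,j-1)
            if (j < n) up    = bare(i,j+1)
            res_bare(i,j,1) = right - left
            res_bare(i,j,2) = up - down
        end do
    end do

    if (any(res_halo /= res_bare)) error stop "results differ"
    print *, "conditional and padded loops agree"
end program stencil_conditionals
```

The conditionals save the memory of the zero layer but add branches to the inner loop, so whether this is a good trade is something to measure on the target compiler.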

The timings are interesting:

loops:     0.17528710301849060
array:     0.21127231500577182
pointers:  0.21367537401965819

While the array and pointer assignments yield approximately the same timings, the loop construct is the fastest method. Mind the loop order, though: it made a difference of a factor of 5!
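The loop-order remark can be tried out with a sketch like the following, which times the same stencil once with i as the inner loop (Fortran arrays are column-major, so the first index is contiguous in memory) and once with j as the inner loop. The program name, the variable names, and the size n = 2000 are assumptions for illustration. No timings are quoted here, because they depend on the machine and the flags, and an optimizing compiler may interchange the loops on its own.

```fortran
program loop_order
    implicit none
    integer, parameter :: n = 2000       ! illustrative size (assumption)
    real, allocatable :: array(:,:), res_a(:,:,:), res_b(:,:,:)
    integer :: i, j
    integer :: rate, t0, t1, t2

    allocate(array(0:n+1, 0:n+1), source=0.)
    allocate(res_a(n, n, 2), res_b(n, n, 2))
    call random_number(array(1:n, 1:n))

    call system_clock(count_rate=rate)
    call system_clock(count=t0)

    ! i (first index) innermost: consecutive iterations touch adjacent memory.
    do j = 1, n
        do i = 1, n
            res_a(i,j,1) = array(i+1,j) - array(i-1,j)
            res_a(i,j,2) = array(i,j+1) - array(i,j-1)
        end do
    end do
    call system_clock(count=t1)

    ! j (second index) innermost: consecutive iterations are n+2 elements apart.
    do i = 1, n
        do j = 1, n
            res_b(i,j,1) = array(i+1,j) - array(i-1,j)
            res_b(i,j,2) = array(i,j+1) - array(i,j-1)
        end do
    end do
    call system_clock(count=t2)

    ! Both orders must compute the same values; only the access pattern differs.
    if (any(res_a /= res_b)) error stop "results differ"
    print *, "i innermost: ", real(t1 - t0) / real(rate)
    print *, "j innermost: ", real(t2 - t1) / real(rate)
end program loop_order
```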


UPDATE: I tried to squeeze a bit of performance out of your code and found one small thing. With -O2 and n = 10000, your code runs in 0.95s (pointer part) and 0.30s (direct part).

Transposing your matrix to get a more linear memory access yields a runtime of 0.50s for the pointer part.

parray(i,j)%ptr => array((j-1) * n + i)
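Why this mapping gives more linear access can be seen from plain index arithmetic. The sketch below uses two hypothetical helper functions, idx_orig and idx_transposed (names invented for this example), that reproduce the index expressions of the original and the transposed mapping. It checks that a step in i, the inner loop variable of the timed stencil, moves n elements in the target array with the original mapping and only 1 element with the transposed one.

```fortran
program mapping_stride
    implicit none
    integer, parameter :: n = 5000       ! same n as in the question
    integer :: i, j

    i = 10                               ! arbitrary interior point (assumption)
    j = 20

    print *, "original mapping,   step in i: ", idx_orig(i+1, j) - idx_orig(i, j)
    print *, "transposed mapping, step in i: ", idx_transposed(i+1, j) - idx_transposed(i, j)

    ! The inner loop of the stencil runs over i, so this step is the memory stride.
    if (idx_orig(i+1, j) - idx_orig(i, j) /= n) error stop "unexpected stride"
    if (idx_transposed(i+1, j) - idx_transposed(i, j) /= 1) error stop "unexpected stride"

contains

    ! Index used in the question: parray(i,j)%ptr => array((i-1)*n + j)
    pure integer function idx_orig(i, j)
        integer, intent(in) :: i, j
        idx_orig = (i - 1) * n + j
    end function idx_orig

    ! Index used in the answer: parray(i,j)%ptr => array((j-1)*n + i)
    pure integer function idx_transposed(i, j)
        integer, intent(in) :: i, j
        idx_transposed = (j - 1) * n + i
    end function idx_transposed

end program mapping_stride
```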

IMHO, the problem is the missing information about the pointers, which prevents additional optimization. With -O3 -fopt-info-missed you get complaints about unknown alignment and non-consecutive accesses. The additional factor of 2 compared to my results should stem from the fact that you are using double precision, while my code is written in single precision...
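One way to hand the compiler more of that information, sketched here as an assumption and not as something measured in this thread, is to replace the matrix of per-element pointers with a single array pointer declared with the Fortran 2008 contiguous attribute. This is a different design from the pointer matrix in the question: it keeps the padded array and only shows that the stencil can go through a pointer whose target layout is known. The program and variable names are invented for the example, and whether it changes the timings depends on the compiler.

```fortran
program contiguous_pointer
    implicit none
    integer, parameter :: n = 500                ! illustrative size (assumption)
    real, allocatable, target :: array(:,:)
    real, pointer, contiguous :: p(:,:)          ! Fortran 2008: target is contiguous
    real, allocatable :: res_ptr(:,:,:), res_dir(:,:,:)
    integer :: i, j

    allocate(array(0:n+1, 0:n+1), source=0.)
    allocate(res_ptr(n, n, 2), res_dir(n, n, 2))
    call random_number(array(1:n, 1:n))

    ! One pointer to the whole padded array, with lower bounds 0 like the target.
    p(0:, 0:) => array

    ! Stencil through the array pointer.
    do j = 1, n
        do i = 1, n
            res_ptr(i,j,1) = p(i+1,j) - p(i-1,j)
            res_ptr(i,j,2) = p(i,j+1) - p(i,j-1)
        end do
    end do

    ! Stencil directly on the array, used as a reference.
    do j = 1, n
        do i = 1, n
            res_dir(i,j,1) = array(i+1,j) - array(i-1,j)
            res_dir(i,j,2) = array(i,j+1) - array(i,j-1)
        end do
    end do

    if (any(res_ptr /= res_dir)) error stop "results differ"
    print *, "pointer and direct stencils agree"
end program contiguous_pointer
```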
