Vectorize a loop which accesses non-consecutive memory locations


Problem description

I have a loop with this structure:


Reference : Maxwell Code Example

do z=1,zend
    do y=1,yend
        do x=1,xend
            k=arr(x,y,z)
            do while(k.ne.0)
                ix=fooX(k)
                iy=fooY(k)
                iz=fooZ(k)
                x1=x(ix  ,iy  ,iz)
                x2=x(ix+1,iy  ,iz)
                x3=x(ix  ,iy+1,iz)
                x4=x(ix+1,iy+1,iz)
                x5=x(ix  ,iy  ,iz+1)
                x6=x(ix+1,iy  ,iz+1)
                x7=x(ix  ,iy+1,iz+1)
                x8=x(ix+1,iy+1,iz+1)

                y1=y(ix  ,iy  ,iz)
                y2=y(ix+1,iy  ,iz)
                y3=y(ix  ,iy+1,iz)
                y4=y(ix+1,iy+1,iz)
                y5=y(ix  ,iy  ,iz+1)
                y6=y(ix+1,iy  ,iz+1)
                y7=y(ix  ,iy+1,iz+1)
                y8=y(ix+1,iy+1,iz+1)

                z1=z(ix  ,iy  ,iz)
                z2=z(ix+1,iy  ,iz)
                z3=z(ix  ,iy+1,iz)
                z4=z(ix+1,iy+1,iz)
                z5=z(ix  ,iy  ,iz+1)
                z6=z(ix+1,iy  ,iz+1)
                z7=z(ix  ,iy+1,iz+1)
                z8=z(ix+1,iy+1,iz+1)
                sumX=sumX+x1+x2+x3+x4+x5+x6+x7+x8
                sumY=sumY+y1+y2+y3+y4+y5+y6+y7+y8
                sumZ=sumZ+z1+z2+z3+z4+z5+z6+z7+z8

                k=linkArr(k)
            enddo
        enddo
    enddo
enddo


x1 through x8 are the 8 corners of a rectangular cuboid. There are three challenges in vectorizing this code. First, the 8 array elements are not contiguous in memory. Second is the while loop structure that comes with the linked-list access. Third, the values of ix, iy, iz returned from fooX, fooY, fooZ are not contiguous, so each iteration of the loop has a completely different set of ix, iy, iz and the memory access is scattered even across iterations.

I tried the following approaches:

1. Unrolled the 3-level DO loops as:

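indx=1    ! assumed starting value; kArr must be large enough for all entries
do z=1,zend
    do y=1,yend
        do x=1,xend
            if(arr(x,y,z).NE.0) then
                kArr(indx)=arr(x,y,z)
                ! follow the list; the terminating 0 is overwritten by the next list head
                DO WHILE (kArr(indx).NE.0)
                    indx = indx + 1
                    kArr(indx)=linkArr(kArr(indx-1))
                ENDDO
            endif
        enddo
    enddo
enddo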


With this I got rid of the while loop structure, and I am now able to run one big loop over kArr, inside which I group 8 elements (say my VPU can accommodate 8 sets of data at a time). It did not give a performance improvement. I can post the details if anyone is interested. I need suggestions on how to optimize this code. Another option I tried was to combine the x, y, z data in a single array, so that x1, y1 and z1 are in adjacent memory locations when I compute them.
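The combined-array option was not posted, so the following is only one possible reading of it: a minimal sketch that packs the three fields into a single array with the component as the first (fastest-varying) dimension. The names `interleave_sketch`, `interleave`, `cell_sum`, `fx`, `fy`, `fz` and `xyz` are hypothetical and do not come from the original code.

```fortran
module interleave_sketch
  implicit none
contains

  ! Pack three separate fields into one array whose first dimension is the
  ! component, so the x, y and z values of one grid point are adjacent.
  subroutine interleave(fx, fy, fz, xyz)
    real, intent(in)  :: fx(:,:,:), fy(:,:,:), fz(:,:,:)
    real, intent(out) :: xyz(:,:,:,:)   ! expected shape: (3, nx, ny, nz)
    xyz(1,:,:,:) = fx
    xyz(2,:,:,:) = fy
    xyz(3,:,:,:) = fz
  end subroutine interleave

  ! Sum the 8 corners of the cell whose lowest corner is (i, j, l).
  ! Returns the x, y and z contributions of that one cell.
  pure function cell_sum(xyz, i, j, l) result(s)
    real, intent(in)    :: xyz(:,:,:,:)
    integer, intent(in) :: i, j, l
    real :: s(3)
    integer :: di, dj, dl
    s = 0.0
    do dl = 0, 1
      do dj = 0, 1
        do di = 0, 1
          s = s + xyz(:, i+di, j+dj, l+dl)
        end do
      end do
    end do
  end function cell_sum

end module interleave_sketch
```

With this layout each corner needs one short contiguous read instead of three reads from three different arrays. The 8 corners of a cell are still spread over different rows and planes, so whether this helps depends on the grid size and the target hardware.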

Recommended answer


That while loop is killing you. In a similar situation a few years back, I got a modest improvement in performance doing something like this:

! at top of your code, introduce:
integer :: special_index
integer :: ix(1000), iy(1000), iz(1000)  !promoting scalars to arrays.
                                         ! make as big as possibly needed.

! code as usual until you get to your loops, then

! first, make lookup table
special_index=0
do z=1,zend
  do y=1,yend
    do x=1,xend
      k=arr(x,y,z)
      do while(k.ne.0)
        special_index=special_index+1
        ix(special_index)=fooX(k)
        iy(special_index)=fooY(k)
        iz(special_index)=fooZ(k)
        k=linkArr(k)
      enddo
    enddo
  enddo
enddo
! and now we do the calculation, loop over lookup table:
do n=1,special_index
  x1=x(ix(n)  ,iy(n)  ,iz(n))
  x2=x(ix(n)+1,iy(n)  ,iz(n))
  x3=x(ix(n)  ,iy(n)+1,iz(n))
  ! ... and likewise for x4..x8, y1..y8, z1..z8
enddo


Like I said, this helped me a few years back. Your mileage may vary. The first loop still won't vectorize, but the second one might, and it might give better performance.
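For reference, the two-pass idea can be written out in full roughly as follows. This is a sketch under assumptions, not the answerer's actual code: `fooX`, `fooY` and `fooZ` are treated as integer lookup arrays (in the question they may be functions), the field is passed as `f` to avoid the name clash with the loop variables, and the names `corner_sum_sketch`, `build_lookup` and `sum_corners` are hypothetical. The `!$omp simd` line is a standard OpenMP directive that is ignored as a comment when OpenMP support is not enabled.

```fortran
module corner_sum_sketch
  implicit none
  private
  public :: build_lookup, sum_corners
contains

  ! Pass 1 (scalar): walk every linked list once and record the cell
  ! indices, so the pointer chasing is separated from the arithmetic.
  subroutine build_lookup(arr, linkArr, fooX, fooY, fooZ, ix, iy, iz, n)
    integer, intent(in)  :: arr(:,:,:)    ! list head per cell, 0 = empty
    integer, intent(in)  :: linkArr(:)    ! next list element, 0 = end
    integer, intent(in)  :: fooX(:), fooY(:), fooZ(:)  ! assumed lookup arrays
    integer, intent(out) :: ix(:), iy(:), iz(:)  ! must hold all list elements
    integer, intent(out) :: n             ! number of entries written
    integer :: i, j, l, k

    n = 0
    do l = 1, size(arr, 3)
      do j = 1, size(arr, 2)
        do i = 1, size(arr, 1)
          k = arr(i, j, l)
          do while (k /= 0)
            n = n + 1
            ix(n) = fooX(k)
            iy(n) = fooY(k)
            iz(n) = fooZ(k)
            k = linkArr(k)
          end do
        end do
      end do
    end do
  end subroutine build_lookup

  ! Pass 2: a countable loop whose only loop-carried dependence is the
  ! reduction, which gives the compiler a chance to use gather loads.
  function sum_corners(f, ix, iy, iz, n) result(total)
    real, intent(in)    :: f(:,:,:)
    integer, intent(in) :: ix(:), iy(:), iz(:)
    integer, intent(in) :: n
    real :: total
    integer :: m

    total = 0.0
    !$omp simd reduction(+:total)
    do m = 1, n
      total = total &
            + f(ix(m)  , iy(m)  , iz(m)  ) + f(ix(m)+1, iy(m)  , iz(m)  ) &
            + f(ix(m)  , iy(m)+1, iz(m)  ) + f(ix(m)+1, iy(m)+1, iz(m)  ) &
            + f(ix(m)  , iy(m)  , iz(m)+1) + f(ix(m)+1, iy(m)  , iz(m)+1) &
            + f(ix(m)  , iy(m)+1, iz(m)+1) + f(ix(m)+1, iy(m)+1, iz(m)+1)
    end do
  end function sum_corners

end module corner_sum_sketch
```

Calling `sum_corners` once per field (for the x, y and z arrays of the question) would give sumX, sumY and sumZ from the same lookup table. Whether the second loop is actually vectorized depends on the compiler and on hardware gather support, so it is worth checking the compiler's vectorization report before drawing conclusions from timings.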
