如何选择二维阵列A(i,j)的最佳配置 [英] How to choose the best configuration of 2D array A(i,j)

查看:137
本文介绍了如何选择二维阵列A(i,j)的最佳配置的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

希望你能让我解释这件事。我正在与Fortran合作,实际上是在CFD主题上编写我的代码,下面(仅为了简单起见,仅举个例子)就是对我的案例的简短解释:


  1. 我应该使用一个二维数组A(i,j)和一个一维数组B(i)我必须做2次循环,这是第一个循环次数应为50,000次,第二次循环次数为5次(不能更改)。

  2. 上面的点号2应循环10000次。

我用2个版本编写代码(我称它们为Prog_A和Prog_B)。



第一个如下所示:

  PROGRAM PROG_A 
REAL * 8,DIMENSION(50000,5):: A
REAL * 8,DIMENSION(50000):: B
REAL * 8 :: TIME1,TIME2
!初始值
DO I = 1,50000
A(I,1)= 1.0
A(I,2)= 2.0
A(I,3)= 3.0
A(I,4)= 4.0
A(I,5)= 5.0

B(I)= I
END DO
!计算机运行时间(开始)
CALL CPU_TIME(TIME1)
DO K = 1,100000
DO I = 1,50000!应该首先计算50,000个元素的数组(不能更改)
DO J = 1,5
A(I,J)= A(I,J)+ SQRT(B(I))
END DO
END DO
END DO
!计算机运行时间(完成)
CALL CPU_ TIME(TIME2)
PRINT *,'Elapsed real time =',TIME2-TIME1,'second(s)'
END PROGRAM PROG_A

第二个是:

  PROGRAM PROG_B 
REAL * 8,DIMENSION(5,50000):: A
REAL * 8,DIMENSION(50000):: B
REAL * 8 :: TIME1,TIME2
!仅用于初始值$ b (2,J)= 2.0
A(3,J)= 3.0
A(1,J)= 1.0
A 4,J)= 4.0
A(5,J)= 5.0

B(J)= J
END DO
!计算机的运行时间(开始)
CALL CPU_TIME(TIME1)
DO K = 1,100000
DO J = 1,50000!应该首先计算50,000个元素的数组(不能更改)
DO I = 1,5
A(I,J)= A(I,J)+ SQRT(B(J))
END DO
END DO
END DO
!计算机的运行时间(完成)
CALL CPU_TIME(TIME2)
PRINT *,'经过实时=',TIME2-TIME1''秒s)'
END PROGRAM PROG_B

正如您所看到的,第一个我使用二维数组A(50000,5),第二个使用二维数组A(5,50000)。

据我所知,由于Fortran基于列专业,所以第二种情况会比第一种情况更快,因为我执行(在第二种情况下)循环阵列的最内侧(在这种情况下,i = 1,...,5)。



但是在gfortran编译(使用-O3优化)后,我发现第二个比第一个慢得多。结果如下:


  1. 第一种情况:消耗时间= 29.187 s

  2. 第二种情况:已消耗时间= 70.496 s

任何人都可以解释为什么?

这两种情况的结果都是一样的。

解决方案

优化的一部分基于实施选择的启发式方法。
您不会得到完全确定性的解释。
当编写允许像
Fortran这样的完全数组操作的语言编写程序时,要点是:
,如果您可以将某些东西表达为完整的数组/矢量操作,请执行此操作并让编译器为您生成循环。这样,唯一要做的就是用你的编译器测试最高效率并坚持下去。



请记住,它依赖于编译器,所以它会随着编译器。
例如,如果我按照原样(将k次迭代次数
从100000更改为10000来加速,每次平均5次),则
这是我的电脑有不同的编译器。
我没有时间检查新版本的编译器。

  pgf90 14 
A:5.05 s
B:0.18 s

gfortran 4.9
A:1.05 s
B:4.02 s

借口13
A:2.01 s
B:1.08 s

您可以看到gfortran告诉你B是坏的,pgfortran
告诉相反的结果,并完全吹出来自A的结果。现在,如果我通过向量化来让编译器完成这项工作。在这里,我只修改了A
,并删除了I循环来获得这个:

$ p $ DO K = 1,10000
DO = 1,5
A(I,J)= A(I,J)+ SQRT(B(I))
END DO
END DO

然后我们得到这个(只有程序A)

  pgf90 14:5.05 s 

gfortran 4.9:5.04

ifort 13:2.02 s

pgfortran和ifort是稳定的,而gfortran正在利用第一种情况下的诡计,可能是haraldkl的建议(请参阅第5项)。当我们进行矢量化时,诀窍并不明显,
gfortran表现不佳。看来,ifort和pgfortran
只是简单地重写循环,让你有正确的顺序。



如果我变得更聪明并且消除K循环

  DO J = 1,5 
A(:,J)= A(:,J)+ 10000 * SQRT (B(:))!这似乎是最终结果
END DO

然后我们得到这个(只有程序A)

  pgf90 14:0.001 

gfortran 4.9:0.001

ifort 13 :0.001 s

所有编译器都变得相当,因为几乎没有任何东西需要优化。
您可以看到,只需使用数组操作即可优化所有内容。






更新
高性能Mark在评论中指出,如果编译器发现结果未被使用,实际上可能会跳过所有计算,这可能会在某些实现中发生。在这个答案中提出的结果说明了这种可能性,尽管我没有提到它。为了防止编译器完全跳过代码,我在计算之后打印了结果的所有元素的总和(在计时之后);结果与小数点后的第三位相同,这足以满足〜372684326034.9146(〜10 ^ 12)的结果。这足以确保编译器执行实际计算。我完全忘了在答案中提到它。


Hopefully you could me explaining this thing. I'm working with Fortran and actually writing my code on CFD topic, and below (just for sake of simplicity and just for example) are short explanations of my case:

  1. I should use one 2D array A(i,j) and one 1D array B(i)
  2. I have to do 2 times looping, which is the first looping should be 50,000 times and the second one is 5 times (can't be changed).
  3. The point number 2 above should be looped 10,000 times.

I write the codes with 2 versions (I called them Prog_A and Prog_B).

The first one is as shown below:

PROGRAM PROG_A
 REAL*8, DIMENSION(50000,5):: A
 REAL*8, DIMENSION(50000)::B  
 REAL*8:: TIME1,TIME2
       !Just for initial value
           DO I=1, 50000
              A(I,1)=1.0
              A(I,2)=2.0
              A(I,3)=3.0
              A(I,4)=4.0
              A(I,5)=5.0

              B(I)=I 
           END DO
       !Computing computer's running time (start)
           CALL CPU_TIME(TIME1)
                 DO K=1, 100000
                    DO I=1, 50000             !Array should be computed first for 50,000 elements (can't be changed)
                        DO J=1, 5
                           A(I,J)=A(I,J)+SQRT(B(I))   
                        END DO
                    END DO
                 END DO
       !Computing computer's running time (finish)
           CALL CPU_TIME(TIME2)
             PRINT *, 'Elapsed real time = ', TIME2-TIME1, 'second(s)'
END PROGRAM PROG_A

The second one is:

PROGRAM PROG_B
 REAL*8, DIMENSION(5,50000):: A
 REAL*8, DIMENSION(50000)::B  
 REAL*8:: TIME1,TIME2
       !Just for initial value
           DO J=1, 50000
              A(1,J)=1.0
              A(2,J)=2.0
              A(3,J)=3.0
              A(4,J)=4.0
              A(5,J)=5.0

              B(J)=J 
           END DO
       !Computing computer's running time (start)
           CALL CPU_TIME(TIME1)
                 DO K=1, 100000
                    DO J=1, 50000             !Array should be computed first for 50,000 elements (can't be changed)
                        DO I=1, 5
                           A(I,J)=A(I,J)+SQRT(B(J))   
                        END DO
                    END DO
                 END DO
       !Computing computer's running time (finish)
           CALL CPU_TIME(TIME2)
             PRINT *, 'Elapsed real time = ', TIME2-TIME1, 'second(s)'
END PROGRAM PROG_B

As you can see the different is for the first one I used 2D array A(50000,5) and for the second one I used 2D array A(5,50000).

To my knowledge, since Fortran is based on "column major", so the second case would be faster than the first one, since I performed (in the second one) the looping for the most inner side of array (in this case i=1, ..., 5).

But after compiled on gfortran (with -O3 optimization), I've found that the second one is even much slower than the first one. Here is the result:

  1. First case : elapsed time = 29.187 s
  2. Second case : elapsed time = 70.496 s

Could anyone explain me why?

PS: The results of both cases are same for sure.

解决方案

A part of the optimization is based on heuristics chosen by the implementation. You are not going to get a totally deterministic explanation. When writing programs in languages that allows full array operations like Fortran, the main point is: if you can express something as a full array / or vector operation, do it and let the compiler generates the loops for you. That way, the only thing to do is to test with your compiler the most efficient and stick with it.

And remember, it is compiler dependent so it will change with the compiler. For example, if I take your programs as is (changing the number of k iterations from 100000 to 10000 to speed it, and averaging 5 runs for each), this is the timing on my computer with different compilers. I don't have the time to check with new versions of compiler for you.

pgf90 14
A: 5.05 s
B: 0.18 s

gfortran 4.9
A: 1.05 s
B: 4.02 s

ifort 13
A: 2.01 s
B: 1.08 s

You can see that where gfortran tells you that B is bad, pgfortran tells the opposite and totally blows the results from A

Now if I vectorize to let the compiler do the job. Here I modify only A and eliminate the I loop to have this:

DO K=1, 10000
    DO J=1, 5
        A(I,J)=A(I,J)+SQRT(B(I))
    END DO
END DO

Then we get this (only program A)

pgf90 14: 5.05 s

gfortran 4.9: 5.04

ifort 13: 2.02 s

pgfortran and ifort are stable while gfortran is exploiting a trick in the first case, possibly the suggestion of haraldkl (see the factor 5). When we vectorize, the trick is not obvious gfortran does not perform well. It seems that ifort and pgfortran simply rewrite the loop for you to have the right ordering.

And if I get smarter and elimite the K-loop too

    DO J=1, 5
        A(:,J)=A(:,J)+10000*SQRT(B(:)) ! this seems to be final result
    END DO

Then we get this (only program A)

pgf90 14: 0.001

gfortran 4.9: 0.001

ifort 13: 0.001 s

All the compilers become equivalent because there is almost nothing to optimize. You see that you can optimize everything be simply using array operations.


Update High Performance Mark pointed out in comment that the compiler might actually skip all the computation if it found that the result is not used, which might happen with some implementations. The results presented in this answer accounted for that possibility, even though I did not mention it in the first place. To prevent the compiler from skipping the code entirely, I printed the sum of all the elements of the result after the computation (after the timing); The result is identical to the 3rd digit after the decimal, which is good enough for a result of ~372684326034.9146 (~10^12). This is enough to ensure that the compiler does the actual computation. I totally forgot to mention it in the answer.

这篇关于如何选择二维阵列A(i,j)的最佳配置的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆