Intel Fortran vectorisation: vector loop cost higher than scalar

Question

I'm testing and optimising a legacy code with Intel Fortran 15, and I have this simple loop:

do ir=1,N(lev)
  G1(lev)%D(ir) = 0.d0
  G2(lev)%D(ir) = 0.d0
enddo

where lev is equal to some integer.

Structures and indexes are quite complex for the compiler, but it can succeed in the task, as I can see on other lines. Now, on the above loop, I get this from the compilation report:

LOOP BEGIN at MLFMATranslationProd.f90(38,2)
  remark #15399: vectorization support: unroll factor set to 4
  remark #15300: LOOP WAS VECTORIZED
  remark #15462: unmasked indexed (or gather) loads: 2
  remark #15475: --- begin vector loop cost summary ---
  remark #15476: scalar loop cost: 12
  remark #15477: vector loop cost: 20.000
  remark #15478: estimated potential speedup: 2.340
  remark #15479: lightweight vector operations: 5
  remark #15481: heavy-overhead vector operations: 1
  remark #15488: --- end vector loop cost summary ---
LOOP END

My question is: how is it that the vector loop cost is higher than the scalar one? What can I do to go towards the estimated potential speedup?

Answer

The loop cost is an estimate of the duration of one loop iteration. One vectorized iteration takes somewhat longer than one scalar iteration, but it processes several array elements at once.

In your case the speedup is roughly 12 / 20 * 4 = 2.4, close to the reported 2.340, because one vectorized iteration processes 4 double precision array elements (probably using AVX instructions).
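
The arithmetic above can be written out as a small program. This is only a sketch of the estimate, not the compiler's actual cost model: the two costs are copied from the report in the question, and the figure of 4 elements per vector iteration is the answer's assumption (256-bit AVX registers holding 4 double precision values). The program and variable names are made up for illustration.

```fortran
program speedup_estimate
  implicit none
  ! Costs copied from the optimization report in the question.
  real, parameter :: scalar_cost = 12.0   ! remark #15476: one scalar iteration
  real, parameter :: vector_cost = 20.0   ! remark #15477: one vector iteration
  ! Assumption: one vector iteration handles 4 double precision elements (AVX).
  integer, parameter :: elems_per_vector_iter = 4
  real :: speedup

  ! Cost per element is scalar_cost in the scalar loop and
  ! vector_cost / elems_per_vector_iter in the vector loop;
  ! the speedup is the ratio of the two.
  speedup = scalar_cost * real(elems_per_vector_iter) / vector_cost

  print '(a,f6.3)', 'estimated speedup: ', speedup
end program speedup_estimate
```

The result is about 2.4. The compiler reports 2.340, so its own model presumably accounts for some extra overhead that this simple ratio ignores.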
