Why does GCC or Clang not optimise reciprocal to 1 instruction when using fast-math


Question

Does anyone know why GCC/Clang will not optimise the function test1 in the code sample below to use just the RCPPS instruction when the fast-math option is enabled? Is there another compiler flag that would generate this code?

typedef float float4 __attribute__((vector_size(16)));

float4 test1(float4 v)
{
    return 1.0f / v;
}

You can see the compiled output here: https://goo.gl/jXsqat

Solution

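Because the precision of RCPPS is a lot lower than that of float division: RCPPS only returns an estimate (Intel documents a maximum relative error of 1.5 × 2^-12, roughly 12 bits), whereas division is correctly rounded.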

An option to enable that optimization would not be appropriate as part of -ffast-math.
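To get a feel for how large the gap is, the following sketch compares the raw RCPPS estimate with a true division. It assumes an x86 target with SSE and uses the `_mm_rcp_ps` and `_mm_div_ps` intrinsics from `<xmmintrin.h>`; the function name `max_rcpps_rel_error` is made up for this illustration and is not part of the original question or answer.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_rcp_ps, _mm_div_ps */

/* Hypothetical helper: largest relative error of the raw RCPPS estimate
   over four nonzero inputs, measured against a real DIVPS division. */
float max_rcpps_rel_error(const float in[4])
{
    for (int i = 0; i < 4; ++i)
        assert(in[i] != 0.0f);  /* relative error is undefined for 1/0 */

    __m128 v      = _mm_loadu_ps(in);
    __m128 approx = _mm_rcp_ps(v);                     /* RCPPS estimate */
    __m128 exact  = _mm_div_ps(_mm_set1_ps(1.0f), v);  /* DIVPS result   */

    float a[4], e[4];
    _mm_storeu_ps(a, approx);
    _mm_storeu_ps(e, exact);

    float worst = 0.0f;
    for (int i = 0; i < 4; ++i) {
        float rel = (a[i] - e[i]) / e[i];
        if (rel < 0.0f)
            rel = -rel;
        if (rel > worst)
            worst = rel;
    }
    return worst;
}
```

For inputs such as {3, 7, 0.3, 123.456} the returned value should stay below the documented bound of 1.5 × 2^-12 (about 3.7e-4), which is still thousands of times larger than the rounding error of a single-precision division.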

The x86 target options section of the GCC manual says there is in fact an option that (together with -ffast-math) does get GCC to use them (with a Newton-Raphson iteration):

  • -mrecip This option enables use of RCPSS and RSQRTSS instructions (and their vectorized variants RCPPS and RSQRTPS) with an additional Newton-Raphson step to increase precision instead of DIVSS and SQRTSS (and their vectorized variants) for single-precision floating-point arguments. These instructions are generated only when -funsafe-math-optimizations is enabled together with -finite-math-only and -fno-trapping-math. Note that while the throughput of the sequence is higher than the throughput of the non-reciprocal instruction, the precision of the sequence can be decreased by up to 2 ulp (i.e. the inverse of 1.0 equals 0.99999994).

    Note that GCC implements 1.0f/sqrtf(x) in terms of RSQRTSS (or RSQRTPS) already with -ffast-math (or the above option combination), and doesn't need -mrecip.

    Also note that GCC emits the above sequence with additional Newton-Raphson step for vectorized single-float division and vectorized sqrtf(x) already with -ffast-math (or the above option combination), and doesn't need -mrecip.

  • -mrecip=opt

This option controls which reciprocal estimate instructions may be used. opt is a comma-separated list of options, which may be preceded by a ‘!’ to invert the option:

‘all’
    Enable all estimate instructions.
‘default’
    Enable the default instructions, equivalent to -mrecip.
‘none’
    Disable all estimate instructions, equivalent to -mno-recip.
‘div’
    Enable the approximation for scalar division.
‘vec-div’
    Enable the approximation for vectorized division.
‘sqrt’
    Enable the approximation for scalar square root.
‘vec-sqrt’
    Enable the approximation for vectorized square root.

So, for example, -mrecip=all,!sqrt enables all of the reciprocal approximations, except for square root.
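The Newton-Raphson step mentioned in the manual text can be sketched as follows. Given an estimate x0 of 1/d, one step computes x1 = x0 * (2 - d * x0); if d * x0 = 1 + e, then d * x1 = 1 - e^2, so the relative error is roughly squared, which is how a 12-bit estimate gets close to full single precision. The sketch reuses the `float4` typedef from the question (a GCC/Clang vector extension). The function name `refine_reciprocal` is hypothetical, and the initial estimate is passed in as a parameter so the code does not depend on any particular instruction. It is an illustration of the technique, not the exact sequence GCC emits.

```c
#include <assert.h>

typedef float float4 __attribute__((vector_size(16)));  /* as in the question */

/* Hypothetical helper: refine an approximate reciprocal x0 ~ 1/d with
   `steps` Newton-Raphson iterations, x <- x * (2 - d * x), applied
   element-wise to all four lanes. */
float4 refine_reciprocal(float4 d, float4 x0, int steps)
{
    assert(steps >= 0);
    float4 x = x0;
    for (int i = 0; i < steps; ++i)
        x = x * (2.0f - d * x);  /* relative error e becomes about e^2 */
    return x;
}
```

Starting from estimates that are off by about 0.1%, a single step should bring d * x to within roughly 1e-6 of 1. The cost is two multiplies and a subtract per step (or a couple of FMAs), which is what the estimate instruction has to beat a plain division by.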

Note that Intel's new Skylake design further improves FP division performance, to 8-11c latency and one per 3c throughput for 128b vectors (or one per 5c throughput for 256b vectors, with the same latency for vdivps). They widened the dividers, so AVX vdivps ymm now has the same latency as for 128b vectors.

(SnB to Haswell did 256b div and sqrt with about twice the latency / recip-throughput, so they clearly only had 128b-wide dividers.) Skylake also pipelines both operations more, so about 4 div operations can be in flight. sqrt is faster, too.

So in several years, once Skylake is widespread, it'll only be worth doing rcpps if you need to divide by the same thing multiple times. rcpps and a couple of FMAs might have slightly higher throughput but worse latency. Also, vdivps is only a single uop, so more execution resources will be available for other work to happen at the same time as the division.
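The "divide by the same thing multiple times" case can be sketched in portable C: compute the reciprocal once, then replace every division with a multiplication. The function name `scale_by_reciprocal` is made up for this illustration. The sketch uses one full-precision division for the reciprocal rather than rcpps, which is an assumption chosen to keep it portable; an rcpps estimate plus a Newton-Raphson step could take its place.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: divide every element of in[] by the same divisor d.
   The reciprocal is computed once, so the loop contains only multiplies. */
void scale_by_reciprocal(float *out, const float *in, size_t n, float d)
{
    assert(d != 0.0f);
    const float inv = 1.0f / d;  /* one division, hoisted out of the loop */
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * inv;    /* n multiplications instead of n divisions */
}
```

Note that `in[i] * inv` can differ from `in[i] / d` in the last bit, because the reciprocal is rounded before the multiplication. That is the kind of change a compiler is only allowed to make on its own under options such as -ffast-math.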

It remains to be seen what the initial implementation of AVX512 will be like. Presumably rcpps and a couple of FMAs for Newton-Raphson iterations will be a win if FP division performance is a bottleneck. If uop throughput is a bottleneck and there's plenty of other work to do while the divisions are in flight, vdivps zmm is probably still good (unless the same divisor is used repeatedly, of course).
