Optimize for fast multiplication but slow addition: FMA and doubledouble


Question


When I first got a Haswell processor I tried implementing FMA to determine the Mandelbrot set. The main algorithm is this:

intn = 0;
for(int32_t i=0; i<maxiter; i++) {
    floatn x2 = square(x), y2 = square(y); //square(x) = x*x
    floatn r2 = x2 + y2;
    booln mask = r2<cut; //booln is in the float domain, not the integer domain
    if(!horizontal_or(mask)) break; //_mm256_testz_pd(mask)
    n -= mask;
    floatn t = x*y; mul2(t); //mul2(t): t*=2
    x = x2 - y2 + cx;
    y = t + cy;
}

This determines if n pixels are in the Mandelbrot set. So for double floating point it runs over 4 pixels (floatn = __m256d, intn = __m256i). This requires four SIMD floating-point multiplications and four SIMD floating-point additions.
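
For reference, the floatn/booln types and the square, mul2, and horizontal_or helpers are thin wrappers over AVX intrinsics. A minimal sketch of how they might be defined (the definitions below are my assumption; the original code most likely uses a vector class library with operator overloads, and intn would similarly wrap __m256i):

#include <immintrin.h>

typedef __m256d floatn;  // 4 doubles per register
typedef __m256d booln;   // compare results kept in the floating-point domain

static inline floatn square(floatn x) { return _mm256_mul_pd(x, x); }           // x*x
static inline void   mul2(floatn &x)  { x = _mm256_add_pd(x, x); }              // x *= 2 (exact for doubles)
static inline booln  less_than(floatn a, floatn b) { return _mm256_cmp_pd(a, b, _CMP_LT_OQ); } // r2 < cut
static inline bool   horizontal_or(booln m) { return !_mm256_testz_pd(m, m); }  // true if any lane is set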

Then I modified this to work with FMA like this

intn n = 0; 
for(int32_t i=0; i<maxiter; i++) {
    floatn r2 = mul_add(x,x,y*y);
    booln mask = r2<cut;
    if(!horizontal_or(mask)) break;
    add_mask(n,mask);
    floatn t = x*y;
    x = mul_sub(x,x, mul_sub(y,y,cx));
    y = mul_add(2.0f,t,cy);
}

where mul_add calls _mm256_fmadd_pd and mul_sub calls _mm256_fmsub_pd. This method uses four FMA SIMD operations and two SIMD multiplications, which is two fewer arithmetic operations than without FMA. Additionally, FMA and multiplication can use two ports while addition can use only one.
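
A minimal sketch of what those wrappers might look like over the FMA intrinsics (assuming floatn is __m256d as above; compile with FMA support, e.g. -mfma):

static inline __m256d mul_add(__m256d a, __m256d b, __m256d c) {
    return _mm256_fmadd_pd(a, b, c);      // a*b + c with a single rounding
}
static inline __m256d mul_sub(__m256d a, __m256d b, __m256d c) {
    return _mm256_fmsub_pd(a, b, c);      // a*b - c with a single rounding
}
static inline __m256d mul_add(double a, __m256d b, __m256d c) {
    return _mm256_fmadd_pd(_mm256_set1_pd(a), b, c);  // broadcast the scalar first, as in mul_add(2.0f, t, cy)
}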

To make my tests less biased I zoomed into a region which is entirely in the Mandelbrot set so all the values are maxiter. In this case the method using FMA is about 27% faster. That's certainly an improvement but going from SSE to AVX doubled my performance so I was hoping for maybe another factor of two with FMA.

But then I found this answer regarding FMA, which says

The important aspect of the fused-multiply-add instruction is the (virtually) infinite precision of the intermediate result. This helps with performance, but not so much because two operations are encoded in a single instruction — It helps with performance because the virtually infinite precision of the intermediate result is sometimes important, and very expensive to recover with ordinary multiplication and addition when this level of precision is really what the programmer is after.

and later gives an example of double*double to double-double multiplication

high = a * b; /* double-precision approximation of the real product */
low = fma(a, b, -high); /* remainder of the real product */
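
In AVX terms that two-line idiom is one vmulpd plus one vfmsub-type instruction. A sketch (two_prod is my name for it, not from the quoted answer):

// Error-free product: hi + lo == a*b exactly (requires FMA).
static inline void two_prod(__m256d a, __m256d b, __m256d &hi, __m256d &lo) {
    hi = _mm256_mul_pd(a, b);          // rounded product
    lo = _mm256_fmsub_pd(a, b, hi);    // a*b - hi computed with one rounding: the exact remainder
}

The hi/lo pair is exactly what a double-double product needs, which is why hardware FMA helps df64_mult so much more than it helps df64_add.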

From this, I concluded that I was implementing FMA sub-optimally, so I decided to implement SIMD double-double. I implemented double-double based on the paper Extended-Precision Floating-Point Numbers for GPU Computation. The paper is for double-float so I modified it for double-double. Additionally, instead of packing one double-double value into a single SIMD register, I pack 4 double-double values into one AVX register for the high parts and one for the low parts.

For the Mandelbrot set what I really need is double-double multiplication and addition. In that paper these are the df64_add and df64_mult functions. The image below shows the assembly for my df64_mult function for software FMA (left) and hardware FMA (right). This clearly shows that hardware FMA is a big improvement for double-double multiplication.

So how does hardware FMA perform in the double-double Mandelbrot set calculation? The answer is that it's only about 15% faster than with software FMA. That's much less than I hoped for. The double-double Mandelbrot calculation needs four double-double additions and four double-double multiplications (x*x, y*y, x*y, and 2*(x*y)). However, the 2*(x*y) multiplication is trivial for double-double, so it can be ignored in the cost. Therefore, I think the reason the improvement from hardware FMA is so small is that the calculation is dominated by the slow double-double addition (see the assembly below).
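
To see why, here is the error-free two_sum building block that double-double addition is assembled from: a sketch of the standard branch-free six-operation version (with doubledoublen as a hypothetical hi/lo pair of __m256d; this is not the author's exact df64_add). Every add/sub below is serially dependent and competes for port 1 on Haswell:

struct doubledoublen { __m256d hi, lo; };  // 4 packed double-double values: high and low limbs

// Error-free sum: hi + lo == a + b exactly, with no branch on |a| vs |b|.
static inline doubledoublen two_sum(__m256d a, __m256d b) {
    doubledoublen r;
    r.hi       = _mm256_add_pd(a, b);
    __m256d bv = _mm256_sub_pd(r.hi, a);   // the part of b that landed in hi
    __m256d av = _mm256_sub_pd(r.hi, bv);  // the part of a that landed in hi
    __m256d be = _mm256_sub_pd(b, bv);     // rounding error contributed by b
    __m256d ae = _mm256_sub_pd(a, av);     // rounding error contributed by a
    r.lo       = _mm256_add_pd(ae, be);
    return r;
}

A full double-double addition chains several of these dependent add/sub sequences, which is how the 20 port-1 instructions in the assembly below arise.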

It used to be that multiplication was slower than addition (and programmers used several tricks to avoid multiplication), but with Haswell it seems to be the other way around: not only because of FMA, but also because multiplication can use two ports while addition can use only one.

So my questions (finally) are:

  1. How does one optimize when addition is slow compared to multiplication?
  2. Is there an algebraic way to change my algorithm to use more multiplications and fewer additions? I know there are methods to do the reverse, e.g. (x+y)*(x+y) - (x*x+y*y) = 2*x*y, which uses two more additions for one less multiplication.
  3. Is there a way to simplify the df64_add function (e.g. using FMA)?

In case anyone is wondering, the double-double method is about ten times slower than plain double. That's not so bad, I think: if there were a hardware quad-precision type it would likely be at least twice as slow as double, so my software method is about five times slower than what I would expect from such hardware if it existed.

df64_add assembly

vmovapd 8(%rsp), %ymm0
movq    %rdi, %rax
vmovapd 72(%rsp), %ymm1
vmovapd 40(%rsp), %ymm3
vaddpd  %ymm1, %ymm0, %ymm4
vmovapd 104(%rsp), %ymm5
vsubpd  %ymm0, %ymm4, %ymm2
vsubpd  %ymm2, %ymm1, %ymm1
vsubpd  %ymm2, %ymm4, %ymm2
vsubpd  %ymm2, %ymm0, %ymm0
vaddpd  %ymm1, %ymm0, %ymm2
vaddpd  %ymm5, %ymm3, %ymm1
vsubpd  %ymm3, %ymm1, %ymm6
vsubpd  %ymm6, %ymm5, %ymm5
vsubpd  %ymm6, %ymm1, %ymm6
vaddpd  %ymm1, %ymm2, %ymm1
vsubpd  %ymm6, %ymm3, %ymm3
vaddpd  %ymm1, %ymm4, %ymm2
vaddpd  %ymm5, %ymm3, %ymm3
vsubpd  %ymm4, %ymm2, %ymm4
vsubpd  %ymm4, %ymm1, %ymm1
vaddpd  %ymm3, %ymm1, %ymm0
vaddpd  %ymm0, %ymm2, %ymm1
vsubpd  %ymm2, %ymm1, %ymm2
vmovapd %ymm1, (%rdi)
vsubpd  %ymm2, %ymm0, %ymm0
vmovapd %ymm0, 32(%rdi)
vzeroupper
ret

Solution

To answer my third question I found a faster solution for double-double addition. I found an alternative definition in the paper Implementation of float-float operators on graphics hardware.

Theorem 5 (Add22 theorem) Let be ah+al and bh+bl the float-float arguments of the following
algorithm:
Add22(ah, al, bh, bl)
1 r = ah ⊕ bh
2 if |ah| ≥ |bh| then
3     s = (((ah ⊖ r) ⊕ bh) ⊕ bl) ⊕ al
4 else
5     s = (((bh ⊖ r) ⊕ ah) ⊕ al) ⊕ bl
6 (rh, rl) = add12(r, s)
7 return (rh, rl)

Here is how I implemented this (pseudo-code):

static inline doubledoublen add22(doubledoublen const &a, doubledoublen const &b) {
    doublen aa,ab,ah,bh,al,bl;
    booln mask;
    aa = abs(a.hi);                //_mm256_and_pd
    ab = abs(b.hi); 
    mask = aa >= ab;               //_mm256_cmple_pd
    // z = select(cut,x,y) is a SIMD version of z = cut ? x : y;
    ah = select(mask,a.hi,b.hi);   //_mm256_blendv_pd
    bh = select(mask,b.hi,a.hi);
    al = select(mask,a.lo,b.lo);
    bl = select(mask,b.lo,a.lo);

    doublen r, s;
    r = ah + bh;
    s = (((ah - r) + bh) + bl ) + al;
    return two_sum(r,s);
}
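
A possible direct translation of that pseudocode to AVX intrinsics, building on the doubledoublen/two_sum sketch given earlier (the sign-bit mask for abs and the blendv-based select are my assumptions about how abs and select map to intrinsics):

static inline doubledoublen add22(const doubledoublen &a, const doubledoublen &b) {
    const __m256d absmask =
        _mm256_castsi256_pd(_mm256_set1_epi64x(0x7fffffffffffffffLL));   // clears the sign bit
    __m256d aa   = _mm256_and_pd(a.hi, absmask);                         // |a.hi|
    __m256d ab   = _mm256_and_pd(b.hi, absmask);                         // |b.hi|
    __m256d mask = _mm256_cmp_pd(ab, aa, _CMP_LE_OQ);                    // |a.hi| >= |b.hi|

    // blendv picks the second source where the mask is set
    __m256d ah = _mm256_blendv_pd(b.hi, a.hi, mask);
    __m256d bh = _mm256_blendv_pd(a.hi, b.hi, mask);
    __m256d al = _mm256_blendv_pd(b.lo, a.lo, mask);
    __m256d bl = _mm256_blendv_pd(a.lo, b.lo, mask);

    __m256d r = _mm256_add_pd(ah, bh);
    __m256d s = _mm256_add_pd(
                    _mm256_add_pd(
                        _mm256_add_pd(_mm256_sub_pd(ah, r), bh), bl), al);
    return two_sum(r, s);   // renormalize (r, s) into a hi/lo pair
}

The vandpd/vcmppd/vblendvpd/vaddpd sequence in the IACA listing below corresponds closely to this.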

This definition of Add22 uses 11 additions instead of 20, but it requires some additional code to determine if |ah| >= |bh|. Here is a discussion on how to implement SIMD minmag and maxmag functions. Fortunately, most of the additional code does not use port 1. Now only 12 instructions go to port 1 instead of 20.
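
For illustration, magnitude-based selection can be built from the same abs-and-blendv pieces; a sketch (my assumption, not necessarily the approach from that discussion):

// maxmag/minmag: per lane, return the operand with the larger / smaller magnitude.
static inline __m256d maxmag(__m256d a, __m256d b) {
    const __m256d absmask = _mm256_castsi256_pd(_mm256_set1_epi64x(0x7fffffffffffffffLL));
    __m256d m = _mm256_cmp_pd(_mm256_and_pd(b, absmask), _mm256_and_pd(a, absmask), _CMP_LE_OQ);
    return _mm256_blendv_pd(b, a, m);   // take a where |a| >= |b|
}
static inline __m256d minmag(__m256d a, __m256d b) {
    const __m256d absmask = _mm256_castsi256_pd(_mm256_set1_epi64x(0x7fffffffffffffffLL));
    __m256d m = _mm256_cmp_pd(_mm256_and_pd(b, absmask), _mm256_and_pd(a, absmask), _CMP_LE_OQ);
    return _mm256_blendv_pd(a, b, m);   // take b where |b| <= |a|
}

As the IACA listing below shows, the vandpd and vblendvpd operations land on port 5, so this extra selection work stays off port 1.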

Here is a throughput analysis from IACA for the new Add22

Throughput Analysis Report
--------------------------
Block Throughput: 12.05 Cycles       Throughput Bottleneck: Port1

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.0    0.0  | 12.0 | 2.5    2.5  | 2.5    2.5  | 2.0  | 10.0 | 0.0  | 2.0  |
---------------------------------------------------------------------------------------


| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm3, ymmword ptr [rip]
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm0, ymmword ptr [rdx]
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm4, ymmword ptr [rsi]
|   1    |           |     |           |           |     | 1.0 |     |     |    | vandpd ymm2, ymm4, ymm3
|   1    |           |     |           |           |     | 1.0 |     |     |    | vandpd ymm3, ymm0, ymm3
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vcmppd ymm2, ymm3, ymm2, 0x2
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm3, ymmword ptr [rsi+0x20]
|   2    |           |     |           |           |     | 2.0 |     |     |    | vblendvpd ymm1, ymm0, ymm4, ymm2
|   2    |           |     |           |           |     | 2.0 |     |     |    | vblendvpd ymm4, ymm4, ymm0, ymm2
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm0, ymmword ptr [rdx+0x20]
|   2    |           |     |           |           |     | 2.0 |     |     |    | vblendvpd ymm5, ymm0, ymm3, ymm2
|   2    |           |     |           |           |     | 2.0 |     |     |    | vblendvpd ymm0, ymm3, ymm0, ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm3, ymm1, ymm4
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm2, ymm1, ymm3
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm2, ymm4
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm1, ymm0
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm0, ymm1, ymm5
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm2, ymm3, ymm0
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm1, ymm2, ymm3
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rdi], ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm0, ymm0, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm1, ymm2, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm3, ymm3, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm0, ymm3, ymm0
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rdi+0x20], ymm0

and here is the throughput analysis for the old df64_add

Throughput Analysis Report
--------------------------
Block Throughput: 20.00 Cycles       Throughput Bottleneck: Port1

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.0    0.0  | 20.0 | 2.0    2.0  | 2.0    2.0  | 2.0  | 0.0  | 0.0  | 2.0  |
---------------------------------------------------------------------------------------

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovapd ymm0, ymmword ptr [rsi]
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | vmovapd ymm1, ymmword ptr [rdx]
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovapd ymm3, ymmword ptr [rsi+0x20]
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm4, ymm0, ymm1
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | vmovapd ymm5, ymmword ptr [rdx+0x20]
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm2, ymm4, ymm0
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm1, ymm1, ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm2, ymm4, ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm0, ymm0, ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm2, ymm0, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm3, ymm5
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm6, ymm1, ymm3
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm5, ymm5, ymm6
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm6, ymm1, ymm6
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm2, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm3, ymm3, ymm6
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm2, ymm4, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm3, ymm3, ymm5
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm4, ymm2, ymm4
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm1, ymm1, ymm4
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm0, ymm1, ymm3
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm2, ymm0
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm2, ymm1, ymm2
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rdi], ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm0, ymm0, ymm2
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rdi+0x20], ymm0

A better solution would be if there were three-operand single-rounding-mode instructions besides FMA. It seems to me there should be single-rounding-mode instructions for

a + b + c
a * b + c //FMA - this is the only one in x86 so far
a * b * c
