Optimize for fast multiplication but slow addition: FMA and doubledouble


Question


When I first got a Haswell processor I tried implementing FMA to determine the Mandelbrot set. The main algorithm is this:

intn n = 0;
for(int32_t i=0; i<maxiter; i++) {
    floatn x2 = square(x), y2 = square(y); //square(x) = x*x
    floatn r2 = x2 + y2;
    booln mask = r2<cut; //booln is in the floating-point domain, not the integer domain
    if(!horizontal_or(mask)) break; //_mm256_testz_pd(mask)
    n -= mask;
    floatn t = x*y; mul2(t); //mul2(t): t*=2
    x = x2 - y2 + cx;
    y = t + cy;
}

This determines whether n pixels are in the Mandelbrot set. For double-precision floating point it runs over four pixels at a time (floatn = __m256d, intn = __m256i). Each iteration requires four SIMD floating-point multiplications and four SIMD floating-point additions.
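
Roughly, the wrappers above map onto AVX/AVX2 intrinsics like this (a simplified sketch, not the exact vector-class implementation):

#include <immintrin.h>

// Sketch of the wrappers for the four-pixel double case
// (floatn = __m256d, intn = __m256i, booln = an __m256d compare mask).
static inline __m256d square(__m256d x) { return _mm256_mul_pd(x, x); }   // x*x
static inline void    mul2(__m256d &t)  { t = _mm256_add_pd(t, t); }      // t *= 2

// True if any lane of the comparison mask is still set.
static inline bool horizontal_or(__m256d mask) {
    return !_mm256_testz_pd(mask, mask);
}

// A true lane compares as all ones, i.e. -1 as an integer, so subtracting the
// mask adds 1 to the iteration count of every still-active pixel (AVX2).
static inline void add_mask(__m256i &n, __m256d mask) {
    n = _mm256_sub_epi64(n, _mm256_castpd_si256(mask));
}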

Then I modified this to work with FMA like this

intn n = 0; 
for(int32_t i=0; i<maxiter; i++) {
    floatn r2 = mul_add(x,x,y*y);
    booln mask = r2<cut;
    if(!horizontal_or(mask)) break;
    add_mask(n,mask);
    floatn t = x*y;
    x = mul_sub(x,x, mul_sub(y,y,cx));
    y = mul_add(2.0f,t,cy);
}

where mul_add calls _mm256_fmadd_pd and mul_sub calls _mm256_fmsub_pd. This method uses four SIMD FMA operations and two SIMD multiplications, which is two fewer arithmetic operations than without FMA. Additionally, FMA and multiplication can use two ports while addition can use only one.
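
For reference, mul_add and mul_sub are just thin shims over the FMA3 intrinsics, along these lines (a sketch for the __m256d case):

#include <immintrin.h>

// Fused operations: a*b + c and a*b - c, each with a single rounding.
static inline __m256d mul_add(__m256d a, __m256d b, __m256d c) {
    return _mm256_fmadd_pd(a, b, c);
}
static inline __m256d mul_sub(__m256d a, __m256d b, __m256d c) {
    return _mm256_fmsub_pd(a, b, c);
}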

To make my tests less biased I zoomed into a region which is entirely in the Mandelbrot set, so all the pixels run for the full maxiter iterations. In this case the method using FMA is about 27% faster. That's certainly an improvement, but going from SSE to AVX doubled my performance, so I was hoping for maybe another factor of two with FMA.

But then I found this answer regarding FMA (https://stackoverflow.com/questions/13292013/is-there-any-scenario-where-function-fma-in-libc-can-be-used/18239795#18239795), which says:

The important aspect of the fused-multiply-add instruction is the (virtually) infinite precision of the intermediate result. This helps with performance, but not so much because two operations are encoded in a single instruction — It helps with performance because the virtually infinite precision of the intermediate result is sometimes important, and very expensive to recover with ordinary multiplication and addition when this level of precision is really what the programmer is after.

and later gives an example of double*double to double-double multiplication

high = a * b; /* double-precision approximation of the real product */
low = fma(a, b, -high); /* remainder of the real product */

From this, I concluded that my FMA implementation was not optimal, so I decided to implement SIMD double-double. I based the implementation on the paper Extended-Precision Floating-Point Numbers for GPU Computation. The paper is for double-float, so I modified it for double-double. Additionally, instead of packing one double-double value into a single SIMD register, I pack four double-double values into one AVX register of high parts and one AVX register of low parts.

For the Mandelbrot set what I really need is double-double multiplication and addition. In that paper these are the df64_add and df64_mult functions. The image below shows the assembly for my df64_mult function with software FMA (https://stackoverflow.com/questions/28630864/how-is-fma-implemented/30121217#30121217) on the left and hardware FMA on the right. This clearly shows that hardware FMA is a big improvement for double-double multiplication.
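
To show why FMA matters for the multiplication, here is a sketch of a df64_mult-style double-double multiplication for the hi/lo packing described above (simplified pseudo-code in the same spirit, not the exact df64_mult; the names doubledoublen and two_prod are mine, and the operands are assumed normalized):

#include <immintrin.h>

// Four double-double values: one AVX register of high parts, one of low parts.
struct doubledoublen { __m256d hi, lo; };

// Exact product of two doubles with FMA: hi = fl(a*b), lo = a*b - hi exactly.
static inline void two_prod(__m256d a, __m256d b, __m256d &hi, __m256d &lo) {
    hi = _mm256_mul_pd(a, b);
    lo = _mm256_fmsub_pd(a, b, hi);        // fma(a, b, -hi)
}

// Double-double multiplication in the style of df64_mult (sketch).
static inline doubledoublen dd_mul(doubledoublen a, doubledoublen b) {
    __m256d p, e;
    two_prod(a.hi, b.hi, p, e);
    // Fold in the cross terms; a.lo*b.lo is below the result precision.
    e = _mm256_fmadd_pd(a.hi, b.lo, e);
    e = _mm256_fmadd_pd(a.lo, b.hi, e);
    // Renormalize with a fast two-sum (|e| is small relative to |p|).
    __m256d hi = _mm256_add_pd(p, e);
    __m256d lo = _mm256_add_pd(_mm256_sub_pd(p, hi), e);
    return { hi, lo };
}

Without hardware FMA the low part of the exact product has to be recovered with Dekker-style splitting, which is why the software-FMA version is so much longer.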

So how does hardware FMA perform in the double-double Mandelbrot set calculation? The answer is that it's only about 15% faster than with software FMA. That's much less than I hoped for. The double-double Mandelbrot calculation needs four double-double additions and four double-double multiplications (x*x, y*y, x*y, and 2*(x*y)). However, the 2*(x*y) multiplication is trivial for double-double (see https://stackoverflow.com/questions/7720668/fast-multiplication-division-by-2-for-floats-and-doubles-c-c/30453842#30453842), so this multiplication can be ignored in the cost. Therefore, the reason I think the improvement from hardware FMA is so small is that the calculation is dominated by the slow double-double addition (see the assembly below).
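
To be explicit about why 2*(x*y) is essentially free: multiplying a double-double by a power of two is exact, so both components are just scaled and no renormalization is needed (sketch, same layout as above):

// Multiplication by a power of two is exact in each component.
static inline doubledoublen mul_pwr2(doubledoublen a, double p2) {
    __m256d s = _mm256_set1_pd(p2);        // e.g. 2.0
    return { _mm256_mul_pd(a.hi, s), _mm256_mul_pd(a.lo, s) };
}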

It used to be that multiplication was slower than addition (and programmers used several tricks to avoid multiplication), but with Haswell it seems to be the other way around, not only because of FMA but also because multiplication can use two ports while addition can use only one.

So my questions (finally) are:

  1. How does one optimize when addition is slow compared to multiplication?
  2. Is there an algebraic way to change my algorithm to use more multiplications and fewer additions? I know there are methods to do the reverse, e.g. (x+y)*(x+y) - (x*x+y*y) = 2*x*y, which uses two more additions for one less multiplication.
  3. Is there a way to simplify the df64_add function (e.g. using FMA)?

In case anyone is wondering, the double-double method is about ten times slower than plain double. That's not so bad, I think, since if there were a hardware quad-precision type it would likely be at least twice as slow as double, so my software method is about five times slower than what I would expect from hardware if it existed.

df64_add assembly

vmovapd 8(%rsp), %ymm0
movq    %rdi, %rax
vmovapd 72(%rsp), %ymm1
vmovapd 40(%rsp), %ymm3
vaddpd  %ymm1, %ymm0, %ymm4
vmovapd 104(%rsp), %ymm5
vsubpd  %ymm0, %ymm4, %ymm2
vsubpd  %ymm2, %ymm1, %ymm1
vsubpd  %ymm2, %ymm4, %ymm2
vsubpd  %ymm2, %ymm0, %ymm0
vaddpd  %ymm1, %ymm0, %ymm2
vaddpd  %ymm5, %ymm3, %ymm1
vsubpd  %ymm3, %ymm1, %ymm6
vsubpd  %ymm6, %ymm5, %ymm5
vsubpd  %ymm6, %ymm1, %ymm6
vaddpd  %ymm1, %ymm2, %ymm1
vsubpd  %ymm6, %ymm3, %ymm3
vaddpd  %ymm1, %ymm4, %ymm2
vaddpd  %ymm5, %ymm3, %ymm3
vsubpd  %ymm4, %ymm2, %ymm4
vsubpd  %ymm4, %ymm1, %ymm1
vaddpd  %ymm3, %ymm1, %ymm0
vaddpd  %ymm0, %ymm2, %ymm1
vsubpd  %ymm2, %ymm1, %ymm2
vmovapd %ymm1, (%rdi)
vsubpd  %ymm2, %ymm0, %ymm0
vmovapd %ymm0, 32(%rdi)
vzeroupper
ret

Solution

To answer my third question, I found a faster solution for double-double addition: an alternative definition given in the paper Implementation of float-float operators on graphics hardware.

Theorem 5 (Add22 theorem) Let ah+al and bh+bl be the float-float arguments of the following
algorithm:
Add22 (ah, al, bh, bl)
1 r = ah ⊕ bh
2 if |ah| ≥ |bh| then
3     s = (((ah ⊖ r) ⊕ bh) ⊕ bl) ⊕ al
4 else
5     s = (((bh ⊖ r) ⊕ ah) ⊕ al) ⊕ bl
6 (rh, rl) = add12(r, s)
7 return (rh, rl)

Here is how I implemented this (pseudo-code):

static inline doubledoublen add22(doubledoublen const &a, doubledoublen const &b) {
    doublen aa,ab,ah,bh,al,bl;
    booln mask;
    aa = abs(a.hi);                //_mm256_and_pd
    ab = abs(b.hi); 
    mask = aa >= ab;               //_mm256_cmple_pd
    // z = select(cut,x,y) is a SIMD version of z = cut ? x : y;
    ah = select(mask,a.hi,b.hi);   //_mm256_blendv_pd
    bh = select(mask,b.hi,a.hi);
    al = select(mask,a.lo,b.lo);
    bl = select(mask,b.lo,a.lo);

    doublen r, s;
    r = ah + bh;
    s = (((ah - r) + bh) + bl ) + al;
    return two_sum(r,s);
}

This definition of Add22 uses 11 additions instead of 20, but it requires some additional code to determine whether |ah| >= |bh|. Here is a discussion on how to implement SIMD minmag and maxmag functions. Fortunately, most of the additional code does not use port 1. Now only 12 instructions go to port 1 instead of 20.
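
For completeness, the two_sum at the end (add12 in the theorem) is implemented here as the ordinary six-operation Knuth two-sum; those six operations plus r, s, and the compare account for the 12 port-1 instructions. A sketch using the same hi/lo layout as above:

// Knuth two-sum: s is the rounded sum, e the exact rounding error, with no
// assumption about which operand is larger in magnitude (six add/sub ops).
static inline doubledoublen two_sum(__m256d a, __m256d b) {
    __m256d s  = _mm256_add_pd(a, b);
    __m256d bb = _mm256_sub_pd(s, a);
    __m256d e  = _mm256_add_pd(_mm256_sub_pd(a, _mm256_sub_pd(s, bb)),
                               _mm256_sub_pd(b, bb));
    return { s, e };
}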

Here is a throughput analysis from IACA (https://stackoverflow.com/questions/26021337/what-is-iaca-and-how-do-i-use-it/26021338#26021338) for the new Add22:

Throughput Analysis Report
--------------------------
Block Throughput: 12.05 Cycles       Throughput Bottleneck: Port1

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.0    0.0  | 12.0 | 2.5    2.5  | 2.5    2.5  | 2.0  | 10.0 | 0.0  | 2.0  |
---------------------------------------------------------------------------------------


| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm3, ymmword ptr [rip]
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm0, ymmword ptr [rdx]
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm4, ymmword ptr [rsi]
|   1    |           |     |           |           |     | 1.0 |     |     |    | vandpd ymm2, ymm4, ymm3
|   1    |           |     |           |           |     | 1.0 |     |     |    | vandpd ymm3, ymm0, ymm3
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vcmppd ymm2, ymm3, ymm2, 0x2
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm3, ymmword ptr [rsi+0x20]
|   2    |           |     |           |           |     | 2.0 |     |     |    | vblendvpd ymm1, ymm0, ymm4, ymm2
|   2    |           |     |           |           |     | 2.0 |     |     |    | vblendvpd ymm4, ymm4, ymm0, ymm2
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm0, ymmword ptr [rdx+0x20]
|   2    |           |     |           |           |     | 2.0 |     |     |    | vblendvpd ymm5, ymm0, ymm3, ymm2
|   2    |           |     |           |           |     | 2.0 |     |     |    | vblendvpd ymm0, ymm3, ymm0, ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm3, ymm1, ymm4
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm2, ymm1, ymm3
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm2, ymm4
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm1, ymm0
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm0, ymm1, ymm5
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm2, ymm3, ymm0
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm1, ymm2, ymm3
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rdi], ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm0, ymm0, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm1, ymm2, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm3, ymm3, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm0, ymm3, ymm0
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rdi+0x20], ymm0

and here is the throughput analysis for the old version:

Throughput Analysis Report
--------------------------
Block Throughput: 20.00 Cycles       Throughput Bottleneck: Port1

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.0    0.0  | 20.0 | 2.0    2.0  | 2.0    2.0  | 2.0  | 0.0  | 0.0  | 2.0  |
---------------------------------------------------------------------------------------

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovapd ymm0, ymmword ptr [rsi]
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | vmovapd ymm1, ymmword ptr [rdx]
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovapd ymm3, ymmword ptr [rsi+0x20]
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm4, ymm0, ymm1
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | vmovapd ymm5, ymmword ptr [rdx+0x20]
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm2, ymm4, ymm0
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm1, ymm1, ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm2, ymm4, ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm0, ymm0, ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm2, ymm0, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm3, ymm5
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm6, ymm1, ymm3
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm5, ymm5, ymm6
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm6, ymm1, ymm6
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm2, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm3, ymm3, ymm6
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm2, ymm4, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm3, ymm3, ymm5
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm4, ymm2, ymm4
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm1, ymm1, ymm4
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm0, ymm1, ymm3
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm2, ymm0
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm2, ymm1, ymm2
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rdi], ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm0, ymm0, ymm2
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rdi+0x20], ymm0

A better solution would be if there were three-operand single-rounding-mode instructions besides FMA. It seems to me there should be single-rounding-mode instructions for

a + b + c
a * b + c //FMA - this is the only one in x86 so far
a * b * c
