BMI for generating masks with AVX512


Question


I was inspired by this link https://www.sigarch.org/simd-instructions-considered-harmful/ to look into how AVX512 performs. My idea was that the cleanup loop after the main loop could be removed using the AVX512 mask operations.

Here is the code I used:

void daxpy2(int n, double a, const double x[], double y[]) {
  __m512d av = _mm512_set1_pd(a);
  int r = n&7, n2 = n - r;
  for(int i=-n2; i<0; i+=8) {
    __m512d yv = _mm512_loadu_pd(&y[i+n2]);
    __m512d xv = _mm512_loadu_pd(&x[i+n2]);
    yv = _mm512_fmadd_pd(av, xv, yv);
    _mm512_storeu_pd(&y[i+n2], yv);
  }
  __m512d yv = _mm512_loadu_pd(&y[n2]);
  __m512d xv = _mm512_loadu_pd(&x[n2]);
  yv = _mm512_fmadd_pd(av, xv, yv);
  __mmask8 mask = (1 << r) -1;
  //__mmask8 mask = _bextr_u32(-1, 0, r);
  _mm512_mask_storeu_pd(&y[n2], mask, yv);
}


I thought using BMI1 and/or BMI2 instructions could generate masks with fewer instructions. However,

__mmask8 mask = _bextr_u32(-1, 0, r);

is not better than

__mmask8 mask = (1 << r) -1;

See https://godbolt.org/z/BFQCM3 and https://godbolt.org/z/tesmB_.


This appears to be because _bextr_u32 needs the extract length placed in bits 8-15 of its control operand, so it does a shift by 8 anyway.

Can the mask be generated with fewer instructions (e.g. using BMI or otherwise) or more optimally?


I have augmented the table in the link with my results for AVX512.

ISA                           | MIPS-32 | AVX2   | RV32V | AVX512 |
------------------------------|---------|--------|-------|--------|
Instructions (static)         |      22 |     29 |    13 |     28 |
Instructions per Main Loop    |       7 |     6* |    10 |     5* |
Bookkeeping Instructions      |      15 |     23 |     3 |     23 |
Results per Main Loop         |       2 |      4 |    64 |      8 |
Instructions (dynamic n=1000) |    3511 | 1517** |   163 |    645 |

*macro-op fusion will reduce the number of uops in the main loop by 1
** without the unnecessary cmp instructions it would only be 1250+ instructions.


I think if the authors of the link had counted from -n up to 0 instead of from 0 to n, they could have skipped the cmp instruction in the main loop as I have (see the assembly below), so for AVX it could have been 5 instructions in the main loop.


Here is the assembly with ICC19 and -O3 -xCOMMON-AVX512

daxpy2(int, double, double const*, double*):
    mov       eax, edi                                      #6.13
    and       eax, 7                                        #6.13
    movsxd    r9, edi                                       #6.25
    sub       r9, rax                                       #6.21
    mov       ecx, r9d                                      #7.14
    neg       ecx                                           #7.14
    movsxd    rcx, ecx                                      #7.14
    vbroadcastsd zmm16, xmm0                                #5.16
    lea       rdi, QWORD PTR [rsi+r9*8]                     #9.35
    lea       r8, QWORD PTR [rdx+r9*8]                      #8.35
    test      rcx, rcx                                      #7.20
    jge       ..B1.5        # Prob 36%                      #7.20
..B1.3:                         # Preds ..B1.1 ..B1.3
    vmovups   zmm17, ZMMWORD PTR [rdi+rcx*8]                #10.10
    vfmadd213pd zmm17, zmm16, ZMMWORD PTR [r8+rcx*8]        #10.10
    vmovups   ZMMWORD PTR [r8+rcx*8], zmm17                 #11.23
    add       rcx, 8                                        #7.23
    js        ..B1.3        # Prob 82%                      #7.20
..B1.5:                         # Preds ..B1.3 ..B1.1
    vmovups   zmm17, ZMMWORD PTR [rsi+r9*8]                 #15.8
    vfmadd213pd zmm16, zmm17, ZMMWORD PTR [rdx+r9*8]        #15.8
    mov       edx, -1                                       #17.19
    shl       eax, 8                                        #17.19
    bextr     eax, edx, eax                                 #17.19
    kmovw     k1, eax                                       #18.3
    vmovupd   ZMMWORD PTR [r8]{k1}, zmm16                   #18.3
    vzeroupper                                              #19.1
    ret                                                     #19.1

where

    add       rcx, 8
    js        ..B1.3


should macro-op fuse to one instruction. However, as Peter Cordes pointed out in this answer, js cannot fuse. The compiler could have generated jl instead, which would have fused.


I used Agner Fog's testp utility to measure core clocks (not reference clocks), instructions, and uops retired. I did this for SSE2 (actually AVX2 with FMA but with 128-bit vectors), AVX2, and AVX512, for three different loop variants:

v1 = for(int64_t i=0;   i<n;  i+=vec_size) // generates cmp instruction
v2 = for(int64_t i=-n2; i<0;  i+=vec_size) // no cmp but uses js
v3 = for(int64_t i=-n2; i!=0; i+=vec_size) // no cmp and uses jne

vec_size = 2 for SSE, 4 for AVX2, and 8 for AVX512

vec_size version   core cycle    instructions   uops
2        v1        895           3014           3524
2        v2        900           2518           3535
2        v3        870           2518           3035
4        v1        527           1513           1777
4        v2        520           1270           1777
4        v3        517           1270           1541
8        v1        285            765            910
8        v2        285            645            910
8        v3        285            645            790


Notice that the core clock count is not really a function of the loop version. It depends only on the number of loop iterations and is proportional to 2*n/vec_size:

SSE     2*1000/2=1000
AVX2    2*1000/4=500
AVX512  2*1000/8=250


The number of instructions does change from v1 to v2, but not between v2 and v3. For v1 it is proportional to 6*n/vec_size, and for v2 and v3 to 5*n/vec_size.


Finally, the number of uops is more or less the same for v1 and v2 but drops for v3. For v1 and v2 it is proportional to 7*n/vec_size, and for v3 to 6*n/vec_size.


Here is the result with IACA3 for vec_size=2

Throughput Analysis Report
--------------------------
Block Throughput: 1.49 Cycles       Throughput Bottleneck: FrontEnd
Loop Count:  50
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  0.5     0.0  |  0.5  |  1.5     1.0  |  1.5     1.0  |  1.0  |  0.0  |  0.0  |  0.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | vmovupd xmm1, xmmword ptr [r8+rax*8]
|   2      | 0.5         | 0.5  | 0.5     0.5 | 0.5     0.5 |      |      |      |      | vfmadd213pd xmm1, xmm2, xmmword ptr [rcx+rax*8]
|   2      |             |      | 0.5         | 0.5         | 1.0  |      |      |      | vmovups xmmword ptr [rcx+rax*8], xmm1
|   1*     |             |      |             |             |      |      |      |      | add rax, 0x2
|   0*F    |             |      |             |             |      |      |      |      | js 0xffffffffffffffe3
Total Num Of Uops: 6


IACA claims that js macro-fuses with add, which disagrees with Agner Fog and with the performance counters from the testp utility. See above: v2 is proportional to 7*n/vec_size and v3 to 6*n/vec_size, which I infer to mean that js does not macro-fuse.


I think the authors of the link, in addition to the number of instructions, should also have considered core cycles and perhaps uops.

Answer


You can save one instruction if you use the following BMI2 intrinsic:

  __mmask8 mask = _bzhi_u32(-1, r);


instead of __mmask8 mask = (1 << r) - 1;. See the Godbolt link.


The bzhi instruction zeros the high bits starting at a specified position. With register operands, bzhi has a latency of 1 cycle and a throughput of 2 per cycle.

