BMI for generating masks with AVX512
Question
I was inspired by this link https://www.sigarch.org/simd-instructions-considered-harmful/ to look into how AVX512 performs. My idea was that the clean-up loop after the main loop could be removed using the AVX512 mask operations.
Here is the code I used:
void daxpy2(int n, double a, const double x[], double y[]) {
    __m512d av = _mm512_set1_pd(a);
    int r = n & 7, n2 = n - r;
    for (int i = -n2; i < 0; i += 8) {
        __m512d yv = _mm512_loadu_pd(&y[i + n2]);
        __m512d xv = _mm512_loadu_pd(&x[i + n2]);
        yv = _mm512_fmadd_pd(av, xv, yv);
        _mm512_storeu_pd(&y[i + n2], yv);
    }
    // Tail: note these unmasked loads read a full 8 doubles, so they can
    // read past the end of x and y; only the store is masked.
    __m512d yv = _mm512_loadu_pd(&y[n2]);
    __m512d xv = _mm512_loadu_pd(&x[n2]);
    yv = _mm512_fmadd_pd(av, xv, yv);
    __mmask8 mask = (1 << r) - 1;
    //__mmask8 mask = _bextr_u32(-1, 0, r);
    _mm512_mask_storeu_pd(&y[n2], mask, yv);
}
I thought using BMI1 and/or BMI2 instructions could generate the mask with fewer instructions. However,
__mmask8 mask = _bextr_u32(-1, 0, r);

is no better than

__mmask8 mask = (1 << r) - 1;

See https://godbolt.org/z/BFQCM3 and https://godbolt.org/z/tesmB_.
This appears to be because _bextr_u32 does a shift by 8 anyway.
Can the mask be generated with fewer instructions (e.g. with BMI or otherwise) or more optimally?
I have augmented the table in the link with my results for AVX512.
ISA                            | MIPS-32 | AVX2   | RV32V | AVX512
-------------------------------|---------|--------|-------|-------
Instructions (static)          |      22 |     29 |    13 |     28
Instructions per Main Loop     |       7 |     6* |    10 |     5*
Bookkeeping Instructions       |      15 |     23 |     3 |     23
Results per Main Loop          |       2 |      4 |    64 |      8
Instructions (dynamic, n=1000) |    3511 | 1517** |   163 |    645
*macro-op fusion will reduce the number of uops in the main loop by 1
** without the unnecessary cmp instructions it would only be 1250+ instructions.
I think that if the authors of the link had counted from -n up to 0 instead of from 0 to n, they could have skipped the cmp instruction in the main loop as I have (see the assembly below), so for AVX it could have been 5 instructions in the main loop.
Here is the assembly with ICC19 and -O3 -xCOMMON-AVX512:
daxpy2(int, double, double const*, double*):
mov eax, edi #6.13
and eax, 7 #6.13
movsxd r9, edi #6.25
sub r9, rax #6.21
mov ecx, r9d #7.14
neg ecx #7.14
movsxd rcx, ecx #7.14
vbroadcastsd zmm16, xmm0 #5.16
lea rdi, QWORD PTR [rsi+r9*8] #9.35
lea r8, QWORD PTR [rdx+r9*8] #8.35
test rcx, rcx #7.20
jge ..B1.5 # Prob 36% #7.20
..B1.3: # Preds ..B1.1 ..B1.3
vmovups zmm17, ZMMWORD PTR [rdi+rcx*8] #10.10
vfmadd213pd zmm17, zmm16, ZMMWORD PTR [r8+rcx*8] #10.10
vmovups ZMMWORD PTR [r8+rcx*8], zmm17 #11.23
add rcx, 8 #7.23
js ..B1.3 # Prob 82% #7.20
..B1.5: # Preds ..B1.3 ..B1.1
vmovups zmm17, ZMMWORD PTR [rsi+r9*8] #15.8
vfmadd213pd zmm16, zmm17, ZMMWORD PTR [rdx+r9*8] #15.8
mov edx, -1 #17.19
shl eax, 8 #17.19
bextr eax, edx, eax #17.19
kmovw k1, eax #18.3
vmovupd ZMMWORD PTR [r8]{k1}, zmm16 #18.3
vzeroupper #19.1
ret #19.1
where

add rcx, 8
js ..B1.3

should macro-op fuse into one instruction. However, as Peter Cordes pointed out in this answer, js cannot fuse. The compiler could have generated jl instead, which would have fused.
I used Agner Fog's testp utility to get core clocks (not reference clocks), instructions, and uops retired. I did this for SSE2 (actually AVX2 with FMA but with 128-bit vectors), AVX2, and AVX512, for three different loop variants:
v1 = for(int64_t i=0; i<n; i+=vec_size) // generates cmp instruction
v2 = for(int64_t i=-n2; i<0; i+=vec_size) // no cmp but uses js
v3 = for(int64_t i=-n2; i!=0; i+=vec_size) // no cmp and uses jne
vec_size = 2 for SSE, 4 for AVX2, and 8 for AVX512
vec_size | version | core cycles | instructions | uops
---------|---------|-------------|--------------|-----
    2    |   v1    |     895     |     3014     | 3524
    2    |   v2    |     900     |     2518     | 3535
    2    |   v3    |     870     |     2518     | 3035
    4    |   v1    |     527     |     1513     | 1777
    4    |   v2    |     520     |     1270     | 1777
    4    |   v3    |     517     |     1270     | 1541
    8    |   v1    |     285     |      765     |  910
    8    |   v2    |     285     |      645     |  910
    8    |   v3    |     285     |      645     |  790
Notice that core clocks is not really a function of the loop version; it only depends on the number of loop iterations. It is proportional to 2*n/vec_size:
SSE 2*1000/2=1000
AVX2 2*1000/4=500
AVX512 2*1000/8=250
The number of instructions does change from v1 to v2, but not between v2 and v3. For v1 it is proportional to 6*n/vec_size, and for v2 and v3 to 5*n/vec_size.
Finally, the number of uops is more or less the same for v1 and v2 but drops for v3. For v1 and v2 it is proportional to 7*n/vec_size, and for v3 to 6*n/vec_size.
Here is the result with IACA3 for vec_size=2:
Throughput Analysis Report
--------------------------
Block Throughput: 1.49 Cycles Throughput Bottleneck: FrontEnd
Loop Count: 50
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 0.5 0.0 | 0.5 | 1.5 1.0 | 1.5 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovupd xmm1, xmmword ptr [r8+rax*8]
| 2 | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vfmadd213pd xmm1, xmm2, xmmword ptr [rcx+rax*8]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rcx+rax*8], xmm1
| 1* | | | | | | | | | add rax, 0x2
| 0*F | | | | | | | | | js 0xffffffffffffffe3
Total Num Of Uops: 6
IACA claims that js macro-fuses with add, which disagrees with Agner and with the performance counters from the testp utility. See above: the uop count for v2 is proportional to 7*n/vec_size and for v3 to 6*n/vec_size, which I infer to mean that js does not macro-fuse.
I think the authors of the link, in addition to the number of instructions, should also have considered core cycles and maybe uops.
Answer
You can save one instruction if you use the following BMI2 intrinsic:

__mmask8 mask = _bzhi_u32(-1, r);

instead of

__mmask8 mask = (1 << r) - 1;

See this Godbolt link.
The bzhi instruction zeros the high bits starting at a specified bit position. With register operands, bzhi has a latency of 1 cycle and a throughput of 2 per cycle.