Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm?

Question

AMD CPUs handle 256b AVX instructions by decoding into two 128b operations. e.g. vaddps ymm0, ymm1,ymm1 on AMD Steamroller decodes to 2 macro-ops, with half the throughput of vaddps xmm0, xmm1,xmm1.

XOR-zeroing is a special case (no input dependency, and on Jaguar at least it avoids consuming a physical register file entry, and enables movdqa from that register to be eliminated at issue/rename, like Bulldozer does all the time even for non-zeroed regs). But is it detected early enough that vxorps ymm0,ymm0,ymm0 still decodes to only 1 macro-op, with performance equal to vxorps xmm0,xmm0,xmm0? (Unlike vxorps ymm3,ymm2,ymm1.)

Or does independence-detection happen later, after already decoding into two uops? Also, does vector xor-zeroing on AMD CPUs still use an execution port? On Intel CPUs, Nehalem needs a port, but Sandybridge-family handles it in the issue/rename stage.

Agner Fog's instruction tables don't list this special-case, and his microarch guide doesn't mention the number of uops.

This could mean vxorps xmm0,xmm0,xmm0 is a better way to implement _mm256_setzero_ps().
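
For illustration, a minimal C sketch of that idea (the function name is mine, not anything the compilers emit; _mm256_zextps128_ps256 may be missing from older compiler headers, in which case _mm256_castps128_ps256 is the usual fallback, with its upper half formally undefined):

#include <immintrin.h>

/* Hand-rolled sketch: zero a __m256 via the 128-bit idiom.  A VEX-encoded
   write to an xmm register clears bits 255:128 of the ymm, so no extra
   instruction is needed for the upper half. */
static inline __m256 zero256_via_xmm(void)
{
    __m128 z = _mm_setzero_ps();        /* should compile to vxorps xmm0,xmm0,xmm0 */
    return _mm256_zextps128_ps256(z);   /* implicit zero-extension: no instruction */
}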

For AVX512, _mm512_setzero_ps() also saves a byte by using only a VEX-coded zeroing idiom, rather than EVEX, when possible (i.e. for zmm0-15; vxorps xmm31,xmm31,xmm31 would still require an EVEX encoding). gcc/clang currently use xor-zeroing idioms of whatever register width they want, rather than always using AVX-128.

Reported as clang bug 32862 and gcc bug 80636. MSVC already uses xmm. Not yet reported to ICC, which also uses zmm regs for AVX512 zeroing. (Although Intel might not care to change, since there's currently no benefit on any Intel CPU, only AMD. If they ever release a low-power CPU that splits vectors in half, they might. Their current low-power design (Silvermont) doesn't support AVX at all, only SSE4.)

The only possible downside I know of to using an AVX-128 instruction for zeroing a 256b register is that it doesn't trigger warm-up of the 256b execution units on Intel CPUs, possibly defeating a C or C++ hack that tries to warm them up.

(256b vector instructions are slower for the first ~56k cycles after the first 256b instruction. See the Skylake section in Agner Fog's microarch pdf.) It's probably ok if calling a noinline function that returns _mm256_setzero_ps isn't a reliable way to warm up the execution units. (One that still works without AVX2, and avoids any loads that could cache-miss, is __m128 onebits = _mm_castsi128_ps(_mm_set1_epi8(0xff));
return _mm256_insertf128_ps(_mm256_castps128_ps256(onebits), onebits, 1); which should compile to pcmpeqd xmm0,xmm0,xmm0 / vinsertf128 ymm0,ymm0,xmm0,1. That's still pretty trivial for something you call once to warm up (or keep warm) the execution units well ahead of a critical loop. And if you want something that can inline, you probably need inline asm.)

I don't have AMD hardware so I can't test this.

If anyone has AMD hardware but doesn't know how to test, use perf counters to count cycles (and preferably m-ops or uops or whatever AMD calls them).
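
(On any CPU the generic events are a reasonable starting point, e.g. perf stat -e task-clock,cycles,instructions,branches -r4 ./vxor-zero; run perf list to find whatever retired-uop / dispatched-op event names your kernel exposes for the AMD uarch, since those names vary.)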

This is the NASM/YASM source I use to test short sequences:

section .text
global _start
_start:

    mov     ecx, 250000000

align 32  ; shouldn't matter, but just in case
.loop:

    dec     ecx  ; prevent macro-fusion by separating this from jnz, to avoid differences on CPUs that can't macro-fuse

%rep 6
    ;    vxorps  xmm1, xmm1, xmm1
    vxorps  ymm1, ymm1, ymm1
%endrep

    jnz .loop

    xor edi,edi
    mov eax,231    ; exit_group(0) on x86-64 Linux
    syscall

If you're not on Linux, maybe replace the stuff after the loop (the exit syscall) with a ret, and call the function from a C main() function.
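
A minimal sketch of that driver (test_loop is a name I made up; the asm would need global test_loop and a ret in place of the exit syscall, then link with e.g. cc main.c vxor-zero.o):

/* main.c: call the assembly test loop once and exit. */
extern void test_loop(void);

int main(void)
{
    test_loop();
    return 0;
}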

Assemble with nasm -felf64 vxor-zero.asm && ld -o vxor-zero vxor-zero.o to make a static binary. (Or use the asm-link script I posted in a Q&A about assembling static/dynamic binaries with/without libc).

Example output on an i7-6700k (Intel Skylake), at 3.9GHz. (IDK why my machine only goes up to 3.9GHz after it's been idle a few minutes; turbo up to 4.2 or 4.4GHz works normally right after boot.) Since I'm using perf counters, it doesn't actually matter what clock speed the machine is running at. No loads/stores or code-cache misses are involved, so the core-clock-cycle counts are constant regardless of how long the runs take in wall-clock time.

$ alias disas='objdump -drwC -Mintel'
$ b=vxor-zero;  asm-link "$b.asm" && disas "$b" && ocperf.py stat -etask-clock,cycles,instructions,branches,uops_issued.any,uops_retired.retire_slots,uops_executed.thread -r4 "./$b"
+ yasm -felf64 -Worphan-labels -gdwarf2 vxor-zero.asm
+ ld -o vxor-zero vxor-zero.o

vxor-zero:     file format elf64-x86-64


Disassembly of section .text:

0000000000400080 <_start>:
  400080:       b9 80 b2 e6 0e          mov    ecx,0xee6b280
  400085:       66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00    data16 data16 data16 data16 data16 nop WORD PTR cs:[rax+rax*1+0x0]
  400094:       66 66 66 2e 0f 1f 84 00 00 00 00 00     data16 data16 nop WORD PTR cs:[rax+rax*1+0x0]

00000000004000a0 <_start.loop>:
  4000a0:       ff c9                   dec    ecx
  4000a2:       c5 f4 57 c9             vxorps ymm1,ymm1,ymm1
  4000a6:       c5 f4 57 c9             vxorps ymm1,ymm1,ymm1
  4000aa:       c5 f4 57 c9             vxorps ymm1,ymm1,ymm1
  4000ae:       c5 f4 57 c9             vxorps ymm1,ymm1,ymm1
  4000b2:       c5 f4 57 c9             vxorps ymm1,ymm1,ymm1
  4000b6:       c5 f4 57 c9             vxorps ymm1,ymm1,ymm1
  4000ba:       75 e4                   jne    4000a0 <_start.loop>
  4000bc:       31 ff                   xor    edi,edi
  4000be:       b8 e7 00 00 00          mov    eax,0xe7
  4000c3:       0f 05                   syscall

(ocperf.py is a wrapper with symbolic names for CPU-specific events.  It prints the perf command it actually ran):

perf stat -etask-clock,cycles,instructions,branches,cpu/event=0xe,umask=0x1,name=uops_issued_any/,cpu/event=0xc2,umask=0x2,name=uops_retired_retire_slots/,cpu/event=0xb1,umask=0x1,name=uops_executed_thread/ -r4 ./vxor-zero

 Performance counter stats for './vxor-zero' (4 runs):

        128.379226      task-clock:u (msec)       #    0.999 CPUs utilized            ( +-  0.07% )
       500,072,741      cycles:u                  #    3.895 GHz                      ( +-  0.01% )
     2,000,000,046      instructions:u            #    4.00  insn per cycle           ( +-  0.00% )
       250,000,040      branches:u                # 1947.356 M/sec                    ( +-  0.00% )
     2,000,012,004      uops_issued_any:u         # 15578.938 M/sec                   ( +-  0.00% )
     2,000,008,576      uops_retired_retire_slots:u # 15578.911 M/sec                   ( +-  0.00% )
       500,009,692      uops_executed_thread:u    # 3894.787 M/sec                    ( +-  0.00% )

       0.128516502 seconds time elapsed                                          ( +-  0.09% )

The +- percentages are there because I ran perf stat -r4, so it ran my binary 4 times.

uops_issued_any and uops_retired_retire_slots are fused-domain (front-end throughput limit of 4 per clock on Skylake and Bulldozer-family). The counts are nearly identical because there are no branch mispredicts (which lead to speculatively-issued uops being discarded instead of retired).

uops_executed_thread is unfused-domain uops (execution ports). xor-zeroing doesn't need any on Intel CPUs, so it's just the dec and branch uops that actually execute. (If we changed the operands to vxorps so it wasn't just zeroing a register, e.g. vxorps ymm2, ymm1,ymm0 to write the output to a register that the next one doesn't read, uops executed will match the fused-domain uop count. And we'd see that the throughput limit is three vxorps per clock.)

2000M fused-domain uops issued in 500M clock cycles is 4.0 uops issued per clock: achieving the theoretical max front-end throughput. 6 * 250M is 1500M vxorps; add 250M each for dec and jnz and you get the 2000M total, so these counts match Skylake decoding vxorps ymm,ymm,ymm to 1 fused-domain uop.

With a different number of uops in the loop, things aren't as good: e.g. a 5-uop loop only issues at 3.75 uops per clock. I intentionally chose 8 uops per iteration here (given that vxorps decodes to a single uop).

The issue-width of Zen is 6 uops per cycle, so it may do better with a different amount of unrolling. (See this Q&A for more about short loops whose uop count isn't a multiple of the issue width, on Intel SnB-family uarches).

Answer

xor'ing a ymm register with itself generates two micro-ops on AMD Ryzen, while xor'ing an xmm register with itself generates only one micro-op. So the optimal way of zeroing a ymm register is to xor the corresponding xmm register with itself and rely on implicit zero extension.

The only processor that supports AVX512 today is Knights Landing. It uses a single micro-op for xor'ing a zmm register. It is very common to handle a new extension of vector size by splitting it in two. This happened with the transition from 64 to 128 bits and with the transition from 128 to 256 bits. It is more than likely that some processors in the future (from AMD or Intel or any other vendor) will split 512-bit vectors into two 256-bit vectors or even four 128-bit vectors. So the optimal way to zero a zmm register is to xor the 128-bit register with itself and rely on zero extension. And you are right, the 128-bit VEX-coded instruction is one or two bytes shorter.

Most processors recognize the xor of a register with itself to be independent of the previous value of the register.
