Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm?

Question

AMD CPUs handle 256b AVX instructions by decoding them into two 128b operations. For example, vaddps ymm0, ymm1, ymm1 on AMD Steamroller decodes to 2 macro-ops, with half the throughput of vaddps xmm0, xmm1, xmm1.

XOR-zeroing is a special case (no input dependency, and on Jaguar at least it avoids consuming a physical register file entry, and enables movdqa from that register to be eliminated at issue/rename, like Bulldozer does all the time even for non-zeroed regs). But is it detected early enough that vxorps ymm0,ymm0,ymm0 still decodes to only 1 macro-op, with performance equal to vxorps xmm0,xmm0,xmm0? (Unlike vxorps ymm3, ymm2, ymm1.)
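
As a concrete illustration of that special case (just a sketch based on the behaviour described above; the register choice is arbitrary):

    vxorps  xmm0, xmm0, xmm0   ; zeroing idiom: no dependency on the old value of xmm0
    vmovdqa xmm1, xmm0         ; copy of a known-zero register: can be eliminated at
                               ; issue/rename on Jaguar (Bulldozer eliminates reg-reg
                               ; moves even from non-zeroed registers)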

Or does independence-detection happen later, after the instruction has already been decoded into two uops? Also, does vector xor-zeroing on AMD CPUs still use an execution port? On Intel CPUs, Nehalem needs a port, but Sandybridge-family handles it in the issue/rename stage.

Agner Fog's instruction tables don't list this special case, and his microarch guide doesn't mention the number of uops.

This could mean vxorps xmm0,xmm0,xmm0 is a better way to implement _mm256_setzero_ps().
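
In asm terms, the idea is that the compiler would emit the 128-bit form and rely on VEX implicit zero-extension of the upper lanes (a sketch; whether the ymm form really costs 2 macro-ops on these AMD uarches is exactly what's being asked):

    vxorps  xmm0, xmm0, xmm0   ; VEX-encoded: also zeroes bits 255:128 of ymm0
                               ; (and bits 511:256 of zmm0 on AVX512 CPUs)
    ; instead of:
    vxorps  ymm0, ymm0, ymm0   ; same architectural result, but potentially 2 macro-ops
                               ; on CPUs that split 256b ops into two 128b halves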

For AVX512, _mm512_setzero_ps() also saves a byte by using only a VEX-coded zeroing idiom, rather than EVEX, when possible (i.e. for zmm0-15; vxorps xmm31,xmm31,xmm31 would still require an EVEX encoding). gcc/clang currently use xor-zeroing idioms of whatever register width they want, rather than always using AVX-128.

Reported as clang bug 32862 and gcc bug 80636. MSVC already uses xmm. Not yet reported to ICC, which also uses zmm regs for AVX512 zeroing. (Although Intel might not care to change, since there's currently no benefit on any Intel CPU, only AMD. If they ever release a low-power CPU that splits vectors in half, they might. Their current low-power design (Silvermont) doesn't support AVX at all, only SSE4.)

The only possible downside I know of to using an AVX-128 instruction for zeroing a 256b register is that it doesn't trigger warm-up of the 256b execution units on Intel CPUs, possibly defeating a C or C++ hack that tries to warm them up.

(256b vector instructions are slower for the first ~56k cycles after the first 256b instruction; see the Skylake section in Agner Fog's microarch pdf.) It's probably ok if calling a noinline function that returns _mm256_setzero_ps isn't a reliable way to warm up the execution units. (One that still works without AVX2, and avoids any loads (that could cache miss), is __m128 onebits = _mm_castsi128_ps(_mm_set1_epi8(0xff)); followed by return _mm256_insertf128_ps(_mm256_castps128_ps256(onebits), onebits, 1);, which should compile to vpcmpeqd xmm0,xmm0,xmm0 / vinsertf128 ymm0,ymm0,xmm0,1. That's still pretty trivial for something you call once to warm up (or keep warm) the execution units well ahead of a critical loop. And if you want something that can inline, you probably need inline asm.)
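
For reference, that intrinsic sequence wrapped as a stand-alone function might look like this in NASM (hypothetical function name; it's just the asm given above plus a ret):

global warm_up_256
warm_up_256:                            ; __m256 warm_up_256(void)
    vpcmpeqd    xmm0, xmm0, xmm0        ; all-ones without any load (so no possible cache miss)
    vinsertf128 ymm0, ymm0, xmm0, 1     ; a real 256b uop, to trigger 256b warm-up
    ret                                 ; returns all-ones in ymm0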

I don't have AMD hardware so I can't test this.

If anyone has AMD hardware but doesn't know how to test, use perf counters to count cycles (and preferably m-ops or uops or whatever AMD calls them).

This is the NASM/YASM source I use to test short sequences:

section .text
global _start
_start:

    mov     ecx, 250000000

align 32  ; shouldn't matter, but just in case
.loop:

    dec     ecx  ; prevent macro-fusion by separating this from jnz, to avoid differences on CPUs that can't macro-fuse

%rep 6
    ;    vxorps  xmm1, xmm1, xmm1
    vxorps  ymm1, ymm1, ymm1
%endrep

    jnz .loop

    xor edi,edi
    mov eax,231    ; exit_group(0) on x86-64 Linux
    syscall

If you're not on Linux, maybe replace the stuff after the loop (the exit syscall) with a ret, and call the function from a C main() function.
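
For example, a callable variant might look like the following sketch (the function name is made up; declare it as void vxor_zero_test(void); on the C side and link the object file with your C program):

global vxor_zero_test
vxor_zero_test:
    mov     ecx, 250000000
align 32
.loop:
    dec     ecx
%rep 6
    vxorps  ymm1, ymm1, ymm1
%endrep
    jnz     .loop
    vzeroupper            ; avoid AVX/SSE transition penalties in the caller
    ret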

Assemble with nasm -felf64 vxor-zero.asm && ld -o vxor-zero vxor-zero.o to make a static binary. (Or use the asm-link script I posted in a Q&A about assembling static/dynamic binaries with/without libc).

Example output on an i7-6700k (Intel Skylake), at 3.9GHz. (IDK why my machine only goes up to 3.9GHz after it's been idle a few minutes. Turbo up to 4.2 or 4.4GHz works normally right after boot). Since I'm using perf counters, it doesn't actually matter what clock speed the machine is running. No loads/stores or code-cache misses are involved, so the number of core-clock-cycles for everything is constant regardless of how long they are.

$ alias disas='objdump -drwC -Mintel'
$ b=vxor-zero;  asm-link "$b.asm" && disas "$b" && ocperf.py stat -etask-clock,cycles,instructions,branches,uops_issued.any,uops_retired.retire_slots,uops_executed.thread -r4 "./$b"
+ yasm -felf64 -Worphan-labels -gdwarf2 vxor-zero.asm
+ ld -o vxor-zero vxor-zero.o

vxor-zero:     file format elf64-x86-64


Disassembly of section .text:

0000000000400080 <_start>:
  400080:       b9 80 b2 e6 0e          mov    ecx,0xee6b280
  400085:       66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00    data16 data16 data16 data16 data16 nop WORD PTR cs:[rax+rax*1+0x0]
  400094:       66 66 66 2e 0f 1f 84 00 00 00 00 00     data16 data16 nop WORD PTR cs:[rax+rax*1+0x0]

00000000004000a0 <_start.loop>:
  4000a0:       ff c9                   dec    ecx
  4000a2:       c5 f4 57 c9             vxorps ymm1,ymm1,ymm1
  4000a6:       c5 f4 57 c9             vxorps ymm1,ymm1,ymm1
  4000aa:       c5 f4 57 c9             vxorps ymm1,ymm1,ymm1
  4000ae:       c5 f4 57 c9             vxorps ymm1,ymm1,ymm1
  4000b2:       c5 f4 57 c9             vxorps ymm1,ymm1,ymm1
  4000b6:       c5 f4 57 c9             vxorps ymm1,ymm1,ymm1
  4000ba:       75 e4                   jne    4000a0 <_start.loop>
  4000bc:       31 ff                   xor    edi,edi
  4000be:       b8 e7 00 00 00          mov    eax,0xe7
  4000c3:       0f 05                   syscall

(ocperf.py is a wrapper with symbolic names for CPU-specific events.  It prints the perf command it actually ran):

perf stat -etask-clock,cycles,instructions,branches,cpu/event=0xe,umask=0x1,name=uops_issued_any/,cpu/event=0xc2,umask=0x2,name=uops_retired_retire_slots/,cpu/event=0xb1,umask=0x1,name=uops_executed_thread/ -r4 ./vxor-zero

 Performance counter stats for './vxor-zero' (4 runs):

        128.379226      task-clock:u (msec)       #    0.999 CPUs utilized            ( +-  0.07% )
       500,072,741      cycles:u                  #    3.895 GHz                      ( +-  0.01% )
     2,000,000,046      instructions:u            #    4.00  insn per cycle           ( +-  0.00% )
       250,000,040      branches:u                # 1947.356 M/sec                    ( +-  0.00% )
     2,000,012,004      uops_issued_any:u         # 15578.938 M/sec                   ( +-  0.00% )
     2,000,008,576      uops_retired_retire_slots:u # 15578.911 M/sec                   ( +-  0.00% )
       500,009,692      uops_executed_thread:u    # 3894.787 M/sec                    ( +-  0.00% )

       0.128516502 seconds time elapsed                                          ( +-  0.09% )

The +- percentages are because I ran perf stat -r4, so it ran my binary 4 times.

uops_issued_any and uops_retired_retire_slots are fused-domain (front-end throughput limit of 4 per clock on Skylake and Bulldozer-family). The counts are nearly identical because there are no branch mispredicts (which lead to speculatively-issued uops being discarded instead of retired).

uops_executed_thread is unfused-domain uops (execution ports). Xor-zeroing doesn't need any on Intel CPUs, so it's just the dec and branch uops that actually execute. (If we changed the operands to vxorps so it wasn't just zeroing a register, e.g. vxorps ymm2, ymm1, ymm0 to write the output to a register that the next instruction doesn't read, the executed-uop count would match the fused-domain uop count, and we'd see that the throughput limit is three vxorps per clock.)
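
In other words, the %rep block in the loop above would become something like this hypothetical variant:

%rep 6
    vxorps  ymm2, ymm1, ymm0   ; not a zeroing idiom: reads ymm1 and ymm0, writes ymm2,
                               ; so each one needs an execution port, but there is still
                               ; no loop-carried dependency chain
%endrep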

2000M fused-domain uops issued in 500M clock cycles is 4.0 uops issued per clock: achieving the theoretical max front-end throughput. 6 * 250M is 1500M vxorps uops; add the 250M dec and 250M jnz uops and you get exactly 2000M, so these counts match Skylake decoding vxorps ymm,ymm,ymm to 1 fused-domain uop.

With a different number of uops in the loop, things aren't as good. For example, a 5-uop loop only issues at 3.75 uops per clock. I intentionally chose this loop to be 8 uops (assuming vxorps decodes to a single uop).

The issue-width of Zen is 6 uops per cycle, so it may do better with a different amount of unrolling. (See this Q&A for more about short loops whose uop count isn't a multiple of the issue width, on Intel SnB-family uarches).
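
If someone tests this on Zen, a variant whose per-iteration uop count is a multiple of 6 is easy to make by changing the repeat count (a sketch; whether it really issues at 6 uops per clock there is part of what would need measuring):

.loop:
    dec     ecx
%rep 10
    vxorps  ymm1, ymm1, ymm1
%endrep
    jnz     .loop            ; 1 + 10 + 1 = 12 ops per iteration (if vxorps ymm is a
                             ; single op), a multiple of Zen's 6-wide issue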

Answer

Xor'ing a ymm register with itself generates two micro-ops on AMD Ryzen, while xor'ing an xmm register with itself generates only one micro-op. So the optimal way of zeroing a ymm register is to xor the corresponding xmm register with itself and rely on implicit zero-extension.

The only processor that supports AVX512 today is Knights Landing. It uses a single micro-op for xor'ing a zmm register. It is very common to handle a new extension of vector size by splitting it in two. This happened with the transition from 64 to 128 bits and with the transition from 128 to 256 bits. It is more than likely that some processors in the future (from AMD or Intel or any other vendor) will split 512-bit vectors into two 256-bit vectors or even four 128-bit vectors. So the optimal way to zero a zmm register is to xor the 128-bit register with itself and rely on zero extension. And you are right, the 128-bit VEX-coded instruction is one or two bytes shorter.
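
In asm terms that's the same idiom again (a sketch; note that zmm16-31 have no VEX encoding, so they can't be zeroed this way):

    vxorps  xmm0, xmm0, xmm0   ; VEX: 4 bytes here (5 bytes for xmm8-15); bits 511:128
                               ; of zmm0 are zeroed implicitly
    ; rather than an EVEX-only form such as:
    vpxord  zmm0, zmm0, zmm0   ; AVX512F, EVEX-encoded: 6 bytes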

Most processors recognize the xor of a register with itself to be independent of the previous value of the register.
