What is the effect of the second argument in __builtin_prefetch()?

Question

The GCC documentation specifies the usage of __builtin_prefetch.

The third argument works as documented:

  • If it is 0, the compiler generates a prefetchnta (%rax) instruction.
  • If it is 1, the compiler generates a prefetcht2 (%rax) instruction.
  • If it is 2, the compiler generates a prefetcht1 (%rax) instruction.
  • If it is 3 (the default), the compiler generates a prefetcht0 (%rax) instruction.

If we vary the third argument, the opcode changes accordingly.

But the second argument does not seem to have any effect.

__builtin_prefetch(&x,1,2);
__builtin_prefetch(&x,0,2);
__builtin_prefetch(&x,0,1);
__builtin_prefetch(&x,0,0);

The sample code above generated the following assembly:

  27:   0f 18 10                prefetcht1 (%rax)
  2a:   48 8d 45 fc             lea    -0x4(%rbp),%rax
  2e:   0f 18 10                prefetcht1 (%rax)
  31:   48 8d 45 fc             lea    -0x4(%rbp),%rax
  35:   0f 18 18                prefetcht2 (%rax)
  38:   48 8d 45 fc             lea    -0x4(%rbp),%rax
  3c:   0f 18 00                prefetchnta (%rax)

One can observe the opcode changing with the third argument. But even when I change the second argument (which specifies read or write), the assembly code remains the same: compare <27,2a> and <2e,31>. So it does not appear to give the machine any information. What, then, is the purpose of the second argument?

Recommended answer

As Margaret points out, the second argument is rw: 0 (the default) requests a prefetch for a read, and 1 requests a prefetch for a write.
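As a rough illustration of where the rw argument would differ in source code, here is a minimal sketch with one read-only loop and one read-modify-write loop. The function names and the PREFETCH_AHEAD distance are hypothetical, chosen only for this example, and whether the write hint produces a different instruction depends on the target options discussed below.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch. */
enum { PREFETCH_AHEAD = 16 };

/* Read-only pass: rw = 0 hints that the line will only be read. */
long sum_array(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3);
        sum += a[i];
    }
    return sum;
}

/* Read-modify-write pass: rw = 1 hints that the line will be written. */
void scale_in_place(int *a, size_t n, int factor)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 1, 3);
        a[i] *= factor;
    }
}
```

A simple sequential walk like this is usually handled well by hardware prefetch, so the loops are only meant to show the two forms of the call, not a case where software prefetch is known to help.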

Baseline x86-64 (SSE2) does not include write-prefetch instructions, but they exist as ISA extensions. As usual, compilers won't use them unless you tell them you're compiling for a target that supports it. (But they will safely run as a NOP on any non-ancient CPU.)

The two instructions are PREFETCHW (into L1d cache, like PREFETCHT0) and PREFETCHWT1 (into L2 cache, like PREFETCHT1). They prefetch a line into the Exclusive MESI state by sending out an RFO (Read-For-Ownership). This invalidates every other copy of the line in every other core. From that state, the store buffer can commit data to the line (and flip it to Modified) without any further off-core traffic. If the line is not modified before eviction, it can simply be dropped.
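The state changes described above can be sketched with a toy model of one cache line as seen by a few cores. This is not how any real CPU is implemented: the struct, the function names, and the single "off-core request" counter are all assumptions made for illustration, and write-backs and timing are not modeled.

```c
enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

enum { NCORES = 4 };

/* One cache line as seen by each core, plus a count of off-core requests. */
struct line_model {
    enum mesi state[NCORES];
    int offcore_requests;
};

/* Read-intent prefetch: the line may end up Shared with other cores. */
void prefetch_read(struct line_model *m, int core)
{
    if (m->state[core] != INVALID)
        return;                         /* already cached, nothing to do */
    m->offcore_requests++;
    int others = 0;
    for (int c = 0; c < NCORES; c++) {
        if (c != core && m->state[c] != INVALID) {
            m->state[c] = SHARED;       /* other copies are downgraded */
            others = 1;
        }
    }
    m->state[core] = others ? SHARED : EXCLUSIVE;
}

/* Write-intent prefetch: send an RFO, invalidate other copies, own the line. */
void prefetch_write(struct line_model *m, int core)
{
    if (m->state[core] == EXCLUSIVE || m->state[core] == MODIFIED)
        return;                         /* already owned */
    m->offcore_requests++;
    for (int c = 0; c < NCORES; c++)
        if (c != core)
            m->state[c] = INVALID;
    m->state[core] = EXCLUSIVE;
}

/* Store: needs ownership first; Exclusive -> Modified needs no new request. */
void store(struct line_model *m, int core)
{
    prefetch_write(m, core);            /* no-op if the line is already owned */
    m->state[core] = MODIFIED;
}
```

In this model, a store that follows prefetch_write on the same core adds no off-core request, whereas a store to a line that is only Shared has to issue the RFO at store time.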

The PREFETCHW instruction is merely a hint and does not affect program behavior. If executed, this instruction moves data closer to the processor and invalidates other cached copies in anticipation of the line being written to in the future.

They have nearly the same machine encoding: the same 0F 0D opcode, differing only in /1 or /2 in the ModRM /r field. This is just like how the read-prefetch instructions PREFETCHT0/T1/T2/NTA share an opcode and are differentiated only by /0 (NTA), /1 (T0), etc. in the ModRM /r field. Using the /r bits as extra opcode bits is not unique to prefetch; other one-operand and immediate instructions also do that.
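To make the encoding concrete, here is a small sketch that classifies a prefetch instruction from its first three bytes by looking at the ModRM /r field (bits 5:3). The enum and function names are invented for this example, and the decoder deliberately ignores prefixes and any 0F 0D encodings other than /1 and /2.

```c
#include <stddef.h>
#include <stdint.h>

enum prefetch_kind {
    PF_NONE, PF_NTA, PF_T0, PF_T1, PF_T2, PF_W, PF_WT1
};

/* Classify a prefetch encoding: opcode 0F 18 or 0F 0D followed by ModRM. */
enum prefetch_kind decode_prefetch(const uint8_t *insn, size_t len)
{
    if (len < 3 || insn[0] != 0x0F)
        return PF_NONE;

    unsigned mod = insn[2] >> 6;          /* ModRM bits 7:6 */
    unsigned reg = (insn[2] >> 3) & 7;    /* ModRM bits 5:3, the "/digit" */

    if (mod == 3)
        return PF_NONE;                   /* register form: no memory operand */

    if (insn[1] == 0x18) {                /* read prefetches: /0 .. /3 */
        static const enum prefetch_kind rd[4] = {
            PF_NTA, PF_T0, PF_T1, PF_T2
        };
        return reg < 4 ? rd[reg] : PF_NONE;
    }
    if (insn[1] == 0x0D) {                /* write prefetches: /1 and /2 */
        if (reg == 1) return PF_W;
        if (reg == 2) return PF_WT1;
    }
    return PF_NONE;                       /* anything else is not handled */
}
```

Fed the bytes from the question's disassembly, 0f 18 10 has ModRM 0x10, so /r is 2 and the result is PF_T1; 0f 18 18 gives PF_T2 and 0f 18 00 gives PF_NTA, matching the listing.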

Related: Difference between prefetch for read or write

PREFETCHW originally appeared in AMD's 3DNow!, but has its own feature bit so that CPUs can indicate support for it but not other 3DNow! (packed-float in MMX regs) instructions.

PREFETCHWT1 also has its own CPUID feature bit, but might be associated with AVX512PF. It appears to be available only in Xeon Phi (Knights Landing / Knights Mill), not mainstream Skylake-AVX512, same as AVX512PF (https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512). (Evidence: according to Intel's Future Extensions manual, CPUID with EAX=7/ECX=0 gives a feature bitmap in ECX including "Bit 00: PREFETCHWT1 (Intel® Xeon Phi™ only)". There is also a mailing-list post.)
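For anyone who wants to check these feature bits on their own machine, the following is a minimal sketch using the `<cpuid.h>` helpers shipped with GCC and clang. The function names are made up for this example. The PREFETCHWT1 bit position is the one quoted above; the PREFETCHW bit (CPUID leaf 80000001H, ECX bit 8) is stated here from the vendor manuals rather than from this answer. On non-x86 targets the functions simply report 0.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* PREFETCHW: CPUID leaf 0x80000001, ECX bit 8. */
int cpu_has_prefetchw(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;                 /* leaf not supported */
    return (int)((ecx >> 8) & 1u);
#else
    return 0;                     /* not an x86 target */
#endif
}

/* PREFETCHWT1: CPUID leaf 7, subleaf 0, ECX bit 0. */
int cpu_has_prefetchwt1(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;                 /* leaf not supported */
    return (int)(ecx & 1u);
#else
    return 0;                     /* not an x86 target */
#endif
}
```

Keep in mind that a clear feature bit only says the CPU does not advertise the instruction; as noted further down, the encoding may still execute as a NOP.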

__builtin_prefetch(p,1,2); compiles as follows with GCC:

  • PREFETCHT1 with no -m options, or -march=haswell or older Intel.
  • PREFETCHW with an AMD target, like -march=k8 or -march=bdver2 (Piledriver).
  • PREFETCHW with -march=broadwell or newer Intel SnB-family, and/or -mprfchw for any arch.
  • PREFETCHWT1 with -mprefetchwt1. (If PREFETCHW is also available, gcc uses it for locality=3, but PREFETCHWT1 for locality<=2.) GCC for some reason doesn't enable this as part of -march=knl or -march=knm, but clang does. I think this is an oversight in GCC.

-mprefetchwt1 implies -mprfchw. See also the x86 options section in the GCC manual for more about -march=native vs. -march=whatever to enable a set of ISA extensions and set -mtune=whatever appropriately.

Check it out on the Godbolt compiler explorer, for -march=haswell vs. -march=broadwell -mprefetchwt1. Or modify the compiler args yourself.

clang -O3 -march=knl, and gcc -O3 -march=broadwell -mprefetchwt1 make the same asm:

pref:
        prefetchwt1     [rdi]    #   __builtin_prefetch(p,1,2);  // KNL only, otherwise we get prefetchw
        prefetchw       [rdi]    #   __builtin_prefetch(p,1,3);

        prefetcht0      [rdi]    #   __builtin_prefetch(p,0,3);
        prefetcht1      [rdi]    #   __builtin_prefetch(p,0,2);
        prefetcht2      [rdi]    #   __builtin_prefetch(p,0,1);
        prefetchnta     [rdi]    #   __builtin_prefetch(p,0,0);
        ret
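The C source behind that listing is not shown, but judging from the comments in the asm it would look roughly like the sketch below. The function name pref and the pointer parameter come from the listing; the parameter type is a guess.

```c
/* Reconstructed from the asm comments above. Compile with, for example,
 * gcc -O3 -march=broadwell -mprefetchwt1 -S (or use the Godbolt compiler
 * explorer) to see which instruction each call turns into. */
void pref(int *p)
{
    __builtin_prefetch(p, 1, 2);   /* write intent, locality 2 */
    __builtin_prefetch(p, 1, 3);   /* write intent, locality 3 */

    __builtin_prefetch(p, 0, 3);   /* read intent, locality 3 */
    __builtin_prefetch(p, 0, 2);   /* read intent, locality 2 */
    __builtin_prefetch(p, 0, 1);   /* read intent, locality 1 */
    __builtin_prefetch(p, 0, 0);   /* read intent, locality 0 */
}
```

Changing only the -march / -m options and comparing the output is the easiest way to see the second argument take effect.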

Also note that their 0F 0D r/m8 machine code decodes as a multi-byte NOP on non-ancient CPUs that don't have the PREFETCHW or 3DNow! feature-bit. On early 64-bit Intel CPUs, it's an illegal instruction. (Newer versions of Windows require that PREFETCHW executes without faulting, and in that context people talk about a CPU "supporting PREFETCHW" even if it runs as a NOP).

It's possible that CPUs which support PREFETCHW but not PREFETCHWT1 will actually run PREFETCHWT1 as if it were PREFETCHW, but I haven't tested. (It should be testable by running threads on different cores, one doing repeated stores to a location and the other doing PREFETCHWT1 vs. PREFETCHW vs. read prefetch vs. NOP, and see how the writing thread's throughput is affected.)

It might be preferable to use a read-intent prefetch instead of a NOP, though (like GCC does). But you probably don't want to do both a PREFETCHW and a PREFETCHT0, because too many prefetch instructions aren't a good thing. (This is especially true for Intel IvyBridge, which has some kind of performance bug for prefetch-instruction throughput. But IvB would run PREFETCHW as a NOP, so you're only getting one prefetch on that uarch.)

Tuning software-prefetch is hard: too much prefetching means fewer execution resources spent doing real work, if HW prefetch does its job successfully. See Cost of a sub-optimal cacheline prefetch and What Every Programmer Should Know About Memory?
