How to load an AVX-512 ZMM register from an ioremap() address?


Question

My goal is to create a PCIe transaction with more than 64b payload. For that I need to read an ioremap() address.

For 128b and 256b I can use xmm and ymm registers respectively and that works as expected.

Now, I'd like to do the same for 512b zmm registers (memory-like storage?!)

Code under a license that I'm not allowed to show here uses assembly code for 256b:

void __iomem *addr;
uint8_t datareg[32];
[...]
// Read memory address to ymm (to have 256b at once):
asm volatile("vmovdqa %0,%%ymm1" : : "m"(*(volatile uint8_t * __force) addr));
// Copy ymm data to stack data: (to be able to use that in a gcc handled code)
asm volatile("vmovdqa %%ymm1,%0" :"=m"(datareg): :"memory");

This is to be used in a kernel module compiled with EXTRA_CFLAGS += -mavx2 -mavx512f to support AVX-512. Edit: __AVX512F__ and __AVX2__ are checked at compile time to confirm support.
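
A guard along these lines could fail the build early (a sketch only; the actual check isn't shown here):

#if !defined(__AVX512F__) || !defined(__AVX2__)
#error "AVX2 and AVX-512F support required: build with EXTRA_CFLAGS += -mavx2 -mavx512f"
#endif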

  1. Why does this example use ymm1 and not a different register (e.g. ymm0, or ymm2..ymm15)?
  2. How can I read an address into a 512b zmm register?
  3. How can I be sure the register won't be overwritten between the two asm lines?

Simply replacing ymm with zmm, gcc shows Error: operand size mismatch for `vmovdqa'.

If that code isn't correct or isn't best practice, let's solve that first, since I've just started to dig into this.

Answer

You need vmovdqa32 because AVX512 has per-element masking; all instructions need a SIMD element size. See below for a version that should be safe. You would have seen this if you read the manual for vmovdqa; vmovdqa32 for ZMM is documented in the same entry.

(3): Kernel code is compiled with SSE/AVX disabled so the compiler won't ever generate instructions that touch xmm/ymm/zmm registers. (For most kernels, e.g. Linux). That's what makes this code "safe" from having the register modified between asm statements. It's still a bad idea to make them separate statements for this use-case though, despite the fact that Linux md-raid code does that. OTOH letting the compiler schedule some other instructions between the store and load is not a bad thing.

Ordering between asm statements is provided by them both being volatile - compilers can't reorder volatile operations with other volatile operations, only with plain operations.

In Linux for example, it's only safe to use FP / SIMD instructions between calls to kernel_fpu_begin() and kernel_fpu_end() (which are slow: begin saves the whole SIMD state on the spot, and end restores it or at least marks it as needing to happen before return to user-space). If you get this wrong, your code will silently corrupt user-space vector registers!!
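
As a minimal sketch (assuming a Linux module; foo() is the inline-asm copy function shown later in this answer, and read_device_word() is just a hypothetical wrapper name):

#include <asm/fpu/api.h>      /* kernel_fpu_begin() / kernel_fpu_end() */

int read_device_word(void *addr)
{
    int val;

    kernel_fpu_begin();   /* saves the user-space FP/SIMD state before any vector reg is touched */
    val = foo(addr);      /* foo() = the 64-byte inline-asm copy defined below */
    kernel_fpu_end();     /* state gets restored (or marked for restore) before return to user-space */

    return val;
}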

This is to be used in a kernel module compiled with EXTRA_CFLAGS += -mavx2 -mavx512f to support AVX-512.

You must not do that. Letting the compiler emit its own AVX / AVX512 instructions in kernel code could be disastrous because you can't stop it from trashing a vector reg before kernel_fpu_begin(). Only use vector regs via inline asm.

Also note that using ZMM registers at all temporarily reduces max turbo clock speed for that core (or on a "client" chip, for all cores because their clock speeds are locked together). See SIMD instructions lowering CPU frequency

I'd like to use 512b zmm* registers as memory-like storage.

With fast L1d cache and store-forwarding, are you sure you'd even gain anything from using ZMM registers as fast "memory like" (thread-local) storage? Especially when you can only get data out of SIMD registers and back into integer regs via store/reload from an array (or more inline asm to shuffle...). A few places in Linux (like md RAID5/RAID6) use SIMD ALU instructions for block XOR or raid6 parity, and there it is worth the overhead of kernel_fpu_begin(). But if you're just loading / storing to use ZMM / YMM state as storage that can't cache-miss, not looping over big buffers, it's probably not worth it.

(It turns out you actually want to use 64-byte copies to generate PCIe transactions, which is a totally separate use-case from keeping data around in registers long-term.)

Like you apparently actually do, to get a 64-byte PCIe transaction.

It would be better to make this a single asm statement, because otherwise there's no connection between the two asm statements other than both being asm volatile, which forces that ordering. (If you were doing this with AVX instructions enabled for the compiler's use, you'd simply use intrinsics, not "=x" / "x" outputs / inputs to connect separate asm statements.)

Why did the example choose ymm1? It's as good as any other random choice from ymm0..7, which allow a 2-byte VEX prefix (ymm8..15 might need more code size on those instructions). With AVX code-gen disabled, there's no way to ask the compiler to pick a convenient register for you with a dummy output operand.

uint8_t datareg[32]; is broken; it needs to be alignas(32) uint8_t datareg[32]; to ensure that a vmovdqa store won't fault.
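
For example:

#include <stdalign.h>
alignas(32) uint8_t datareg[32];   // 32-byte aligned, so a vmovdqa store to it can't fault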

The "memory" clobber on the output is useless; the whole array is already an output operand because you named an array variable as the output, not just a pointer. (In fact, casting to pointer-to-array is how you tell the compiler that a plain dereferenced-pointer input or output is actually wider, e.g. for asm that contains loops or in this case for asm that uses SIMD when we can't tell the compiler about the vectors. How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)

The asm statement is volatile so it won't be optimized away to reuse the same output. The only C object touched by the asm statement is the array object, which is an output operand, so the compiler already knows about that effect.

AVX512 has per-element masking as part of any instruction, including loads/stores. That means there's vmovdqa32 and vmovdqa64 for different masking granularity. (And vmovdqu8/16/32/64 if you include AVX512BW). FP versions of instructions already have ps or pd baked in to the mnemonic so the mnemonic stays the same for ZMM vectors there. You'd see this right away if you looked at compiler-generated asm for an auto-vectorized loop with 512-bit vectors, or intrinsics.
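
For example, in user-space code (not usable in the kernel, for the reasons above) a plain 64-byte copy produces exactly those mnemonics; this sketch assumes GCC/clang with -mavx512f, and copy64() is just an illustrative name:

#include <immintrin.h>

/* Compiles to a vmovdqu32/vmovdqu64 load plus a vmovdqa32/vmovdqa64 store;
 * the element-size suffix is the compiler's choice since no masking is used. */
void copy64(void *dst, const void *src)
{
    __m512i v = _mm512_loadu_si512(src);
    _mm512_store_si512(dst, v);        /* dst must be 64-byte aligned */
}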

This should be safe:

#include <stdalign.h>
#include <stdint.h>
#include <string.h>

#define __force 
int foo (void *addr) {
    alignas(16) uint8_t datareg[64];   // 16-byte alignment doesn't cost any extra code.
      // if you're only doing one load per function call
      // maybe not worth the couple extra instructions to align by 64

    asm volatile (
      "vmovdqa32  %1, %%zmm16\n\t"   // aligned
      "vmovdqu32  %%zmm16, %0"       // maybe unaligned; could increase latency but prob. doesn't hurt throughput much compared to an IO read.
        : "=m"(datareg)
        : "m" (*(volatile const char (* __force)[64]) addr)  // the whole 64 bytes are an input
     : // "memory"  not needed, except for ordering wrt. non-volatile accesses to other memory
    );

    int retval;
    memcpy(&retval, datareg+8, 4);  // memcpy can inline as long as the kernel doesn't use -fno-builtin
                    // but IIRC Linux uses -fno-strict-aliasing so you could use cast to (int*)
    return retval;
}

I don't know how your __force is defined; it might go in front of addr instead of as the array-pointer type. Or maybe it goes as part of the volatile const char array element type. Again, see How can I indicate that the memory *pointed* to by an inline ASM argument may be used? for more about that input cast.

Since you're reading IO memory, asm volatile is necessary; another read of the same address could read a different value. Same if you were reading memory that another CPU core could have modified asynchronously.

Otherwise, I think asm volatile isn't necessary, if you want to let the compiler optimize away redundant repeats of the same copy.

A "memory" clobber also isn't necessary: we tell the compiler about the full width of both the input and the output, so it has a full picture of what's going on.

If you need ordering wrt. other non-volatile memory accesses, you could use a "memory" clobber for that. But asm volatile is ordered wrt. dereferences of volatile pointers, including READ_ONCE and WRITE_ONCE which you should be using for any lock-free inter-thread communication (assuming this is the Linux kernel).

ZMM16..31 doesn't need a vzeroupper to avoid performance problems, and EVEX is always fixed length.

I only aligned the output buffer by 16 bytes. If there's an actual function call that doesn't get inlined for each 64-byte load, overhead of aligning RSP by 64 might be more than the cost of a cache-line-split store 3/4 of the time. Store-forwarding I think still works efficiently from that wide store to narrow reloads of chunks of that buffer on Skylake-X family CPUs.

If you're reading a larger buffer, use that buffer for the output instead of bouncing through a 64-byte tmp array.

There are probably other ways to generate wider PCIe read transactions; if the memory is in a WC region then 4x movntdqa loads from the same aligned 64-byte block should work, too. Or 2x vmovntdqa ymm loads; I'd recommend that to avoid turbo penalties.
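
A sketch of the 2x vmovntdqa ymm variant, with a hypothetical read_wc_64() helper (assumptions: addr maps a WC region and is 64-byte aligned, this runs between kernel_fpu_begin()/kernel_fpu_end(), and whether the hardware merges the two loads into one wide PCIe read isn't guaranteed):

#include <stdalign.h>
#include <stdint.h>
#include <string.h>

#define __force   /* kernel sparse annotation; empty for a standalone build */

static void read_wc_64(const volatile void *addr, uint8_t *dst)
{
    alignas(64) uint8_t buf[64];

    asm volatile(
        "vmovntdqa  %2, %%ymm1\n\t"   /* NT load of bytes 0..31  */
        "vmovntdqa  %3, %%ymm2\n\t"   /* NT load of bytes 32..63 */
        "vmovdqa    %%ymm1, %0\n\t"   /* spill both halves to the stack buffer */
        "vmovdqa    %%ymm2, %1"
        : "=m"(*(uint8_t (*)[32]) (buf + 0)),
          "=m"(*(uint8_t (*)[32]) (buf + 32))
        : "m"(*(const volatile char (* __force)[32]) addr),
          "m"(*(const volatile char (* __force)[32]) ((const volatile char *) addr + 32)));

    memcpy(dst, buf, 64);
}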
