从 x86 CPU 生成 64 字节读取 PCIe TLP [英] Generating a 64-byte read PCIe TLP from an x86 CPU
问题描述
在向 PCIe 设备写入数据时,可以使用写入组合映射来提示 CPU 应该向设备生成 64 字节的 TLP.
When writing data to a PCIe device, it is possible to use a write-combining mapping to hint the CPU that it should generate 64-byte TLPs towards the device.
是否可以为读取做类似的事情?以某种方式提示 CPU 读取整个缓存行或更大的缓冲区,而不是一次读取一个字?
Is it possible to do something similar for reads? Somehow hint the CPU to read an entire cache line or a larger buffer instead of reading one word at a time?
推荐答案
英特尔有 一份关于从视频 RAM 复制到主内存的白皮书;这应该是相似的,但要简单得多(因为数据适合 2 或 4 个向量寄存器).
Intel has a white-paper on copying from video RAM to main memory; this should be similar but a lot simpler (because the data fits in 2 or 4 vector registers).
它说 NT 加载会将整个缓存行数据从 WC 内存拉入 LFB:
It says that NT loads will pull a whole cache-line of data from WC memory into a LFB:
普通加载指令以指令请求的相同大小为单位从 USWC 内存中提取数据.相比之下,像 MOVNTDQA 这样的流式加载指令通常会将完整的缓存数据行拉到 CPU 中的特殊填充缓冲区".后续流式加载将从该填充缓冲区中读取数据,从而减少延迟.
Ordinary load instructions pull data from USWC memory in units of the same size the instruction requests. By contrast, a streaming load instruction such as MOVNTDQA will commonly pull a full cache line of data to a special "fill buffer" in the CPU. Subsequent streaming loads would read from that fill buffer, incurring much less delay.
使用 AVX2 _mm256_stream_load_si256()
或 SSE4.1/AVX1 128 位版本.
Use AVX2 _mm256_stream_load_si256()
or the SSE4.1/AVX1 128-bit version.
填充缓冲区是一种有限的资源,因此您肯定希望编译器生成 asm 来执行 64 字节缓存行的两个对齐加载背靠背,然后存储到常规内存.
Fill-buffers are a limited resource, so you definitely want the compiler to generate asm that does the two aligned loads of a 64-byte cache-line back to back, then store to regular memory.
如果您一次处理多个 64 字节块,请参阅英特尔的白皮书,了解有关使用在 L1d 中保持热态的小型反弹缓冲区的建议,以避免将存储与 NT 负载混合到 DRAM.(到 DRAM 的 L1d 驱逐,如 NT 存储,也需要行填充缓冲区,LFB).
If you're doing more than one 64-byte block at a time, see Intel's white-paper for a suggestion on using a small bounce buffer that stays hot in L1d to avoid mixing stores to DRAM with NT loads. (L1d evictions to DRAM, like NT stores, also require line-fill buffers, LFBs).
请注意 _mm256_stream_load_si256()
对内存类型以外的内存类型根本没有用厕所.NT 提示在当前硬件上被忽略,但与常规负载相比,它无论如何都要花费额外的 ALU uop.有prefetchnta
,但那是完全不同的野兽.
Note that _mm256_stream_load_si256()
is not useful at all on memory types other than WC. The NT hint is ignored on current hardware, but it costs an extra ALU uop anyway vs. a regular load. There is prefetchnta
, but that's a totally different beast.
这篇关于从 x86 CPU 生成 64 字节读取 PCIe TLP的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!