从 x86 CPU 生成 64 字节读取 PCIe TLP [英] Generating a 64-byte read PCIe TLP from an x86 CPU

查看:25
本文介绍了从 x86 CPU 生成 64 字节读取 PCIe TLP的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在向 PCIe 设备写入数据时,可以使用写入组合映射来提示 CPU 应该向设备生成 64 字节的 TLP.

When writing data to a PCIe device, it is possible to use a write-combining mapping to hint the CPU that it should generate 64-byte TLPs towards the device.

是否可以为读取做类似的事情?以某种方式提示 CPU 读取整个缓存行或更大的缓冲区,而不是一次读取一个字?

Is it possible to do something similar for reads? Somehow hint the CPU to read an entire cache line or a larger buffer instead of reading one word at a time?

推荐答案

英特尔有 一份关于从视频 RAM 复制到主内存的白皮书;这应该是相似的,但要简单得多(因为数据适合 2 或 4 个向量寄存器).

Intel has a white-paper on copying from video RAM to main memory; this should be similar but a lot simpler (because the data fits in 2 or 4 vector registers).

它说 NT 加载会将整个缓存行数据从 WC 内存拉入 LFB:

It says that NT loads will pull a whole cache-line of data from WC memory into a LFB:

普通加载指令以指令请求的相同大小为单位从 USWC 内存中提取数据.相比之下,像 MOVNTDQA 这样的流式加载指令通常会将完整的缓存数据行拉到 CPU 中的特殊填充缓冲区".后续流式加载将从该填充缓冲区中读取数据,从而减少延迟.

Ordinary load instructions pull data from USWC memory in units of the same size the instruction requests. By contrast, a streaming load instruction such as MOVNTDQA will commonly pull a full cache line of data to a special "fill buffer" in the CPU. Subsequent streaming loads would read from that fill buffer, incurring much less delay.

使用 AVX2 _mm256_stream_load_si256() 或 SSE4.1/AVX1 128 位版本.

Use AVX2 _mm256_stream_load_si256() or the SSE4.1/AVX1 128-bit version.

填充缓冲区是一种有限的资源,因此您肯定希望编译器生成 asm 来执行 64 字节缓存行的两个对齐加载背靠背,然后存储到常规内存.

Fill-buffers are a limited resource, so you definitely want the compiler to generate asm that does the two aligned loads of a 64-byte cache-line back to back, then store to regular memory.

如果您一次处理多个 64 字节块,请参阅英特尔的白皮书,了解有关使用在 L1d 中保持热态的小型反弹缓冲区的建议,以避免将存储与 NT 负载混合到 DRAM.(到 DRAM 的 L1d 驱逐,如 NT 存储,也需要行填充缓冲区,LFB).

If you're doing more than one 64-byte block at a time, see Intel's white-paper for a suggestion on using a small bounce buffer that stays hot in L1d to avoid mixing stores to DRAM with NT loads. (L1d evictions to DRAM, like NT stores, also require line-fill buffers, LFBs).

请注意 _mm256_stream_load_si256() 对内存类型以外的内存类型根本没有用厕所.NT 提示在当前硬件上被忽略,但与常规负载相比,它无论如何都要花费额外的 ALU uop.有prefetchnta,但那是完全不同的野兽.

Note that _mm256_stream_load_si256() is not useful at all on memory types other than WC. The NT hint is ignored on current hardware, but it costs an extra ALU uop anyway vs. a regular load. There is prefetchnta, but that's a totally different beast.

这篇关于从 x86 CPU 生成 64 字节读取 PCIe TLP的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆