Are there any modern CPUs where a cached byte store is actually slower than a word store?


Question

It's a common claim that a byte store into cache may result in an internal read-modify-write cycle, or otherwise hurt throughput or latency vs. storing a full register.

But I've never seen any examples. No x86 CPUs are like this, and I think all high-performance CPUs can directly modify any byte in a cache line, too. Are some microcontrollers or low-end CPUs different (if they have a cache at all)?

(I'm not counting word-addressable machines, or Alpha which is byte addressable but lacks byte load/store instructions. I'm talking about the narrowest store instruction the ISA natively supports.)

In my research while answering Can modern x86 hardware not store a single byte to memory?, I found that the reasons Alpha AXP omitted byte stores presumed they'd be implemented as true byte stores into cache, not an RMW update of the containing word. (So it would have made ECC protection for L1d cache more expensive, because it would need byte granularity instead of 32-bit).

I assumed that word-RMW during commit to L1d cache wasn't considered an implementation option for other, more recent ISAs that do implement byte stores.

All modern architectures (other than early Alpha) can do true byte loads/stores to uncacheable MMIO regions (not RMW cycles), which is necessary for writing device drivers for devices that have adjacent byte I/O registers. (e.g. with external enable/disable signals to specify which parts of a wider bus hold the real data, like the 2-bit TSIZ (transfer size) on this ColdFire CPU/microcontroller, or like PCI / PCIe single byte transfers, or like DDR SDRAM control signals that mask selected bytes.)
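
For example, writing a byte-wide device register from a driver is just a volatile byte access (a minimal sketch; the device, base address, and register offset here are made up):

#include <stdint.h>

// Hypothetical device with byte-wide registers packed into one bus word.
#define DEV_BASE     0x40001000u
#define DEV_STATUS   (*(volatile uint8_t *)(DEV_BASE + 0x1))

void ack_device(void) {
    DEV_STATUS = 0x01;   // must go out on the bus as a single-byte write;
                         // a word-sized RMW would also touch the neighboring
                         // registers at offsets 0x0, 0x2, and 0x3
}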

Maybe doing an RMW cycle in cache for byte stores would be something to consider for a microcontroller design, even though it's not for a high-end superscalar pipelined design aimed at SMP servers / workstations like Alpha?

I think this claim might come from word-addressable machines. Or from unaligned 32-bit stores requiring multiple accesses on many CPUs, and people incorrectly generalizing from that to byte stores.

Just to be clear, I expect that a byte-store loop to the same address would run at the same cycles per iteration as a word-store loop. So for filling an array, 32-bit stores can go up to 4x faster than 8-bit stores. (Maybe less if 32-bit stores saturate memory bandwidth but 8-bit stores don't.) But unless byte stores have an extra penalty, you won't get more than a 4x speed difference (or whatever the word width is).
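
In C terms, the array-fill comparison looks like this (a sketch; volatile keeps each store at its written width, and it assumes sizeof(unsigned) == 4):

// With no byte-store penalty, fill_words does 1/4 as many stores over the
// same bytes, so it should be at most ~4x faster than fill_bytes.
void fill_bytes(volatile unsigned char *p, unsigned n) {
    for (unsigned i = 0; i < n; i++)
        p[i] = 0;                        // n byte stores
}

void fill_words(volatile unsigned *p, unsigned n) {
    for (unsigned i = 0; i < n / sizeof(unsigned); i++)
        p[i] = 0;                        // n/4 word stores
}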

And I'm talking about asm. A good compiler will auto-vectorize a byte or int store loop in C and use wider stores or whatever is optimal on the target ISA, if they're contiguous.

(And store coalescing in the store buffer could also result in wider commits to L1d cache for contiguous byte-store instructions, so that's another thing to watch out for when microbenchmarking)

; x86-64 NASM syntax
mov   rdi, rsp
; RDI holds a 32-bit-aligned address
mov   ecx, 1000000000
.loop:                      ; do {
    mov   byte [rdi], al
    mov   byte [rdi+2], dl     ; store two bytes in the same dword
      ; no pointer increment, this is the same 32-bit dword every time
    dec   ecx
    jnz   .loop             ; }while(--ecx != 0)


    mov   eax,60
    xor   edi,edi
    syscall         ; x86-64 Linux sys_exit(0)

Or a loop over an 8kiB array like this, storing 1 byte or 1 word out of every 8 bytes (for a C implementation with sizeof(unsigned int)=4 and CHAR_BIT=8 for the 8kiB, but should compile to comparable functions on any C implementation, with only a minor bias if sizeof(unsigned int) isn't a power of 2). ASM on Godbolt for a few different ISAs, with either no unrolling, or the same amount of unrolling for both versions.

// volatile defeats auto-vectorization
void byte_stores(volatile unsigned char *arr) {
    for (int outer=0 ; outer<1000 ; outer++)
        for (int i=0 ; i< 1024 ; i++)      // loop over 4k * 2*sizeof(int) chars
            arr[i*2*sizeof(unsigned) + 1] = 123;    // touch one byte of every 2 words
}

// volatile to defeat auto-vectorization: x86 could use AVX2 vpmaskmovd
void word_stores(volatile unsigned int *arr) {
    for (int outer=0 ; outer<1000 ; outer++)
        for (int i=0 ; i<(1024 / sizeof(unsigned)) ; i++)  // same number of chars
            arr[i*2 + 0] = 123;       // touch every other int
}
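
If you actually run these, a minimal timing harness could look like the sketch below (assumes POSIX clock_gettime; the untimed first calls are the warm-up pass the next paragraph warns about):

#include <stdio.h>
#include <time.h>

static unsigned buf[8192 / sizeof(unsigned)];   // 8kiB, word-aligned

static double run_secs(void (*fn)(void)) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

static void run_bytes(void) { byte_stores((volatile unsigned char *)buf); }
static void run_words(void) { word_stores(buf); }

int main(void) {
    run_secs(run_bytes);                 // warm-up: page faults, TLB and
    run_secs(run_words);                 // cache misses happen here
    printf("byte_stores: %.6f s\n", run_secs(run_bytes));
    printf("word_stores: %.6f s\n", run_secs(run_words));
    return 0;
}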

Adjusting sizes as necessary, I'd be really curious if anyone can point to a system where word_stores() is faster than byte_stores(). (If actually benchmarking, beware of warm-up effects like dynamic clock speed, and the first pass triggering TLB misses and cache misses.)

Or if actual C compilers for ancient platforms don't exist, or generate sub-optimal code that doesn't bottleneck on store throughput, then any hand-crafted asm that would show an effect is fine.

Any other way of demonstrating a slowdown for byte stores is fine; I don't insist on strided loops over arrays or spamming writes within one word.

I'd also be fine with detailed documentation about CPU internals, or CPU cycle timing numbers for different instructions. I'm leery of optimization advice or guides that could be based on this claim without having tested, though.

  • Any still-relevant CPU or microcontroller where cached byte stores have an extra penalty?
  • Any still-relevant CPU or microcontroller where un-cacheable byte stores have an extra penalty?
  • Any not-still-relevant historical CPUs (with or without write-back or write-through caches) where either of the above are true? What's the most recent example?

e.g. is this the case on an ARM Cortex-A?? or Cortex-M? Any older ARM microarchitecture? Any MIPS microcontroller or early MIPS server/workstation CPU? Any other random RISC like PA-RISC, or CISC like VAX or 486? (CDC6600 was word-addressable.)

Or construct a test-case involving loads as well as stores, e.g. showing word-RMW from byte stores competing with load throughput.

(I'm not interested in showing that store-forwarding from byte stores to word loads is slower than word->word, because it's normal that SF only works efficiently when a load is fully contained in the most recent store to touch any of the relevant bytes. But something that showed byte->byte forwarding being less efficient than word->word SF would be interesting, maybe with bytes that don't start at a word boundary.)

(I didn't mention byte loads because that's generally easy: access a full word from cache or RAM and then extract the byte you want. That implementation detail is indistinguishable other than for MMIO, where CPUs definitely don't read the containing word.)
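
As a sketch of that internal word-fetch-and-extract (assuming little-endian byte numbering; software never needs to write this when the ISA has byte loads):

// Conceptually, a byte load at byte address 'addr' from word-organized storage:
unsigned byte_load(const unsigned *mem, unsigned addr) {
    unsigned word = mem[addr / sizeof(unsigned)];             // aligned word fetch
    return (word >> ((addr % sizeof(unsigned)) * 8)) & 0xFF;  // byte select (LE)
}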

On a load/store architecture like MIPS, working with byte data just means you use lb or lbu to load and zero or sign-extend it, then store it back with sb. (If you need truncation to 8 bits between steps in registers, then you might need an extra instruction, so local vars should usually be register sized. Unless you want the compiler to auto-vectorize with SIMD with 8-bit elements, then often uint8_t locals are good...) But anyway, if you do it right and your compiler is good, it shouldn't cost any extra instructions to have byte arrays.
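
For example, a byte-array loop like this should need no extra masking or truncation instructions (a sketch; on MIPS the body should compile to just lbu / addiu / sb):

// Increment every element of a byte array.  With native byte loads/stores,
// this costs the same instruction count per element as an int version;
// the sb store truncates to 8 bits for free.
void inc_bytes(unsigned char *arr, int n) {
    for (int i = 0; i < n; i++)
        arr[i] = arr[i] + 1;
}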

I notice that gcc has sizeof(uint_fast8_t) == 1 on ARM, AArch64, x86, and MIPS. But IDK how much stock we can put in that. The x86-64 System V ABI defines uint_fast32_t as a 64-bit type on x86-64. If they're going to do that (instead of 32-bit, which is x86-64's default operand-size), uint_fast8_t should also be a 64-bit type. Maybe to avoid zero-extension when used as an array index, if it was passed as a function arg in a register? (Zero-extension would come for free if you had to load it from memory anyway.)
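
A quick way to check what any given compiler/ABI actually does (a trivial sketch):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    printf("uint_fast8_t:  %zu byte(s)\n", sizeof(uint_fast8_t));
    printf("uint_fast32_t: %zu byte(s)\n", sizeof(uint_fast32_t));
    return 0;
}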

Answer

My guess was wrong. Modern x86 microarchitectures really are different in this way from some (most?) other ISAs.

There can be a penalty for cached narrow stores even on high-performance non-x86 CPUs. The reduction in cache footprint can still make int8_t arrays worth using, though. (And on some ISAs like MIPS, not needing to scale an index for an addressing mode helps).

Merging / coalescing in the store buffer between byte-store instructions to the same word before actual commit to L1d can also reduce or remove the penalty. (x86 sometimes can't do as much of this because its strong memory model requires all stores to commit in program order.)
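
For example, four contiguous byte stores like these might commit to L1d as a single 32-bit write if the store buffer merges them (a sketch; whether a given core actually does so is microarchitecture-dependent):

// volatile keeps these as four separate byte-store instructions in the asm;
// merging them before commit to L1d, if it happens, is done in hardware.
void store4(volatile unsigned char *p) {
    p[0] = 1;
    p[1] = 2;
    p[2] = 3;
    p[3] = 4;
}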

ARM's documentation for Cortex-A15 MPCore (from ~2012) says it uses 32-bit ECC granularity in L1d, and does in fact do a word-RMW for narrow stores to update the data.

The L1 data cache supports optional single bit correct and double bit detect error correction logic in both the tag and data arrays. The ECC granularity for the tag array is the tag for a single cache line and the ECC granularity for the data array is a 32-bit word.

Because of the ECC granularity in the data array, a write to the array cannot update a portion of a 4-byte aligned memory location because there is not enough information to calculate the new ECC value. This is the case for any store instruction that does not write one or more aligned 4-byte regions of memory. In this case, the L1 data memory system reads the existing data in the cache, merges in the modified bytes, and calculates the ECC from the merged value. The L1 memory system attempts to merge multiple stores together to meet the aligned 4-byte ECC granularity and to avoid the read-modify-write requirement.

(When they say "the L1 memory system", I think they mean the store buffer, if you have contiguous byte stores that haven't yet committed to L1d.)
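
In pseudocode, the word-RMW the manual describes would look something like this (a C-flavored sketch of the hardware behavior, not real software; compute_ecc is a made-up placeholder for whatever SECDED code the data array actually uses):

#include <stdint.h>

static uint8_t compute_ecc(uint32_t w) {   // placeholder, NOT a real SECDED code
    return (uint8_t)(w ^ (w >> 8) ^ (w >> 16) ^ (w >> 24));
}

// Commit a 1-byte store into a data array with 32-bit ECC granules:
// read the old word, merge the new byte, recompute ECC over the whole word.
static void commit_byte_store(uint32_t *data, uint8_t *ecc,
                              unsigned word_idx, unsigned byte_in_word,
                              uint8_t val) {
    uint32_t old    = data[word_idx];                       // the "R" in RMW
    uint32_t shift  = byte_in_word * 8;
    uint32_t merged = (old & ~(0xFFu << shift))
                    | ((uint32_t)val << shift);
    data[word_idx] = merged;                                // the "W"
    ecc[word_idx]  = compute_ecc(merged);  // needs all 32 bits, hence the read
}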

Note that the RMW is atomic, and only involves the exclusively-owned cache line being modified. This is an implementation detail that doesn't affect the memory model. So my conclusion on Can modern x86 hardware not store a single byte to memory? is still (probably) correct that x86 can, and so can every other ISA that provides byte store instructions.

Cortex-A15 MPCore is a 3-way out-of-order execution CPU, so it's not a minimal power / simple ARM design, yet they chose to spend transistors on OoO exec but not efficient byte stores.

Presumably without the need to support efficient unaligned stores (which x86 software is more likely to assume / take advantage of), having slower byte stores was deemed worth it for the higher reliability of ECC for L1d without excessive overhead.

Cortex-A15 is probably not the only, and not the most recent, ARM core to work this way.

Other examples (found by @HadiBrais in comments):

  1. Alpha 21264 (see Table 8-1 of Chapter 8 of this doc) has 8-byte ECC granularity for its L1d cache. Narrower stores (including 32-bit) result in a RMW when they commit to L1d, if they aren't merged in the store buffer first. The doc explains full details of what L1d can do per clock. And specifically documents that the store buffer does coalesce stores.

  2. PowerPC RS64-II and RS64-III (see the section on errors in this doc). According to this abstract, L1 of the RS/6000 processor has 7 bits of ECC for each 32 bits of data.

Alpha was aggressively 64-bit from the ground up, so 8-byte granularity makes some sense, especially if the RMW cost can mostly be hidden / absorbed by the store buffer. (e.g. maybe the normal bottlenecks were elsewhere for most code on that CPU; its multi-ported cache could normally handle 2 operations per clock.)

POWER / PowerPC64 grew out of 32-bit PowerPC and probably cares about running 32-bit code with 32-bit integers and pointers. (So more likely to do non-contiguous 32-bit stores to data structures that couldn't be coalesced.) So 32-bit ECC granularity makes a lot of sense there.
