Are there any modern CPUs where a cached byte store is actually slower than a word store?


Question

It's a common claim that a byte store into cache may result in an internal read-modify-write cycle, or otherwise hurt throughput or latency vs. storing a full register.

But I've never seen any examples. No x86 CPU is like this, and I think all high-performance CPUs can modify any byte in a cache line directly. Are some microcontrollers or low-end CPUs different, if they have a cache at all?

(I'm not counting word-addressable machines, or Alpha which is byte addressable but lacks byte load/store instructions. I'm talking about the narrowest store instruction the ISA natively supports.)

In my research while answering Can modern x86 hardware not store a single byte to memory?, I found that the reasons Alpha AXP omitted byte stores presumed they'd be implemented as true byte stores into cache, not an RMW update of the containing word. (So it would have made ECC protection for L1d cache more expensive, because it would need byte granularity instead of 32-bit).

I assume that word-RMW during commit to L1d cache wasn't considered an implementation option for other more recent ISAs that do implement byte stores.

All modern architectures (other than early Alpha) can do true byte loads/stores to uncacheable MMIO regions (not RMW cycles), which is necessary for writing device drivers for devices that have adjacent byte I/O registers. (e.g. with external enable/disable signals to specify which parts of a wider bus hold the real data, like the 2-bit TSIZ (transfer size) on this ColdFire CPU/microcontroller, or like PCI / PCIe single byte transfers, or like DDR SDRAM control signals that mask selected bytes.)
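For example, in a device driver such a register is typically accessed through a volatile byte pointer. This is just an illustrative sketch (the register address and name are hypothetical), but it shows why the compiler must emit a true byte-store instruction that the bus can then signal with byte enables:

#include <stdint.h>

// Hypothetical device register address, for illustration only.
#define UART_DATA_REG  ((volatile uint8_t *)0x4000c000)

void uart_send_byte(uint8_t b) {
    // The volatile uint8_t access forces a single byte-store instruction;
    // on the bus this becomes a 1-byte transfer (signalled e.g. with
    // byte-enable or transfer-size lines), not an RMW of the whole word.
    *UART_DATA_REG = b;
}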

Maybe doing an RMW cycle in cache for byte stores would be something to consider for a microcontroller design, even though it's not for a high-end superscalar pipelined design aimed at SMP servers / workstations like Alpha?

I think this claim might come from word-addressable machines. Or from unaligned 32-bit stores requiring multiple accesses on many CPUs, and people incorrectly generalizing from that to byte stores.

Just to be clear, I expect that a byte store loop to the same address would run at the same cycles per iteration as a word store loop. So for filling an array, 32-bit stores can go up to 4x faster than 8-bit stores. (Maybe less if 32-bit stores saturate memory bandwidth but 8-bit stores don't.) But unless byte stores have an extra penalty, you won't get more than a 4x speed difference (or whatever the word width is).

And I'm talking about asm. A good compiler will auto-vectorize a byte or int store loop in C and use wider stores or whatever is optimal on the target ISA, if they're contiguous.
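For instance, a plain byte-fill loop like the following (a sketch; the exact output depends on the compiler and target) is typically compiled to SIMD or word-sized stores, or even a call to memset, when volatile doesn't prevent it:

#include <stddef.h>

// Without volatile, gcc/clang typically recognize this as a fill pattern
// and emit wide vector stores or a memset call, not one byte store per
// iteration.
void fill_bytes(unsigned char *arr, size_t n) {
    for (size_t i = 0; i < n; i++)
        arr[i] = 123;
}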

(And store coalescing in the store buffer could also result in wider commits to L1d cache for contiguous byte-store instructions, so that's another thing to watch out for when microbenchmarking)

; x86-64 NASM syntax
mov   rdi, rsp
; RDI holds a 32-bit aligned address
mov   ecx, 1000000000
.loop:                      ; do {
    mov   byte [rdi], al
    mov   byte [rdi+2], dl     ; store two bytes in the same dword
      ; no pointer increment, this is the same 32-bit dword every time
    dec   ecx
    jnz   .loop             ; }while(--ecx != 0)


    mov   eax,60
    xor   edi,edi
    syscall         ; x86-64 Linux sys_exit(0)

Or a loop over an 8kiB array like this, storing 1 byte or 1 word out of every 8 bytes (for a C implementation with sizeof(unsigned int)=4 and CHAR_BIT=8 for the 8kiB, but should compile to comparable functions on any C implementation, with only a minor bias if sizeof(unsigned int) isn't a power of 2). ASM on Godbolt for a few different ISAs, with either no unrolling, or the same amount of unrolling for both versions.

// volatile defeats auto-vectorization
void byte_stores(volatile unsigned char *arr) {
    for (int outer=0 ; outer<1000 ; outer++)
        for (int i=0 ; i< 1024 ; i++)      // loop over 4k * 2*sizeof(int) chars
            arr[i*2*sizeof(unsigned) + 1] = 123;    // touch one byte of every 2 words
}

// volatile to defeat auto-vectorization: x86 could use AVX2 vpmaskmovd
void word_stores(volatile unsigned int *arr) {
    for (int outer=0 ; outer<1000 ; outer++)
        for (int i=0 ; i<(1024 / sizeof(unsigned)) ; i++)  // same number of chars
            arr[i*2 + 0] = 123;       // touch every other int
}

Adjusting sizes as necessary, I'd be really curious if anyone can point to a system where word_store() is faster than byte_store(). (If actually benchmarking, beware of warm-up effects like dynamic clock speed, and the first pass triggering TLB misses and cache misses.)
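If anyone does want to run the comparison, a minimal harness along these lines would do. This is only a sketch, assuming the two functions above, an x86-64-style Linux environment with POSIX clock_gettime, and C11 aligned_alloc; the untimed warm-up pass addresses the TLB/cache/clock-speed caveats just mentioned:

#define _POSIX_C_SOURCE 199309L   // for clock_gettime
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void byte_stores(volatile unsigned char *arr);
void word_stores(volatile unsigned int *arr);

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    // 8kiB buffer, page-aligned so both versions touch the same pages.
    unsigned char *buf = aligned_alloc(4096, 8192);
    if (!buf) return 1;

    // Untimed warm-up: populate the TLB and caches, and give dynamic
    // clock-frequency scaling time to ramp up.
    byte_stores(buf);
    word_stores((unsigned int *)buf);

    double t0 = now();
    byte_stores(buf);
    double t1 = now();
    word_stores((unsigned int *)buf);
    double t2 = now();

    printf("byte_stores: %f s\nword_stores: %f s\n", t1 - t0, t2 - t1);
    return 0;
}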

Or if actual C compilers for ancient platforms don't exist or generate sub-optimal code that doesn't bottleneck on store throughput, then any hand-crafted asm that would show an effect.

Any other way of demonstrating a slowdown for byte stores is fine, I don't insist on strided loops over arrays or spamming writes within one word.

I'd also be fine with detailed documentation about CPU internals, or CPU cycle timing numbers for different instructions. I'm leery of optimization advice or guides that could be based on this claim without having tested, though.

  • Any still-relevant CPU or microcontroller where cached byte stores have an extra penalty?
  • Any still-relevant CPU or microcontroller where un-cacheable byte stores have an extra penalty?
  • Any not-still-relevant historical CPUs (with or without write-back or write-through caches) where either of the above is true? What's the most recent example?

e.g. is this the case on an ARM Cortex-A?? or Cortex-M? Any older ARM microarchitecture? Any MIPS microcontroller or early MIPS server/workstation CPU? Any other random RISC like PA-RISC, or CISC like VAX or 486? (The CDC 6600 was word-addressable.)

Or construct a test case involving loads as well as stores, e.g. showing word-RMW from byte stores competing with load throughput.

(I'm not interested in showing that store-forwarding from byte stores to word loads is slower than word->word, because it's normal that SF only works efficiently when a load is fully contained in the most recent store to touch any of the relevant bytes. But something that showed byte->byte forwarding being less efficient than word->word SF would be interesting, maybe with bytes that don't start at a word boundary.)
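A rough way to provoke the byte->byte pattern from C is a store/reload dependency chain like the sketch below (the function name is mine; for real control over operand size and alignment you'd want hand-written asm, and the compiler must not be allowed to keep the value in a register, hence volatile):

#include <stdint.h>

// Each iteration reloads the byte just stored, so every load must be
// store-forwarded from the immediately preceding store. Compare against
// the same loop written with uint32_t, and try pointers that don't sit
// on a word boundary.
uint8_t byte_sf_chain(volatile uint8_t *p, long iters) {
    uint8_t v = 1;
    for (long i = 0; i < iters; i++) {
        *p = v;        // byte store
        v = *p + 1;    // byte reload, forwarded from the store
    }
    return v;
}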

(I didn't mention byte loads because that's generally easy: access a full word from cache or RAM and then extract the byte you want. That implementation detail is indistinguishable other than for MMIO, where CPUs definitely don't read the containing word.)

On a load/store architecture like MIPS, working with byte data just means you use lb or lbu to load and zero or sign-extend it, then store it back with sb. (If you need truncation to 8 bits between steps in registers, then you might need an extra instruction, so local vars should usually be register sized. Unless you want the compiler to auto-vectorize with SIMD with 8-bit elements, then often uint8_t locals are good...) But anyway, if you do it right and your compiler is good, it shouldn't cost any extra instructions to have byte arrays.
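As a small example, incrementing every element of a byte array needs no extra truncation instructions; the MIPS asm in the comment is hand-sketched for illustration, not verified compiler output:

#include <stddef.h>
#include <stdint.h>

// The inner loop should compile on MIPS to roughly (hand-sketched):
//   lbu   $t0, 0($a0)     # zero-extending byte load
//   addiu $t0, $t0, 1     # add; no separate truncation needed
//   sb    $t0, 0($a0)     # byte store; sb writes only the low 8 bits
//   addiu $a0, $a0, 1
void increment_bytes(uint8_t *arr, size_t n) {
    for (size_t i = 0; i < n; i++)
        arr[i]++;
}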

I notice that gcc has sizeof(uint_fast8_t) == 1 on ARM, AArch64, x86, and MIPS. But IDK how much stock we can put in that. The x86-64 System V ABI defines uint_fast32_t as a 64-bit type on x86-64. If they're going to do that (instead of 32-bit, which is x86-64's default operand-size), uint_fast8_t should also be a 64-bit type. Maybe to avoid zero-extension when used as an array index, if it was passed as a function arg in a register? (It could be zero-extended for free anyway if you had to load it from memory.)
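Those choices are easy to probe on any given toolchain with a quick check like this (the results depend on the target ABI, as discussed):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    // e.g. gcc reports sizeof(uint_fast8_t) == 1 on ARM, AArch64, x86,
    // and MIPS; uint_fast32_t is 8 bytes under the x86-64 System V ABI.
    printf("uint_fast8_t:  %zu\n", sizeof(uint_fast8_t));
    printf("uint_fast16_t: %zu\n", sizeof(uint_fast16_t));
    printf("uint_fast32_t: %zu\n", sizeof(uint_fast32_t));
    return 0;
}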

Answer

My guess was wrong. Modern x86 microarchitectures really are different in this way from some (most?) other ISAs.

There can be a penalty for cached narrow stores even on high-performance non-x86 CPUs. The reduction in cache footprint can still make int8_t arrays worth using, though. (And on some ISAs like MIPS, not needing to scale an index for an addressing mode helps).

Merging / coalescing in the store buffer between byte store instructions to the same word before actual commit to L1d can also reduce or remove the penalty. (x86 sometimes can't do as much of this because its strong memory model requires all stores to commit in program order.)

ARM's documentation for Cortex-A15 MPCore (from ~2012) says it uses 32-bit ECC granularity in L1d, and does in fact do a word-RMW for narrow stores to update the data.

The L1 data cache supports optional single bit correct and double bit detect error correction logic in both the tag and data arrays. The ECC granularity for the tag array is the tag for a single cache line and the ECC granularity for the data array is a 32-bit word.

Because of the ECC granularity in the data array, a write to the array cannot update a portion of a 4-byte aligned memory location because there is not enough information to calculate the new ECC value. This is the case for any store instruction that does not write one or more aligned 4-byte regions of memory. In this case, the L1 data memory system reads the existing data in the cache, merges in the modified bytes, and calculates the ECC from the merged value. The L1 memory system attempts to merge multiple stores together to meet the aligned 4-byte ECC granularity and to avoid the read-modify-write requirement.

(When they say "the L1 memory system", I think they mean the store buffer, if you have contiguous byte stores that haven't yet committed to L1d.)

Note that the RMW is atomic, and only involves the exclusively-owned cache line being modified. This is an implementation detail that doesn't affect the memory model. So my conclusion on Can modern x86 hardware not store a single byte to memory? is still (probably) correct that x86 can, and so can every other ISA that provides byte store instructions.

Cortex-A15 MPCore is a 3-way out-of-order execution CPU, so it's not a minimal power / simple ARM design, yet they chose to spend transistors on OoO exec but not efficient byte stores.

Presumably without the need to support efficient unaligned stores (which x86 software is more likely to assume / take advantage of), having slower byte stores was deemed worth it for the higher reliability of ECC for L1d without excessive overhead.

Cortex-A15 is probably not the only, and not the most recent, ARM core to work this way.

Other examples (found by @HadiBrais in comments):

  1. Alpha 21264 (see Table 8-1 of Chapter 8 of this doc) has 8-byte ECC granularity for its L1d cache. Narrower stores (including 32-bit) result in an RMW when they commit to L1d, if they aren't merged in the store buffer first. The doc explains full details of what L1d can do per clock, and specifically documents that the store buffer does coalesce stores.

  2. PowerPC RS64-II and RS64-III (see the section on errors in this doc). According to this abstract, L1 of the RS/6000 processor has 7 bits of ECC for each 32 bits of data.

Alpha was aggressively 64-bit from the ground up, so 8-byte granularity makes some sense, especially if the RMW cost can mostly be hidden / absorbed by the store buffer. (e.g. maybe the normal bottlenecks were elsewhere for most code on that CPU; its multi-ported cache could normally handle 2 operations per clock.)

POWER / PowerPC64 grew out of 32-bit PowerPC and probably cares about running 32-bit code with 32-bit integers and pointers. (So more likely to do non-contiguous 32-bit stores to data structures that couldn't be coalesced.) So 32-bit ECC granularity makes a lot of sense there.
