Can modern x86 hardware not store a single byte to memory?

Question

Speaking of the memory model of C++ for concurrency, Stroustrup's C++ Programming Language, 4th ed., sect. 41.2.1, says:

... (like most modern hardware) the machine could not load or store anything smaller than a word.

However, my x86 processor, a few years old, can and does store objects smaller than a word. For example:

#include <iostream>
int main()
{
    char a =  5;
    char b = 25;
    a = b;
    std::cout << int(a) << "\n";
    return 0;
}

Without optimization, GCC compiles this as:

        [...]
        movb    $5, -1(%rbp)   # a =  5, one byte
        movb    $25, -2(%rbp)  # b = 25, one byte
        movzbl  -2(%rbp), %eax # load b, one byte, not extending the sign
        movb    %al, -1(%rbp)  # a =  b, one byte
        [...]

The comments are by me but the assembly is by GCC. It runs fine, of course.

Obviously, I do not understand what Stroustrup is talking about when he explains that hardware can load and store nothing smaller than a word. As far as I can tell, my program does nothing but load and store objects smaller than a word.

The thoroughgoing focus of C++ on zero-cost, hardware-friendly abstractions sets C++ apart from other programming languages that are easier to master. Therefore, if Stroustrup has an interesting mental model of signals on a bus, or has something else of this kind, then I would like to understand Stroustrup's model.

What is Stroustrup talking about, please?

Longer quote in context

Here is Stroustrup's quote in fuller context:

Consider what might happen if a linker allocated [variables of char type like] c and b in the same word in memory and (like most modern hardware) the machine could not load or store anything smaller than a word.... Without a well-defined and reasonable memory model, thread 1 might read the word containing b and c, change c, and write the word back into memory. At the same time, thread 2 could do the same with b. Then, whichever thread managed to read the word first and whichever thread managed to write its result back into memory last would determine the result....
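The interleaving in that quote can be replayed deterministically in a few lines. This is only a sketch of the hypothetical word-only machine Stroustrup describes, not of x86: the helper names (`with_byte`, `byte_of`, `simulate_lost_update`) and the 32-bit word holding `c` in byte 0 and `b` in byte 1 are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Return a copy of `w` with byte number `index` (0 = least significant) replaced.
std::uint32_t with_byte(std::uint32_t w, unsigned index, std::uint8_t value) {
    assert(index < 4);
    const unsigned shift = index * 8;
    return (w & ~(std::uint32_t{0xFF} << shift)) | (std::uint32_t{value} << shift);
}

// Read byte number `index` out of a word.
std::uint8_t byte_of(std::uint32_t w, unsigned index) {
    assert(index < 4);
    return static_cast<std::uint8_t>(w >> (index * 8));
}

// Hand-picked interleaving on a hypothetical machine that can only load and
// store whole words: both "threads" load the word before either stores it.
std::uint32_t simulate_lost_update() {
    std::uint32_t memory = 0;     // byte 0 is c, byte 1 is b (assumed layout)
    std::uint32_t t1 = memory;    // thread 1 loads the word
    std::uint32_t t2 = memory;    // thread 2 loads the same word
    t1 = with_byte(t1, 0, 'C');   // thread 1 changes c in its private copy
    t2 = with_byte(t2, 1, 'B');   // thread 2 changes b in its private copy
    memory = t1;                  // thread 1 writes the word back
    memory = t2;                  // thread 2 writes back a stale c: the update is lost
    return memory;
}
```

After `simulate_lost_update()`, byte 1 holds `'B'` but byte 0 is back to zero, which is the outcome the quote warns about.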

Additional remarks

I do not believe that Stroustrup is talking about cache lines. Even if he were, as far as I know, cache coherency protocols would transparently handle that problem except maybe during hardware I/O.

I have checked my processor's hardware datasheet. Electrically, my processor (an Intel Ivy Bridge) seems to address DDR3L memory by some sort of 16-bit multiplexing scheme, so I don't know what that's about. It is not clear to me that that has much to do with Stroustrup's point, though.

Stroustrup is a smart man and an eminent scientist, so I do not doubt that he is talking about something sensible. I am confused.

See also this question. My question resembles the linked question in several ways, and the answers to the linked question are also helpful here. However, my question goes also to the hardware/bus model that motivates C++ to be the way it is and that causes Stroustrup to write what he writes. I do not seek an answer merely regarding that which the C++ standard formally guarantees, but also wish to understand why the C++ standard would guarantee it. What is the underlying thought? This is part of my question, too.

Answer

TL:DR: On every modern ISA that has byte-store instructions (including x86), they're atomic and don't disturb surrounding bytes. (I'm not aware of any older ISAs where byte-store instructions could "invent writes" to neighbouring bytes either.)

The actual implementation mechanism (in non-x86 CPUs) is sometimes an internal RMW cycle to modify a whole word in a cache line, but that's done "invisibly" inside a core while it has exclusive ownership of the cache line so it's only ever a performance problem, not correctness. (And merging in the store buffer can sometimes turn byte-store instructions into an efficient full-word commit to L1d cache.)

I don't think it's a very accurate, clear or useful statement. It would be more accurate to say that modern CPUs can't load or store anything smaller than a cache line. (Although that's not true for uncacheable memory regions, e.g. for MMIO.)

It probably would have been better just to make a hypothetical example to talk about memory models, rather than implying that real hardware is like this. But if we try, we can maybe find an interpretation that isn't as obviously or totally wrong, which might have been what Stroustrup was thinking when he wrote this to introduce the topic of memory models. (Sorry this answer is so long; I ended up writing a lot while guessing what he might have meant and about related topics...)

Or maybe this is another case of high-level language designers not being hardware experts, or at least occasionally making mis-statements.

I think Stroustrup is talking about how CPUs work internally to implement byte-store instructions. He's suggesting that a CPU without a well-defined and reasonable memory model might implement a byte-store with a non-atomic RMW of the containing word in a cache line, or in memory for a CPU without cache.

Even this weaker claim about internal (not externally visible) behaviour is not true for high-performance x86 CPUs. Modern Intel CPUs have no throughput penalty for byte stores, or even unaligned word or vector stores that don't cross a cache-line boundary. AMD is similar.

If byte or unaligned stores had to do a RMW cycle as the store committed to L1D cache, it would interfere with store and/or load instruction/uop throughput in a way we could measure with performance counters. (In a carefully designed experiment that avoids the possibility of store coalescing in the store buffer before commit to L1d cache hiding the cost, because the store execution unit(s) can only run 1 store per clock on current CPUs.)

However, some high performance designs for non-x86 ISAs do use an atomic RMW cycle to internally commit stores to L1d cache. See "Are there any modern CPUs where a cached byte store is actually slower than a word store?" The cache line stays in MESI Exclusive/Modified state the whole time, so it can't introduce any correctness problems, only a small performance hit. This is very different from doing something that could step on stores from other CPUs. (The arguments below about that not happening still apply, but my update may have missed some stuff that still argues that atomic cache-RMW is unlikely.)

(On many non-x86 ISAs, unaligned stores are not supported at all, or are used more rarely than in x86 software. And weakly-ordered ISAs allow more coalescing in store buffers, so not as many byte store instructions actually result in single-byte commit to L1d. Without these motivations for fancy (power hungry) cache-access hardware, word RMW for scattered byte stores is an acceptable tradeoff in some designs.)

Alpha AXP, a high-performance RISC design from 1992, famously (and uniquely among modern non-DSP ISAs) omitted byte load/store instructions until Alpha 21164A (EV56) in 1996. Apparently they didn't consider word-RMW a viable option for implementing byte stores, because one of the cited advantages for implementing only 32-bit and 64-bit aligned stores was more efficient ECC for the L1D cache. "Traditional SECDED ECC would require 7 extra bits over 32-bit granules (22% overhead) versus 4 extra bits over 8-bit granules (50% overhead)." (@Paul A. Clayton's answer about word vs. byte addressing has some other interesting computer-architecture stuff.) If byte stores were implemented with word-RMW, you could still do error detection/correction with word-granularity.

Current Intel CPUs only use parity (not ECC) in L1D for this reason. See this Q&A about hardware (not) eliminating "silent stores": checking the old contents of cache before the write to avoid marking the line dirty if it matched would require a RMW instead of just a store, and that's a major obstacle.

It turns out some high-perf pipelined designs do use atomic word-RMW to commit to L1d, despite it stalling the memory pipeline, but (as I argue below) it's much less likely that any do an externally-visible RMW to RAM.

Word-RMW isn't a useful option for MMIO byte stores either, so unless you have an architecture that doesn't need sub-word stores for IO, you'd need some kind of special handling for IO (like Alpha's sparse I/O space where word load/stores were mapped to byte load/stores so it could use commodity PCI cards instead of needing special hardware with no byte IO registers).

As @Margaret points out, DDR3 memory controllers can do byte stores by setting control signals that mask out other bytes of a burst. The same mechanisms that get this information to the memory controller (for uncached stores) could also get that information passed along with a load or store to MMIO space. So there are hardware mechanisms for really doing a byte store even on burst-oriented memory systems, and it's highly likely that modern CPUs will use that instead of implementing an RMW, because it's probably simpler and is much better for MMIO correctness.
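A toy model of such a masked write may make the idea concrete. This is a sketch under assumptions, not a description of any real memory controller: the function names are hypothetical, one 64-bit transfer stands in for a beat of a burst, and the mask is modelled as one "enable" bit per byte lane (real DDR3 data-mask signals have the opposite polarity, where an asserted bit suppresses the write).

```cpp
#include <cassert>
#include <cstdint>

// Model of what the memory side does with a masked write: only byte lanes
// whose enable bit is set take the new data, the rest keep their contents.
// The CPU never has to read the old word to make this work.
std::uint64_t masked_write(std::uint64_t old_word, std::uint64_t new_data,
                           std::uint8_t enable) {
    std::uint64_t lane_mask = 0;
    for (unsigned lane = 0; lane < 8; ++lane) {
        if (enable & (1u << lane)) {
            lane_mask |= std::uint64_t{0xFF} << (lane * 8);
        }
    }
    return (old_word & ~lane_mask) | (new_data & lane_mask);
}

// A byte store is then a full-width transfer with exactly one lane enabled.
std::uint64_t store_byte(std::uint64_t old_word, unsigned offset,
                         std::uint8_t value) {
    assert(offset < 8);
    return masked_write(old_word, std::uint64_t{value} << (offset * 8),
                        static_cast<std::uint8_t>(1u << offset));
}
```

For example, `store_byte(0x1122334455667788, 1, 0xAA)` changes only the second-lowest byte. The point of the design is that the merge happens where the data lives, so there is no window in which another agent's write to a neighbouring byte could be overwritten.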

The Q&A "How many and what size cycles will be needed to perform longword transferred to the CPU" shows how a ColdFire microcontroller signals the transfer size (byte/word/longword/16-byte line) with external signal lines, letting it do byte loads/stores even if 32-bit-wide memory was hooked up to its 32-bit data bus. Something like this is presumably typical for most memory bus setups (but I don't know). The ColdFire example is complicated by also being configurable to use 16 or 8-bit memory, taking extra cycles for wider transfers. But never mind that, the important point is that it has external signaling for the transfer size, to tell the memory HW which byte it's actually writing.

Stroustrup's next paragraph is:

"The C++ memory model guarantees that two threads of execution can update and access separate memory locations without interfering with each other. This is exactly what we would naively expect. It is the compiler’s job to protect us from the sometimes very strange and subtle behaviors of modern hardware. How a compiler and hardware combination achieves that is up to the compiler. ..."
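The guarantee in that paragraph can be exercised with two threads that each update only their own `char`. This is a minimal sketch, assuming a C++11 (or later) implementation with `std::thread`; the `Pair` struct and `update_neighbours` function are hypothetical names, and the two members will normally land in the same word and the same cache line.

```cpp
#include <cassert>
#include <thread>

struct Pair {
    char c = 0;  // adjacent bytes, but separate memory locations in the C++ model
    char b = 0;
};

// Each thread reads and writes only its own member, so there is no data race
// and neither thread may lose an update because of its neighbour.
Pair update_neighbours(int iterations) {
    assert(iterations >= 0);
    Pair p;
    std::thread t1([&p, iterations] {
        for (int i = 0; i < iterations; ++i) {
            p.c = static_cast<char>((p.c + 1) % 100);
        }
    });
    std::thread t2([&p, iterations] {
        for (int i = 0; i < iterations; ++i) {
            p.b = static_cast<char>((p.b + 1) % 50);
        }
    });
    t1.join();
    t2.join();
    return p;
}
```

On a conforming implementation the result depends only on `iterations` (`c` ends at `iterations % 100`, `b` at `iterations % 50`), never on how the threads interleave. On an ISA with byte stores the compiler gets this for free by emitting ordinary byte store instructions.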

So apparently he thinks that real modern hardware may not provide "safe" byte load/store. The people who design hardware memory models agree with the C/C++ people, and realize that byte store instructions would not be very useful to programmers / compilers if they could step on neighbouring bytes.

All modern (non-DSP) architectures except early Alpha AXP have byte store and load instructions, and AFAIK these are all architecturally defined to not affect neighbouring bytes. However they accomplish that in hardware, software doesn't need to care about correctness. Even the very first version of MIPS (in 1983) had byte and half-word loads/stores, and it's a very word-oriented ISA.

However, he doesn't actually claim that most modern hardware needs any special compiler support to implement this part of the C++ memory model, just that some might. Maybe he really is only talking about word-addressable DSPs in that 2nd paragraph (where C and C++ implementations often use 16 or 32-bit char as exactly the kind of compiler workaround Stroustrup was talking about.)

Most "modern" CPUs (including all x86) have an L1D cache. They will fetch whole cache lines (typically 64 bytes) and track dirty / not-dirty on a per-cache-line basis. So two adjacent bytes are pretty much exactly the same as two adjacent words, if they're both in the same cache line. Writing one byte or word will result in a fetch of the whole line, and eventually a write-back of the whole line. See Ulrich Drepper's What Every Programmer Should Know About Memory. You're correct that MESI (or a derivative like MESIF/MOESI) makes sure this isn't a problem. (But again, this is because hardware implements a sane memory model.)
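To make "in the same cache line" concrete, here is a small sketch that checks whether two addresses fall in the same line. The 64-byte line size is an assumption (typical for x86, but not guaranteed by the language), and `same_cache_line` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Assumed line size; typical for current x86 CPUs but not universal.
constexpr std::uintptr_t kAssumedLineSize = 64;

// Two objects share a cache line when their addresses fall in the same
// aligned 64-byte block, regardless of whether they are bytes or words.
bool same_cache_line(const void* p, const void* q) {
    assert(p != nullptr && q != nullptr);
    return reinterpret_cast<std::uintptr_t>(p) / kAssumedLineSize ==
           reinterpret_cast<std::uintptr_t>(q) / kAssumedLineSize;
}
```

With a 64-byte-aligned buffer, bytes 0 and 63 share a line while bytes 63 and 64 do not, so from the coherence protocol's point of view two neighbouring chars are no different from two neighbouring ints in that line.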

A store can only commit to L1D cache while the line is in the Modified state (of MESI). So even if the internal hardware implementation is slow for bytes and takes extra time to merge the byte into the containing word in the cache line, it's effectively an atomic read modify write as long as it doesn't allow the line to be invalidated and re-acquired between the read and the write. (While this cache has the line in Modified state, no other cache can have a valid copy). See @old_timer's comment making the same point (but also for RMW in a memory controller).

This is easier than e.g. an atomic xchg or add from a register that also needs an ALU and register access, since all the HW involved is in the same pipeline stage, which can simply stall for an extra cycle or two. That's obviously bad for performance and takes extra hardware to allow that pipeline stage to signal that it's stalling. This doesn't necessarily conflict with Stroustrup's first claim, because he was talking about a hypothetical ISA without a memory model, but it's still a stretch.

On a single-core microcontroller, internal word-RMW for cached byte stores would be more plausible, since there won't be Invalidate requests coming in from other cores that they'd have to delay responding to during an atomic RMW cache-word update. But that doesn't help for I/O to uncacheable regions. I say microcontroller because other single-core CPU designs typically support some kind of multi-socket SMP.

Many RISC ISAs don't support unaligned-word loads/stores with a single instruction, but that's a separate issue (the difficulty is handling the case when a load spans two cache lines or even pages, which can't happen with bytes or aligned half-words). More and more ISAs are adding guaranteed support for unaligned load/store in recent versions, though. (e.g. MIPS32/64 Release 6 in 2014, and I think AArch64 and recent 32-bit ARM).

The 4th edition of Stroustrup's book was published in 2013 when Alpha had been dead for years. The first edition was published in 1985, when RISC was the new big idea (e.g. Stanford MIPS in 1983, according to Wikipedia's timeline of computing HW), but "modern" CPUs at that time were byte-addressable with byte stores. Cyber CDC 6600 was word-addressable and probably still around, but couldn't be called modern.

Even very word-oriented RISC machines like MIPS and SPARC have byte store and byte load (with sign or zero extension) instructions. They don't support unaligned word loads, simplifying the cache (or memory access if there is no cache) and load ports, but you can load any single byte with one instruction, and more importantly store a byte without any architecturally-visible non-atomic rewrite of the surrounding bytes. (Although cached stores can

I suppose C++11 (which introduces a thread-aware memory model to the language) on Alpha would need to use 32-bit char if targeting a version of the Alpha ISA without byte stores. Or it would have to use software atomic-RMW with LL/SC when it couldn't prove that no other threads could have a pointer that would let them write neighbouring bytes.
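A software atomic-RMW byte store of that kind might look roughly like the following. This is a sketch, not what any Alpha compiler actually emitted: it uses `std::atomic` with a compare-exchange retry loop (which compilers typically lower to an LL/SC loop on ISAs that have one), and `store_byte_atomic` with its little-endian byte numbering is an assumption for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Store one byte into a shared 32-bit word without disturbing concurrent
// updates to the other three bytes. `index` 0 is the least significant byte.
void store_byte_atomic(std::atomic<std::uint32_t>& word, unsigned index,
                       std::uint8_t value) {
    assert(index < 4);
    const unsigned shift = index * 8;
    const std::uint32_t mask = std::uint32_t{0xFF} << shift;
    std::uint32_t expected = word.load(std::memory_order_relaxed);
    std::uint32_t desired;
    do {
        // Merge the new byte into the most recently observed word value.
        desired = (expected & ~mask) | (std::uint32_t{value} << shift);
        // On failure, compare_exchange_weak refreshes `expected` with the
        // current contents, so a neighbour's concurrent write is preserved.
    } while (!word.compare_exchange_weak(expected, desired,
                                         std::memory_order_relaxed));
}
```

The retry loop is what makes this safe where the plain load/modify/store sequence is not: the write-back only succeeds if no other thread changed the word in between. It is also clearly more expensive than a single store, which is why a 32-bit `char` would be the other plausible choice.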

IDK how slow byte load/store instructions are in any CPUs where they're implemented in hardware but not as cheap as word loads/stores. Byte loads are cheap on x86 as long as you use movzx/movsx to avoid partial-register false dependencies or merging stalls. (On AMD pre-Ryzen, movsx/movzx needs an extra ALU uop, but otherwise zero/sign extension is handled right in the load port on Intel and AMD CPUs.) The main x86 downside is that you need a separate load instruction instead of using a memory operand as a source for an ALU instruction (if you're adding a zero-extended byte to a 32-bit integer), which would otherwise save front-end uop throughput bandwidth and code-size. Or if you're just adding a byte to a byte register, there's basically no downside on x86. RISC load-store ISAs always need separate load and store instructions anyway. x86 byte stores are no more expensive than 32-bit stores.

As a performance issue, a good C++ implementation for hardware with slow byte stores might put each char in its own word and use word loads/stores whenever possible (e.g. for globals outside structs, and for locals on the stack). IDK if any real implementations of MIPS / ARM / whatever have slow byte load/store, but if so maybe gcc has -mtune= options to control it.
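In source code, the "each char in its own word" layout can be approximated by hand with `alignas`. This is only a sketch of the idea, assuming a 4-byte word; `WordChar` and `Flags` are hypothetical names, and a compiler doing this automatically would not need any source changes.

```cpp
#include <cstddef>

// A char padded out to occupy a whole (assumed 4-byte) word by itself.
struct alignas(4) WordChar {
    char value;
};

static_assert(sizeof(WordChar) == 4, "each WordChar fills exactly one word");
static_assert(alignof(WordChar) == 4, "each WordChar starts on a word boundary");

// Two logically adjacent chars that no longer share a word, so a word-sized
// store to one of them cannot touch the other.
struct Flags {
    WordChar c;
    WordChar b;
};

static_assert(offsetof(Flags, b) - offsetof(Flags, c) == 4,
              "c and b live in different words");
```

The cost is 4x the memory for each such variable, which is the tradeoff being described.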

That doesn't help for char[], or dereferencing a char * when you don't know where it might be pointing. (This includes volatile char* which you'd use for MMIO.) So having the compiler+linker put char variables in separate words isn't a complete solution, just a performance hack if true byte stores are slow.

PS: More about Alpha:

Alpha is interesting for a lot of reasons: one of the few clean-slate 64-bit ISAs, not an extension to an existing 32-bit ISA. And one of the more recent clean-slate ISAs, Itanium being another from several years later which attempted some neat CPU-architecture ideas.

From the Linux Alpha HOWTO:

When the Alpha architecture was introduced, it was unique amongst RISC architectures for eschewing 8-bit and 16-bit loads and stores. It supported 32-bit and 64-bit loads and stores (longword and quadword, in Digital's nomenclature). The co-architects (Dick Sites, Rich Witek) justified this decision by citing the advantages:

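  1. Byte support in the cache and memory sub-system tends to slow down accesses for 32-bit and 64-bit quantities.
  2. Byte support makes it hard to build high-speed error-correction circuitry into the cache/memory sub-system.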

Alpha compensates by providing powerful instructions for manipulating bytes and byte groups within 64-bit registers. Standard benchmarks for string operations (e.g., some of the Byte benchmarks) show that Alpha performs very well on byte manipulation.
