Can modern x86 hardware not store a single byte to memory?

Views: 94 Posted: 2020/9/12 21:20:58 Tags: c++ assembly concurrency x86 memory-model

Question

Speaking of the memory model of C++ for concurrency, Stroustrup's C++ Programming Language, 4th ed., sect. 41.2.1, says:

... (like most modern hardware) the machine could not load or store anything smaller than a word.

However, my x86 processor, a few years old, can and does store objects smaller than a word. For example:

#include <iostream>
int main()
{
    char a =  5;
    char b = 25;
    a = b;
    std::cout << int(a) << "\n";
    return 0;
}

Without optimization, GCC compiles this as:

        [...]
        movb    $5, -1(%rbp)   # a =  5, one byte
        movb    $25, -2(%rbp)  # b = 25, one byte
        movzbl  -2(%rbp), %eax # load b, one byte, not extending the sign
        movb    %al, -1(%rbp)  # a =  b, one byte
        [...]

The comments are by me but the assembly is by GCC. It runs fine, of course.

Obviously, I do not understand what Stroustrup is talking about when he explains that hardware can load and store nothing smaller than a word. As far as I can tell, my program does nothing but load and store objects smaller than a word.

The thoroughgoing focus of C++ on zero-cost, hardware-friendly abstractions sets C++ apart from other programming languages that are easier to master. Therefore, if Stroustrup has an interesting mental model of signals on a bus, or has something else of this kind, then I would like to understand Stroustrup's model.

What is Stroustrup talking about, please?

Larger quote in context

Here is Stroustrup's quote in fuller context:

Consider what might happen if a linker allocated [variables of char type like] c and b in the same word in memory and (like most modern hardware) the machine could not load or store anything smaller than a word.... Without a well-defined and reasonable memory model, thread 1 might read the word containing b and c, change c, and write the word back into memory. At the same time, thread 2 could do the same with b. Then, whichever thread managed to read the word first and whichever thread managed to write its result back into memory last would determine the result....
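The interleaving described in this quote can be replayed deterministically on a single thread. The following is a minimal sketch of the hypothetical word-only machine, not of real x86 behaviour. The names `WordOnlyMemory`, `merge_byte`, `extract_byte` and `replay_lost_update`, the 32-bit word size, and the placement of b in byte 0 and c in byte 1 are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical machine: memory is only ever read or written one 32-bit word at a time.
struct WordOnlyMemory {
    std::uint32_t word = 0;  // assumed layout: b lives in byte 0, c in byte 1
};

// Return `word` with byte `index` (0..3) replaced by `value`; other bytes are unchanged.
inline std::uint32_t merge_byte(std::uint32_t word, unsigned index, std::uint8_t value) {
    assert(index < 4);
    const unsigned shift = index * 8;
    const std::uint32_t mask = std::uint32_t{0xFF} << shift;
    return (word & ~mask) | (std::uint32_t{value} << shift);
}

// Return byte `index` (0..3) of `word`.
inline std::uint8_t extract_byte(std::uint32_t word, unsigned index) {
    assert(index < 4);
    return static_cast<std::uint8_t>(word >> (index * 8));
}

// Replays the problematic interleaving step by step; no real threads are involved.
inline WordOnlyMemory replay_lost_update() {
    WordOnlyMemory mem;            // b == 0 and c == 0 initially
    std::uint32_t t1 = mem.word;   // "thread 1" reads the word containing b and c
    std::uint32_t t2 = mem.word;   // "thread 2" reads the same word
    t1 = merge_byte(t1, 1, 0x11);  // thread 1 changes c in its private copy
    t2 = merge_byte(t2, 0, 0x22);  // thread 2 changes b in its private copy
    mem.word = t1;                 // thread 1 writes the whole word back
    mem.word = t2;                 // thread 2 writes back last, carrying a stale c
    return mem;
}
```

After `replay_lost_update()`, byte 0 holds thread 2's value for b, but byte 1 is back to zero: thread 1's update to c was overwritten by a store that never named c. That lost update is exactly what a usable memory model has to rule out.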

Additional remarks

I do not believe that Stroustrup is talking about cache lines. Even if he were, as far as I know, cache coherency protocols would transparently handle that problem except maybe during hardware I/O.

I have checked my processor's hardware datasheet. Electrically, my processor (an Intel Ivy Bridge) seems to address DDR3L memory by some sort of 16-bit multiplexing scheme, so I don't know what that's about. It is not clear to me that that has much to do with Stroustrup's point, though.

Stroustrup is a smart man and an eminent scientist, so I do not doubt that he is talking about something sensible. I am confused.

See also this question. My question resembles the linked question in several ways, and the answers to the linked question are also helpful here. However, my question goes also to the hardware/bus model that motivates C++ to be the way it is and that causes Stroustrup to write what he writes. I do not seek an answer merely regarding that which the C++ standard formally guarantees, but also wish to understand why the C++ standard would guarantee it. What is the underlying thought? This is part of my question, too.

Recommended answer

TL:DR: On every modern ISA that has byte-store instructions (including x86), they're atomic and don't disturb surrounding bytes. (I'm not aware of any older ISAs where byte-store instructions could "invent writes" to neighbouring bytes either.)

The actual implementation mechanism (in non-x86 CPUs) is sometimes an internal RMW cycle to modify a whole word in a cache line, but that's done "invisibly" inside a core while it has exclusive ownership of the cache line so it's only ever a performance problem, not correctness. (And merging in the store buffer can sometimes turn byte-store instructions into an efficient full-word commit to L1d cache.)

I don't think it's a very accurate, clear or useful statement. It would be more accurate to say that modern CPUs can't load or store anything smaller than a cache line. (Although that's not true for uncacheable memory regions, e.g. for MMIO.)

It probably would have been better just to make a hypothetical example to talk about memory models, rather than implying that real hardware is like this. But if we try, we can maybe find an interpretation that isn't as obviously or totally wrong, which might have been what Stroustrup was thinking when he wrote this to introduce the topic of memory models. (Sorry this answer is so long; I ended up writing a lot while guessing what he might have meant and about related topics...)

Or maybe this is another case of high-level language designers not being hardware experts, or at least occasionally making mis-statements.

I think Stroustrup is talking about how CPUs work internally to implement byte-store instructions. He's suggesting that a CPU without a well-defined and reasonable memory model might implement a byte-store with a non-atomic RMW of the containing word in a cache line, or in memory for a CPU without cache.

Even this weaker claim about internal (not externally visible) behaviour is not true for high-performance x86 CPUs. Modern Intel CPUs have no throughput penalty for byte stores, or even unaligned word or vector stores that don't cross a cache-line boundary. AMD is similar.

If byte or unaligned stores had to do a RMW cycle as the store committed to L1D cache, it would interfere with store and/or load instruction/uop throughput in a way we could measure with performance counters. (In a carefully designed experiment that avoids the possibility of store coalescing in the store buffer before commit to L1d cache hiding the cost, because the store execution unit(s) can only run 1 store per clock on current CPUs.)

However, some high-performance designs for non-x86 ISAs do use an atomic RMW cycle to internally commit stores to L1d cache; see the Q&A "Are there any modern CPUs where a cached byte store is actually slower than a word store?". The cache line stays in MESI Exclusive/Modified state the whole time, so it can't introduce any correctness problems, only a small performance hit. This is very different from doing something that could step on stores from other CPUs. (The arguments below about that not happening still apply, but my update may have missed some stuff that still argues that atomic cache-RMW is unlikely.)

(On many non-x86 ISAs, unaligned stores are not supported at all, or are used more rarely than in x86 software. And weakly-ordered ISAs allow more coalescing in store buffers, so not as many byte store instructions actually result in single-byte commit to L1d. Without these motivations for fancy (power hungry) cache-access hardware, word RMW for scattered byte stores is an acceptable tradeoff in some designs.)

Alpha AXP, a high-performance RISC design from 1992, famously (and uniquely among modern non-DSP ISAs) omitted byte load/store instructions until Alpha 21164A (EV56) in 1996. Apparently they didn't consider word-RMW a viable option for implementing byte stores, because one of the cited advantages for implementing only 32-bit and 64-bit aligned stores was more efficient ECC for the L1D cache. "Traditional SECDED ECC would require 7 extra bits over 32-bit granules (22% overhead) versus 4 extra bits over 8-bit granules (50% overhead)." (@Paul A. Clayton's answer about word vs. byte addressing has some other interesting computer-architecture stuff.) If byte stores were implemented with word-RMW, you could still do error detection/correction with word-granularity.

Current Intel CPUs only use parity (not ECC) in L1D for this reason. See this Q&A about hardware (not) eliminating "silent stores": checking the old contents of cache before the write to avoid marking the line dirty if it matched would require a RMW instead of just a store, and that's a major obstacle.

It turns out some high-perf pipelined designs do use atomic word-RMW to commit to L1d, despite it stalling the memory pipeline, but (as I argue below) it's much less likely that any do an externally-visible RMW to RAM.

Word-RMW isn't a useful option for MMIO byte stores either, so unless you have an architecture that doesn't need sub-word stores for IO, you'd need some kind of special handling for IO (like Alpha's sparse I/O space where word load/stores were mapped to byte load/stores so it could use commodity PCI cards instead of needing special hardware with no byte IO registers).

As @Margaret points out, DDR3 memory controllers can do byte stores by setting control signals that mask out other bytes of a burst. The same mechanisms that get this information to the memory controller (for uncached stores) could also get that information passed along with a load or store to MMIO space. So there are hardware mechanisms for really doing a byte store even on burst-oriented memory systems, and it's highly likely that modern CPUs will use that instead of implementing an RMW, because it's probably simpler and is much better for MMIO correctness.
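The byte-masking idea can be pictured with a toy software model. This is only a sketch under stated assumptions: the 8-byte burst width, the `Burst` type, and the `masked_write` and `store_byte` functions are invented for illustration and do not describe the actual DDR3 signalling.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBurstBytes = 8;  // assumed burst width for this toy model
using Burst = std::array<std::uint8_t, kBurstBytes>;

// Copy `data` into `memory`, but only in the byte lanes whose bit is set in `enable`.
inline void masked_write(Burst& memory, const Burst& data, std::uint8_t enable) {
    for (std::size_t i = 0; i < kBurstBytes; ++i) {
        if (enable & (1u << i)) {
            memory[i] = data[i];
        }
    }
}

// A single-byte store is a full burst with exactly one lane enabled.
inline void store_byte(Burst& memory, std::size_t index, std::uint8_t value) {
    assert(index < kBurstBytes);
    Burst data{};          // the other lanes are "don't care" because they are masked
    data[index] = value;
    masked_write(memory, data, static_cast<std::uint8_t>(1u << index));
}
```

The point of the model is that no read of the old contents is needed: the masked lanes are simply never written, so a one-byte store cannot disturb its neighbours.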

The Q&A "How many and what size cycles will be needed to perform longword transferred to the CPU" shows how a ColdFire microcontroller signals the transfer size (byte/word/longword/16-byte line) with external signal lines, letting it do byte loads/stores even if 32-bit-wide memory was hooked up to its 32-bit data bus. Something like this is presumably typical for most memory bus setups (but I don't know). The ColdFire example is complicated by also being configurable to use 16 or 8-bit memory, taking extra cycles for wider transfers. But never mind that; the important point is that it has external signaling for the transfer size, to tell the memory HW which byte it's actually writing.

Stroustrup's next paragraph is:

"The C++ memory model guarantees that two threads of execution can update and access separate memory locations without interfering with each other. This is exactly what we would naively expect. It is the compiler’s job to protect us from the sometimes very strange and subtle behaviors of modern hardware. How a compiler and hardware combination achieves that is up to the compiler. ..."

So apparently he thinks that real modern hardware may not provide "safe" byte load/store. The people who design hardware memory models agree with the C/C++ people, and realize that byte store instructions would not be very useful to programmers / compilers if they could step on neighbouring bytes.
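The guarantee in Stroustrup's paragraph can be exercised with two real threads that each store to their own `char`. This is a minimal sketch; the `Pair` struct and the `run_two_writers` function are hypothetical names, and passing the check on one machine illustrates the guarantee rather than proving it.

```cpp
#include <cassert>
#include <thread>

struct Pair {
    char b = 0;
    char c = 0;  // adjacent to b, so normally in the same word and cache line
};

// Each thread repeatedly stores to its own member and never touches the other one.
// b and c are separate memory locations, so this is not a data race in C++11.
inline Pair run_two_writers(int iterations) {
    assert(iterations > 0);
    Pair p;
    std::thread t1([&p, iterations] {
        for (int i = 0; i < iterations; ++i) {
            p.c = static_cast<char>(i & 0x7F);
        }
    });
    std::thread t2([&p, iterations] {
        for (int i = 0; i < iterations; ++i) {
            p.b = static_cast<char>((i + 1) & 0x7F);
        }
    });
    t1.join();
    t2.join();
    return p;  // join() synchronizes with each thread, so reading p here is race-free
}
```

Whatever the interleaving, each member must end up holding the last value its own thread stored. An optimizing compiler may collapse each loop into a single store, which does not change the outcome. On an implementation that emulated byte stores with a non-atomic word RMW, one thread's final value could be overwritten by the other's stale copy.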

All modern (non-DSP) architectures except early Alpha AXP have byte store and load instructions, and AFAIK these are all architecturally defined to not affect neighbouring bytes. However they accomplish that in hardware, software doesn't need to care about correctness. Even the very first version of MIPS (in 1983) had byte and half-word loads/stores, and it's a very word-oriented ISA.

However, he doesn't actually claim that most modern hardware needs any special compiler support to implement this part of the C++ memory model, just that some might. Maybe he really is only talking about word-addressable DSPs in that 2nd paragraph (where C and C++ implementations often use 16 or 32-bit char as exactly the kind of compiler workaround Stroustrup was talking about.)

Most "modern" CPUs (including all x86) have an L1D cache. They will fetch whole cache lines (typically 64 bytes) and track dirty / not-dirty on a per-cache-line basis. So two adjacent bytes are pretty much exactly the same as two adjacent words, if they're both in the same cache line. Writing one byte or word will result in a fetch of the whole line, and eventually a write-back of the whole line. See Ulrich Drepper's What Every Programmer Should Know About Memory. You're correct that MESI (or a derivative like MESIF/MOESI) makes sure this isn't a problem. (But again, this is because hardware implements a sane memory model.)
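To make the cache-line granularity concrete, here is a small sketch that checks whether two addresses fall in the same line. The 64-byte line size is the typical value mentioned above, taken here as an assumption, and `kAssumedLineSize`, `cache_line_index` and `same_cache_line` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kAssumedLineSize = 64;  // assumption: typical x86 cache-line size

// Index of the cache line containing address `p`, under the assumed line size.
inline std::uintptr_t cache_line_index(const void* p) {
    assert(p != nullptr);
    return reinterpret_cast<std::uintptr_t>(p) / kAssumedLineSize;
}

// True if both addresses are covered by the same cache line.
inline bool same_cache_line(const void* a, const void* b) {
    return cache_line_index(a) == cache_line_index(b);
}
```

From the coherence protocol's point of view, two bytes for which `same_cache_line` is true are handled as a unit, exactly like two words in that line. Separating them is a performance concern (false sharing), not a correctness one.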

A store can only commit to L1D cache while the line is in the Modified state (of MESI). So even if the internal hardware implementation is slow for bytes and takes extra time to merge the byte into the containing word in the cache line, it's effectively an atomic read modify write as long as it doesn't allow the line to be invalidated and re-acquired between the read and the write. (While this cache has the line in Modified state, no other cache can have a valid copy). See @old_timer's comment making the same point (but also for RMW in a memory controller).

This is easier than e.g. an atomic xchg or add from a register that also needs an ALU and register access, since all the HW involved is in the same pipeline stage, which can simply stall for an extra cycle or two. That's obviously bad for performance and takes extra hardware to allow that pipeline stage to signal that it's stalling. This doesn't necessarily conflict with Stroustrup's first claim, because he was talking about a hypothetical ISA without a memory model, but it's still a stretch.

On a single-core microcontroller, internal word-RMW for cached byte stores would be more plausible, since there won't be Invalidate requests coming in from other cores that they'd have to delay responding to during an atomic RMW cache-word update. But that doesn't help for I/O to uncacheable regions. I say microcontroller because other single-core CPU designs typically support some kind of multi-socket SMP.

Many RISC ISAs don't support unaligned-word loads/stores with a single instruction, but that's a separate issue (the difficulty is handling the case when a load spans two cache lines or even pages, which can't happen with bytes or aligned half-words). More and more ISAs are adding guaranteed support for unaligned load/store in recent versions, though. (e.g. MIPS32/64 Release 6 in 2014, and I think AArch64 and recent 32-bit ARM).

The 4th edition of Stroustrup's book was published in 2013, when Alpha had been dead for years. The first edition was published in 1985, when RISC was the new big idea (e.g. Stanford MIPS in 1983, according to Wikipedia's timeline of computing HW), but "modern" CPUs at that time were byte-addressable with byte stores. Cyber CDC 6600 was word-addressable and probably still around, but couldn't be called modern.

Even very word-oriented RISC machines like MIPS and SPARC have byte store and byte load (with sign or zero extension) instructions. They don't support unaligned word loads, simplifying the cache (or memory access if there is no cache) and load ports, but you can load any single byte with one instruction, and more importantly store a byte without any architecturally-visible non-atomic rewrite of the surrounding bytes. (Although cached stores can

I suppose C++11 (which introduces a thread-aware memory model to the language) on Alpha would need to use 32-bit char if targeting a version of the Alpha ISA without byte stores. Or it would have to use software atomic-RMW with LL/SC when it couldn't prove that no other threads could have a pointer that would let them write neighbouring bytes.
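A software atomic RMW of this kind can be sketched in portable C++ with a compare-exchange loop on the containing word. This is an illustration, not what any Alpha compiler is known to have emitted: `store_byte_cas` is a hypothetical helper, and the assumption is that on an LL/SC machine the compiler would lower `compare_exchange_weak` to a load-locked/store-conditional retry loop.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Store one byte into a shared 32-bit word using only word-sized atomic operations.
// Concurrent callers that target different bytes of the same word cannot lose updates.
inline void store_byte_cas(std::atomic<std::uint32_t>& word, unsigned index,
                           std::uint8_t value) {
    assert(index < 4);
    const unsigned shift = index * 8;
    const std::uint32_t mask = std::uint32_t{0xFF} << shift;
    std::uint32_t expected = word.load(std::memory_order_relaxed);
    std::uint32_t desired;
    do {
        // Rebuild the word from the freshest copy, replacing only our byte.
        desired = (expected & ~mask) | (std::uint32_t{value} << shift);
        // On failure, compare_exchange_weak refreshes `expected` with the current word.
    } while (!word.compare_exchange_weak(expected, desired, std::memory_order_relaxed));
}
```

The retry loop is what makes the read-modify-write atomic: if another thread changes a neighbouring byte between the load and the store, the exchange fails and the merge is redone against the new word. It is also much more expensive than a plain store, which is why a compiler would only want to use it when it cannot prove the neighbouring bytes are private.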

IDK how slow byte load/store instructions are in any CPUs where they're implemented in hardware but not as cheap as word loads/stores. Byte loads are cheap on x86 as long as you use movzx/movsx to avoid partial-register false dependencies or merging stalls. (On AMD pre-Ryzen, movsx/movzx needs an extra ALU uop, but otherwise zero/sign extension is handled right in the load port on Intel and AMD CPUs.) The main x86 downside is that you need a separate load instruction instead of using a memory operand as a source for an ALU instruction (if you're adding a zero-extended byte to a 32-bit integer); a memory operand would have saved front-end uop throughput bandwidth and code size. Or if you're just adding a byte to a byte register, there's basically no downside on x86. RISC load-store ISAs always need separate load and store instructions anyway. x86 byte stores are no more expensive than 32-bit stores.
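In C++ source, the zero- and sign-extending byte loads discussed here come from ordinary integer conversions. The sketch below uses hypothetical helper names; the instruction choices named in the comments are what x86 compilers typically emit, not something the language guarantees.

```cpp
#include <cassert>
#include <cstdint>

// Zero-extending byte load; x86 compilers typically emit movzx (movzbl in AT&T syntax).
inline std::uint32_t load_zero_extended(const std::uint8_t* p) {
    assert(p != nullptr);
    return *p;
}

// Sign-extending byte load; x86 compilers typically emit movsx (movsbl in AT&T syntax).
inline std::int32_t load_sign_extended(const std::int8_t* p) {
    assert(p != nullptr);
    return *p;
}

// Adding a zero-extended byte to a 32-bit integer: the byte needs its own extending
// load first, because a plain 32-bit add cannot take a byte-sized memory operand.
inline std::uint32_t add_byte(std::uint32_t total, const std::uint8_t* p) {
    return total + load_zero_extended(p);
}
```

Writing the full 32-bit result avoids the partial-register dependencies mentioned above, which is why compilers prefer the extending loads over a plain byte `mov` into a register.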

As a performance issue, a good C++ implementation for hardware with slow byte stores might put each char in its own word and use word loads/stores whenever possible (e.g. for globals outside structs, and for locals on the stack). IDK if any real implementations of MIPS / ARM / whatever have slow byte load/store, but if so maybe gcc has -mtune= options to control it.
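A programmer can get the same one-char-per-word layout by hand with `alignas`. This is a sketch of the layout idea only; `Packed`, `OnePerWord`, `word_index` and `share_word` are illustrative names, and a 32-bit word is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Default layout: b and c are adjacent bytes and usually share a 32-bit word.
struct Packed {
    char b;
    char c;
};

// Each member is aligned to a 32-bit boundary, so each one gets a word to itself.
struct OnePerWord {
    alignas(std::uint32_t) char b;
    alignas(std::uint32_t) char c;
};

// Index of the aligned 32-bit word that contains address `p`.
inline std::uintptr_t word_index(const void* p) {
    assert(p != nullptr);
    return reinterpret_cast<std::uintptr_t>(p) / sizeof(std::uint32_t);
}

// True if both addresses lie in the same aligned 32-bit word.
inline bool share_word(const void* a, const void* b) {
    return word_index(a) == word_index(b);
}
```

With `OnePerWord`, a word-sized store to one member can never overlap the other, at the cost of wasting three bytes of padding per char.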

That doesn't help for char[], or dereferencing a char * when you don't know where it might be pointing. (This includes volatile char* which you'd use for MMIO.) So having the compiler+linker put char variables in separate words isn't a complete solution, just a performance hack if true byte stores are slow.
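The MMIO case can be sketched as a store through a `volatile` byte pointer. Here `write_register` is a hypothetical helper, and in the usage below an ordinary array stands in for a device register block, since a real MMIO address is platform-specific.

```cpp
#include <cassert>
#include <cstddef>

// Write one byte-wide register at `base + offset`.
// The compiler cannot know what `base` points to, so it cannot give the target its own
// word. Mainstream compilers perform a volatile char access as a single byte-sized
// store and do not widen it into a read-modify-write of the containing word.
inline void write_register(volatile unsigned char* base, std::size_t offset,
                           unsigned char value) {
    assert(base != nullptr);
    base[offset] = value;
}
```

For example, with `unsigned char regs[4] = {1, 2, 3, 4};`, calling `write_register(regs, 2, 0x7F)` changes only `regs[2]`. On a real device, a widened store could also rewrite the neighbouring registers, which may have side effects, so a true byte store is needed for correctness here and not just for speed.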

PS: More about Alpha:

Alpha is interesting for a lot of reasons: one of the few clean-slate 64-bit ISAs, not an extension to an existing 32-bit ISA. And one of the more recent clean-slate ISAs, Itanium being another from several years later which attempted some neat CPU-architecture ideas.

From the Linux Alpha HOWTO:

When the Alpha architecture was introduced, it was unique amongst RISC architectures for eschewing 8-bit and 16-bit loads and stores. It supported 32-bit and 64-bit loads and stores (longword and quadword, in Digital's nomenclature). The co-architects (Dick Sites, Rich Witek) justified this decision by citing the advantages:

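Byte support in the cache and memory sub-system tends to slow down accesses for 32-bit and 64-bit quantities.

Byte support makes it hard to build high-speed error-correction circuitry into the cache/memory sub-system.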

Alpha compensates by providing powerful instructions for manipulating bytes and byte groups within 64-bit registers. Standard benchmarks for string operations (e.g., some of the Byte benchmarks) show that Alpha performs very well on byte manipulation.
