How does the processor read memory?

Question

I'm trying to re-implement malloc and I need to understand the purpose of alignment. As I understand it, if memory is aligned, the code will execute faster because the processor won't have to take an extra step to recover the bits of memory that are cut off. I think I understand that a 64-bit processor reads memory 64 bits at a time. Now, let's imagine that I have a structure with, in order (without padding): a char, a short, a char, and an int. Why will the short be misaligned? We have all the data in the block! Why does it have to be at an address which is a multiple of 2? Same question for the integers and other types.

I also have a second question: with the structure I mentioned before, how does the processor know, when it reads its 64 bits, that the first 8 bits correspond to a char, then the next 16 correspond to a short, etc.?

Recommended answer

The effects can even include correctness, not just performance: it's C Undefined Behaviour (UB), possibly leading to segfaults or other misbehaviour, if you have a short object that doesn't satisfy alignof(short). (Faulting is expected on ISAs where load/store instructions require alignment by default, like SPARC, and MIPS before MIPS64r6.)

Or tearing of atomic operations if an _Atomic int isn't aligned to alignof(_Atomic int).

(Typically alignof(T) = sizeof(T) up to some size, often register width or wider, in any given ABI.)

malloc should return memory aligned to alignof(max_align_t), because you don't have any type info about how the allocation will be used.

For allocations smaller than sizeof(max_align_t), you can return memory that's merely naturally aligned (e.g. a 4-byte allocation aligned by 4 bytes) if you want, because you know that storage can't be used for anything with a higher alignment requirement.
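As a rough illustration of the alignment arithmetic an allocator needs for this, here is a minimal sketch. The helper names align_up and is_max_aligned are made up for this example and are not part of any standard library; the code assumes a C11 compiler and power-of-two alignments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round n up to the next multiple of align.
   Assumes align is a nonzero power of two. */
size_t align_up(size_t n, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Hypothetical helper: nonzero if p is aligned strictly enough
   for any standard type, i.e. to alignof(max_align_t). */
int is_max_aligned(const void *p) {
    return ((uintptr_t)p % _Alignof(max_align_t)) == 0;
}
```

An allocator could round each block's start offset with align_up(offset, _Alignof(max_align_t)) before handing it out, and use is_max_aligned as a debug check on the pointers it returns.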

Over-aligned stuff, like the dynamically-allocated equivalent of alignas(16) int32_t foo, needs a special allocator like C11 aligned_alloc. If you're implementing your own allocator library, you probably want to support aligned_realloc and aligned_calloc, filling the gaps that ISO C leaves for no apparent reason.

And make sure you don't implement the braindead ISO C++17 requirement for aligned_alloc to fail if the allocation size isn't a multiple of the alignment. Nobody wants an allocator that rejects an allocation of 101 floats starting on a 16-byte boundary, or on a much larger boundary for better transparent hugepages. See "aligned_alloc function requirements" and "How to solve the 32-byte-alignment issue for AVX load/store operations?"
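One way to sidestep that restriction when you're calling an existing aligned_alloc, rather than writing your own, is to pad the size up to a multiple of the alignment. This is only a sketch: padded_size and lenient_aligned_alloc are hypothetical names, and it assumes a C11 library that provides aligned_alloc and an alignment that is a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: smallest multiple of alignment that is >= size.
   Assumes alignment is a nonzero power of two. */
size_t padded_size(size_t size, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical wrapper: accepts any size, so a request such as
   101 floats on a 16-byte boundary is not rejected. */
void *lenient_aligned_alloc(size_t alignment, size_t size) {
    return aligned_alloc(alignment, padded_size(size, alignment));
}
```

With 4-byte floats, 101 floats is 404 bytes, which gets padded to 416 bytes for a 16-byte alignment. The result is released with free as usual.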

"I think I understand that a 64-bit processor reads memory 64 bits at a time"

Nope. Data bus width and burst size, and load/store execution unit max width or actually-used width, don't have to be the same as the width of integer registers, or however the CPU defines its bitness. (And in modern high-performance CPUs they typically aren't: e.g. the 32-bit P5 Pentium had a 64-bit bus; modern 32-bit ARM has load/store-pair instructions that do atomic 64-bit accesses.)

Processors read whole cache lines from DRAM / L3 / L2 cache into L1d cache: 64 bytes on modern x86, 32 bytes on some other systems.

And when reading individual objects or array elements, they read from L1d cache with the element width, e.g. a uint16_t array may only benefit from alignment to a 2-byte boundary for 2-byte loads/stores.

Or if a compiler vectorizes a loop with SIMD, a uint16_t array can be read 16 or 32 bytes at a time, i.e. SIMD vectors of 8 or 16 elements (or even 64 bytes with AVX512). Aligning arrays to the expected vector width can be helpful; unaligned SIMD loads/stores run fast on modern x86 when they don't cross a cache-line boundary.

Cache-line splits and especially page splits are where modern x86 slows down from misalignment; accesses that are unaligned within a cache line generally don't slow down, because these CPUs spend the transistors for fast unaligned load/store. Some other ISAs slow down, and some even fault, on any misalignment, even within a cache line. The solution is the same: give types natural alignment: alignof(T) = sizeof(T).
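To make "split" concrete, here is a small sketch of the check. The function name crosses_boundary is invented for this example, and the 64-byte line and 4096-byte page sizes used with it are the typical x86 values, not something the code can detect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if an access of `size` bytes starting at
   `addr` touches more than one `block`-sized chunk (e.g. a 64-byte
   cache line or a 4096-byte page). */
bool crosses_boundary(uintptr_t addr, size_t size, size_t block) {
    assert(size > 0 && block > 0);
    return (addr / block) != ((addr + size - 1) / block);
}
```

For example, a 4-byte load at address 62 spans two 64-byte lines, while the same load at address 60 stays inside one. A naturally aligned object whose size is a power of two no larger than the line can never split.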

In your struct example, modern x86 CPUs will have no penalty even though the short is misaligned. alignof(int) = 4 in any normal ABI, so the whole struct has alignof(struct) = 4, so the char;short;char block starts at a 4-byte boundary. Thus the short is contained within a single 4-byte dword, not crossing any wider boundary. AMD and Intel both handle this with full efficiency. (And the x86 ISA guarantees that accesses to it are atomic, even uncached, on CPUs compatible with P5 Pentium or later: see "Why is integer assignment on a naturally aligned variable atomic on x86?")
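The two layouts can be compared directly with offsetof. This sketch relies on the GCC/Clang extension __attribute__((packed)) to get the question's "without padding" layout; the struct and function names are made up for illustration, and the offsets in the comments assume a typical ABI with a 2-byte short and a 4-byte int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* What the compiler does by default: padding keeps each member aligned.
   Typical offsets: c1=0, s=2, c2=4, i=8. */
struct natural { char c1; short s; char c2; int i; };

/* The layout from the question, forced with a compiler extension.
   Offsets: c1=0, s=1 (misaligned), c2=3, i=4. */
struct __attribute__((packed)) packed_layout { char c1; short s; char c2; int i; };

static_assert(offsetof(struct packed_layout, s) == 1,
              "packed short sits at an odd offset");

/* Print where the short and the int land in each layout. */
void print_layouts(void) {
    printf("natural: s=%zu i=%zu size=%zu\n",
           offsetof(struct natural, s), offsetof(struct natural, i),
           sizeof(struct natural));
    printf("packed:  s=%zu i=%zu size=%zu\n",
           offsetof(struct packed_layout, s), offsetof(struct packed_layout, i),
           sizeof(struct packed_layout));
}
```

Note that in the packed version the short still fits inside the first 4 bytes, which is why x86 handles it without a penalty.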

Some non-x86 CPUs would have penalties for the misaligned short, or would have to use other instructions. (Since you know the alignment relative to an aligned 32-bit chunk, for loads you'd probably do a 32-bit load and shift.)
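In C terms, that load-and-shift might look like the sketch below. The function name is hypothetical; it assumes the short occupies bytes 1 and 2 of a 4-byte-aligned chunk, as in the struct from the question. Because those two bytes are the middle of the word, the same shift by 8 works on both little-endian and big-endian machines.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: word_base points at a 4-byte-aligned chunk whose
   bytes 1 and 2 hold a 16-bit value. */
uint16_t load_short_at_offset1(const unsigned char *word_base) {
    assert((uintptr_t)word_base % 4 == 0);
    uint32_t word;
    memcpy(&word, word_base, sizeof word); /* compiles to one aligned 32-bit load */
    return (uint16_t)(word >> 8);          /* bits 8..23 are bytes 1..2 in either byte order */
}
```

A compiler targeting an ISA without unaligned loads could generate something equivalent on its own when it knows the object's offset within an aligned struct.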

So yes, there's no problem accessing one single word containing the short, but the problem is for load-port hardware to extract and zero-extend (or sign-extend) that short into a full register. This is where x86 spends the transistors to make this fast. (@Eric's answer on a previous version of this question goes into more detail about the shifting required.)

Committing an unaligned store back into cache is also non-trivial. For example, L1d cache might have ECC (error-correction against bit flips) in 32-bit or 64-bit chunks (which I'll call "cache words"). Writing only part of a cache word is thus a problem for that reason, as is shifting the data to an arbitrary byte boundary within the cache word you want to access. (Coalescing of adjacent narrow stores in the store buffer can produce a full-width commit that avoids an RMW cycle to update part of a word, in caches that handle narrow stores that way.) Note that I'm saying "word" now because I'm talking about hardware that's more word-oriented instead of being designed around unaligned loads/stores the way modern x86 is. See "Are there any modern CPUs where a cached byte store is actually slower than a word store?" (storing a single byte is only slightly simpler than storing an unaligned short).

(If the short spans two cache words, it would of course need two separate RMW cycles, one for each byte.)

And of course the short is misaligned for the simple reason that alignof(short) = 2 and it violates this ABI rule (assuming an ABI that does have that). So if you pass a pointer to it to some other function, you could get into trouble, especially on CPUs that fault on misaligned loads instead of having hardware handle the case when an access turns out to be misaligned at runtime. Then you can get cases like "Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?", where GCC auto-vectorization expected to reach a 16-byte boundary by doing some multiple of 2-byte elements as scalar operations, so violating the ABI leads to a segfault on x86 (which is normally tolerant of misalignment).

For the full details on memory access, from DRAM RAS / CAS latency up to cache bandwidth and alignment, see What Every Programmer Should Know About Memory. It's pretty much still relevant and applicable.

Also, "Purpose of memory alignment" has a nice answer. There are plenty of other good answers in SO's memory-alignment tag.

For a more detailed look at (somewhat) modern Intel load/store execution units, see: https://electronics.stackexchange.com/questions/329789/how-can-cache-be-that-fast/329955#329955

"how does the processor know when it reads its 64 bits that the first 8 bits correspond to a char, then the next 16 correspond to a short etc...?"

It doesn't, other than the fact that it's running instructions that treat the data that way.

In asm / machine-code, everything is just bytes. Every instruction specifies exactly what to do with which data. It's up to the compiler (or human programmer) to implement variables with types, and the logic of a C program, on top of a raw array of bytes (main memory).

What I mean by that is that in asm, you can run any load or store instruction you want to, and it's up to you to use the right ones on the right addresses. You could load 4 bytes that overlap two adjacent int variables into a floating-point register, then run addss (single-precision FP add) on it, and the CPU won't complain. But you probably don't want to, because making the CPU interpret those 4 bytes as an IEEE754 binary32 float is unlikely to be meaningful.
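The closest well-defined C equivalent of that asm experiment is a memcpy into a float. This is just a sketch: float_from_bytes is a made-up name, and it assumes float is a 4-byte IEEE754 binary32 value stored with the same byte order as integers, which holds on mainstream platforms.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "assumes a 4-byte float");

/* Hypothetical helper: reinterpret whatever 4 bytes live at p as a float.
   The hardware has no idea what type those bytes were "meant" to be. */
float float_from_bytes(const void *p) {
    float f;
    memcpy(&f, p, sizeof f);
    return f;
}
```

Given int pair[2], calling float_from_bytes((const unsigned char *)pair + 2) reads 2 bytes from each int. It compiles and runs without complaint, but the resulting value is meaningless unless the bytes happen to encode a float you care about.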
