Why does misaligned address access incur 2 or more accesses?


Question



The normal answers to why data alignment matters are that it allows more efficient access and simplifies the design of the CPU.

A relevant question and its answers are here. And another source is here. But neither of them resolves my question.

Suppose a CPU has an access granularity of 4 bytes. That means the CPU reads 4 bytes at a time. The materials listed above both say that if I access misaligned data, say at address 0x1, then the CPU has to do 2 accesses (one from addresses 0x0, 0x1, 0x2 and 0x3, one from addresses 0x4, 0x5, 0x6 and 0x7) and combine the results. I can't see why. Why can't the CPU just read data from 0x1, 0x2, 0x3 and 0x4 when I issue an access to address 0x1? It would not degrade performance or add much complexity to the circuitry.
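To make the scenario concrete, here is a minimal C sketch (an illustration added here, not part of the original question) that works out which aligned 4-byte words a 4-byte load at 0x1 touches:

    #include <stdio.h>

    int main(void) {
        unsigned addr  = 0x1;                      /* misaligned start address  */
        unsigned size  = 4;                        /* bytes requested           */
        unsigned first = addr & ~3u;               /* aligned word holding addr */
        unsigned last  = (addr + size - 1) & ~3u;  /* aligned word of last byte */

        if (first != last)
            printf("load at 0x%x spans words 0x%x and 0x%x -> 2 accesses\n",
                   addr, first, last);
        else
            printf("load at 0x%x fits in word 0x%x -> 1 access\n", addr, first);
        return 0;
    }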

Thank you in advance!

Solution

"It would not degrade performance or add much complexity to the circuitry."

It's the false assumptions we take as fact that really throw off further understanding.

Your comment in the other question used much more appropriate wording ("I don't think it would degrade"...)

Did you consider that the memory architecture uses many memory chips in parallel in order to maximize the bandwidth? And that a particular data item lives in only one chip? You can't just read whatever chip happens to be most convenient and expect it to have the data you want.

Right now, the CPU and memory can be wired together such that bits 0-7 are wired only to chip 0, 8-15 to chip 1, 16-23 to chip 2, 24-31 to chip 3. And for all integers N, memory location 4N is stored in chip 0, 4N+1 in chip 1, etc. And it is the Nth byte in each of those chips.

Let's look at the memory addresses stored at each offset of each memory chip:

memory chip       0       1       2       3
offset

    0             0       1       2       3
    1             4       5       6       7
    2             8       9      10      11
    N            4N    4N+1    4N+2    4N+3
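To spell out the mapping in the table, here is a small C sketch (purely an illustration of the hypothetical 4-chip wiring above, not something the hardware runs): byte address A sits in chip A % 4 at internal offset A / 4.

    #include <stdio.h>

    int main(void) {
        for (unsigned addr = 0; addr < 12; addr++) {
            unsigned chip   = addr % 4;  /* which memory chip holds this byte  */
            unsigned offset = addr / 4;  /* internal offset (row) in that chip */
            printf("byte %2u -> chip %u, offset %u\n", addr, chip, offset);
        }
        return 0;
    }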

So if you load from memory bytes 0-3, N=0, each chip reports its internal byte 0, the bits all end up in the right places, and everything is great.

Now, if you try to load a word starting at memory location 1, what happens?

First, let's look at the way it is done. Memory bytes 1-3, which are stored in memory chips 1-3 at offset 0, end up in bits 8-31, because that's where those memory chips are attached, even though you asked for them to be in bits 0-23. This isn't a big deal because the CPU can swizzle them internally, using the same circuitry used for logical shift left. Then, on the next transaction, memory byte 4, which is stored in memory chip 0 at offset 1, gets read into bits 0-7 and swizzled into bits 24-31, where you wanted it to be.
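Here is a minimal C sketch of that shift-and-combine step (an illustration under the assumptions above, with little-endian byte order; none of this code is from the original answer):

    #include <stdio.h>

    /* Simulated memory: two aligned 4-byte words. */
    static const unsigned char mem[8] = {0x10, 0x11, 0x12, 0x13,
                                         0x14, 0x15, 0x16, 0x17};

    /* One "memory transaction": every chip reports its byte at this offset. */
    static unsigned read_aligned_word(unsigned offset) {
        const unsigned char *p = &mem[offset * 4];
        return (unsigned)p[0] | (unsigned)p[1] << 8 |
               (unsigned)p[2] << 16 | (unsigned)p[3] << 24;
    }

    int main(void) {
        unsigned addr  = 1;                             /* misaligned address   */
        unsigned shift = (addr % 4) * 8;                /* 8 bits for address 1 */
        unsigned lo = read_aligned_word(addr / 4);      /* bytes 0-3            */
        unsigned hi = read_aligned_word(addr / 4 + 1);  /* bytes 4-7            */

        /* Drop the unwanted low bytes of the first word and splice in the
         * byte(s) carried by the second word (assumes addr is misaligned;
         * a shift by 32 below would otherwise be undefined). */
        unsigned result = (lo >> shift) | (hi << (32 - shift));
        printf("unaligned load at %u = 0x%08x\n", addr, result); /* 0x14131211 */
        return 0;
    }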

Notice something here. The word you asked for is split across offsets: the first memory transaction reads from offset 0 of three chips, the second memory transaction reads from offset 1 of the other chip. Here's where the problem lies. You have to tell the memory chips the offset so they can send you the right data back, and the offset is ~40 bits wide and the signals are VERY high speed. Right now there is only one set of offset signals that connects to all the memory chips; to do a single transaction for an unaligned memory access, you would need an independent offset (called the address bus, BTW) running to each memory chip. For a 64-bit processor, you'd change from one address bus to eight, an increase of almost 300 pins. In a world where CPUs use between 700 and 1300 pins, this can hardly be called "not much increase in circuitry". Not to mention the huge increase in noise and crosstalk from that many extra high-speed signals.

Ok, it isn't quite that bad, because there can only be a maximum of two different offsets out on the address bus at once, and one is always the other plus one. So you could get away with one extra wire to each memory chip, saying in effect either (read the offset listed on the address bus) or (read the offset following it), which is two states. But now there's an extra adder in each memory chip, which means it has to calculate the offset before actually doing the memory access, which slows down the maximum clock rate for memory. That means aligned access gets slower if you want unaligned access to be faster. Since 99.99% of accesses can be made aligned, this is a net loss.
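And here is a sketch of that "offset on the bus, or the one after it" selection (again just an illustration, reusing the hypothetical 4-chip layout from above):

    #include <stdio.h>

    int main(void) {
        unsigned addr = 1;           /* misaligned start of a 4-byte load       */
        unsigned base = addr / 4;    /* offset driven on the shared address bus */
        unsigned mis  = addr % 4;    /* how far into the aligned word we start  */

        for (unsigned chip = 0; chip < 4; chip++) {
            /* Chips "below" the start address need the next row: this is the
             * decision the extra wire plus per-chip adder would make. */
            unsigned offset = base + (chip < mis ? 1u : 0u);
            printf("chip %u reads offset %u (memory byte %u)\n",
                   chip, offset, offset * 4 + chip);
        }
        return 0;
    }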

So that's why unaligned access gets split into two steps: the address bus is shared by all the bytes involved. And this is actually a simplification, because when you have different offsets, you also have different cache lines involved, so all the cache coherency logic would have to double to handle twice the communication between CPU cores.
