CPUs in multi-core architectures and memory access


Question

I wondered how memory access is handled "in general" if, for example, two cores of a CPU try to access memory at the same time (over the memory controller). Actually, the same applies when a core and a DMA-enabled IO device try to access memory in the same way.

I think the memory controller is smart enough to utilise the address bus and handle those requests concurrently; however, I'm not sure what happens when they try to access the same location, or when an IO operation monopolises the address bus and there's no room for the CPU to move on.

Thx

Answer

The short answer is "it's complex, but access can certainly potentially occur in parallel in certain situations".

I think your question is a bit too black and white: you may be looking for an answer like "yes, multiple devices can access memory at the same time" or "no they can't", but the reality is that first you'd need to describe some specific hardware configuration, including some of the low-level implementation details and optimization features to get an exact answer. Finally you'd need to define exactly what you mean by "the same time".

In general, a good first-order approximation is that hardware will make it appear that all hardware can access memory approximately simultaneously, possibly with an increase in latency and a decrease in bandwidth due to contention. At a very fine-grained timing level, access by one device may indeed postpone access by another device, or it may not, depending on many factors. It is extremely unlikely you would need this information to implement software correctly, and quite unlikely you need to know the details even to maximize performance.

That said, if you really need to know the details, read on and I can give some general observations on some kind of idealized laptop/desktop/server scale hardware.

As Matthias mentioned, you first have to consider caching. Caching means that any read or write operation subject to caching (which includes nearly all CPU requests and many other types of requests as well) may not touch memory at all, so in that sense many cores can "access" memory (at least the cache image of it) simultaneously.
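To make the caching point concrete, here is a toy model (the cache size, line size, and direct-mapped policy are illustrative assumptions, not any real CPU's design) counting how many of a core's reads actually reach DRAM:

```python
# Toy direct-mapped cache: shows that most CPU reads are served from
# cache and never generate DRAM traffic. All parameters are invented
# for illustration, not taken from any real part.
LINE_SIZE = 64        # bytes per cache line (assumed)
NUM_LINES = 8         # tiny cache, for demonstration only

def simulate(addresses):
    """Return (cache_hits, dram_accesses) for a stream of byte addresses."""
    tags = [None] * NUM_LINES          # tag stored per cache line
    hits = dram = 0
    for addr in addresses:
        line = addr // LINE_SIZE       # which memory line the address is in
        idx, tag = line % NUM_LINES, line // NUM_LINES
        if tags[idx] == tag:
            hits += 1                  # served from cache: DRAM untouched
        else:
            tags[idx] = tag            # miss: fetch the line from DRAM
            dram += 1
    return hits, dram

# A core scanning within the same 64-byte line touches DRAM exactly once.
print(simulate([0, 8, 16, 24, 32]))   # -> (4, 1)
```

In this idealized model a second core with its own cache would likewise satisfy repeated reads locally, so both cores "access memory" at once without ever contending for DRAM.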

If you then consider requests that miss in all cache levels, you need to know about the configuration of the memory subsystem. In general, a RAM chip can only do "one thing" at a time (i.e., commands1 such as read and write apply to the entire module), and that usually extends to DRAM modules comprised of several chips, and also to a series of DRAMs connected via a bus to a single memory controller.

So you can say that, electrically speaking, the combination of one memory controller and its attached RAM is likely to be doing only one thing at once. Now that thing is usually something like reading bytes out of a physically contiguous span of bytes, but that operation could actually help handle several requests from different devices at once: even though each device sends separate requests to the controller, good implementations will coalesce requests to the same or nearby2 areas of memory.
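The coalescing idea can be sketched as follows; the 64-byte burst granule is an assumption for illustration (real controllers have more elaborate scheduling):

```python
# Sketch of request coalescing at a memory controller: requests from
# different devices that land in the same DRAM burst are merged into
# one transaction. The burst size is an illustrative assumption.
BURST = 64  # bytes transferred per DRAM transaction (assumed)

def transactions(requests):
    """requests: list of (device, byte_address) pairs. Returns the
    number of DRAM transactions needed after coalescing requests that
    fall within the same burst-sized region."""
    return len({addr // BURST for _, addr in requests})

# Two devices asking for adjacent bytes cost a single transaction...
print(transactions([("cpu0", 100), ("dma", 101)]))  # -> 1
# ...while distant addresses are served one burst at a time.
print(transactions([("cpu0", 0), ("dma", 4096)]))   # -> 2
```

In the first case both devices effectively access memory "at the same time" in the strongest sense: a single electrical operation satisfies both requests.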

Furthermore, even the CPU may have such abilities: when a new request occurs it can/must notice that an existing request is in progress for an overlapping region and tie the new request to an old one.

Still, you can say that for a single memory controller you'll usually be serving the request of one device at a time, absent unusual opportunities to combine requests. Now the requests themselves are typically on the order of nanoseconds, so many separate requests can be served in a small unit of time, so this "exclusiveness" is fine-grained and not generally noticeable3.

Now above I was careful to limit the discussion to a single memory-controller - when you have multiple memory controllers4 you can definitely have multiple devices accessing memory simultaneously even at the RAM level. Here each controller is essentially independent, so if the requests from two devices map to different controllers (different NUMA regions) they can proceed in parallel.
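A hedged sketch of that independence (the interleave granule and the modulo mapping are assumptions; real platforms use various, often hashed, address-to-controller mappings):

```python
# Toy model of multiple memory controllers: addresses are assumed to be
# interleaved across controllers at a fixed granule. Requests that map
# to different controllers can proceed in parallel; requests to the
# same controller queue up behind each other.
NUM_CONTROLLERS = 2
GRANULE = 4096  # interleave granularity in bytes (assumed)

def controller_for(addr):
    """Map a byte address to a controller (illustrative mapping only)."""
    return (addr // GRANULE) % NUM_CONTROLLERS

def parallel_groups(addrs):
    """Group requests per controller; in this idealized model the
    deepest per-controller queue bounds how long the batch takes."""
    queues = {}
    for a in addrs:
        queues.setdefault(controller_for(a), []).append(a)
    return queues

# A CPU and a DMA device touching different 4 KiB regions land on
# different controllers here and proceed in parallel.
print(sorted(parallel_groups([0, 4096])))   # -> [0, 1]
```

This is the multiple-controller case from the paragraph above: two devices whose requests map to different NUMA regions genuinely access RAM simultaneously, not just in finely-sliced turns.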

This answer is already long.

1 In fact, the command stream is lower level and more complex than things like "read" or "write" and involves concepts such as opening a memory page, streaming bytes from it, etc. What every programmer should know about memory serves as an excellent intro to the topic.
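The open-a-page-then-stream-bytes idea in this footnote can be sketched with a toy single-bank row-buffer model; the row size and latency figures are made-up units, not datasheet values:

```python
# Toy model of one DRAM bank's row buffer: a command stream must
# "open" (activate) a row before streaming bytes from it, so accesses
# within the open row are much cheaper than row-to-row hops.
# Row size and costs are invented units, not real timings.
ROW_SIZE = 1024          # bytes per DRAM row (assumed)
T_HIT, T_MISS = 1, 10    # cost: row already open vs. open a new row

def access_cost(addresses):
    """Total cost of servicing reads on a single bank, in toy units."""
    open_row, cost = None, 0
    for addr in addresses:
        row = addr // ROW_SIZE
        cost += T_HIT if row == open_row else T_MISS
        open_row = row            # the accessed row stays open
    return cost

# Streaming within one row is cheap; hopping between rows is not.
print(access_cost([0, 64, 128]))   # -> 12  (one row miss, two hits)
print(access_cost([0, 1024, 0]))   # -> 30  (three row misses)
```

This is also one reason controllers reorder and coalesce requests: keeping a row open across several requests amortizes the activation cost.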

2 For example, imagine two requests for adjacent bytes in memory: it is possible the controller can combine them into a single request if they fit within the bus width.

3 Of course if you are competing for memory across several devices, the overall impact may be very noticeable: a reduction in per-device bandwidth and an increase in latency, but what I mean is that the sharing is fine-grained enough that you can't generally tell the difference between finely-sliced exclusive access and some hypothetical device which makes simultaneous progress on each request in each period.

4 The most common configuration on modern hardware is one memory controller per socket, so on a 2P system you'd usually have two controllers; other ratios (both higher and lower) are certainly possible.
