Which is optimal: a bigger block cache size or a smaller one?

Problem description

Given a cache with constant capacity and associativity, for the given code that computes the average of array elements, would a cache with a higher block size be preferred?

[from comments]

Examine the code given below to compute the average of an array:

total = 0; 
for(j=0; j < k; j++) { 
  sub_total = 0; /* Nested loops to avoid overflow */ 
  for(i=0; i < N; i++) { 
    sub_total += A[j*N + i]; 
  } 
  total += sub_total/N; 
} 
average = total/k;

Solution

Related: in the more general case of typical access patterns with some but limited spatial locality, larger lines help up to a point. These "Memory Hierarchy: Set-Associative Cache" (PowerPoint) slides by Hong Jiang and/or Yifeng Zhu (U. Maine) have a graph of AMAT (Average Memory Access Time) vs. block size, showing a curve and also breaking it down into miss penalty vs. miss rate (for a simple model, I think: a simple in-order CPU that sucks at hiding memory latency, e.g. maybe not even pipelining multiple independent misses (miss under miss)).

There is a lot of good stuff in those slides, including a compiler-optimization section that mentions loop interchange (to fix nested loops that traverse in column-major instead of row-major order), and even cache-blocking for more reuse. A lot of stuff on the Internet is crap, but I looked through these slides and they have solid info on how caches are designed and what the tradeoffs are. The performance-analysis material is only really accurate for simple CPUs, though, not for modern out-of-order CPUs that can overlap some computation with cache-miss latency, so having more but shorter misses is not the same as having fewer but longer misses.
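
For reference, the AMAT model behind that graph is the standard one; the numbers below are invented purely to illustrate the shape of the curve (larger blocks lower the miss rate for code with some spatial locality, but raise the miss penalty, so AMAT bottoms out at some intermediate block size):

AMAT = hit_time + miss_rate * miss_penalty

e.g. with a 1 ns hit time (illustrative numbers only):
  32-byte blocks:  1 + 0.040 * 60  = 3.40 ns
  64-byte blocks:  1 + 0.020 * 70  = 2.40 ns
 256-byte blocks:  1 + 0.015 * 110 = 2.65 ns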


Specific answer to this question:

So the only workload you care about is a linear traversal of your elements? That makes cache line size nearly irrelevant for performance, assuming good hardware prefetching. (So larger lines mean less HW complexity and power usage for the same performance.)

With software prefetch, larger lines mean less prefetch overhead (although depending on the CPU design, that may not hurt performance if you still max out memory bandwidth.)
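
As a rough illustration of that overhead argument: with software prefetch you typically issue one prefetch instruction per cache line rather than per element, so doubling the line size halves the number of prefetches for the same array. A minimal sketch using GCC/Clang's __builtin_prefetch, assuming a 64-byte line, 4-byte elements, and a prefetch distance of 4 lines (all three numbers are illustrative, not tuned):

#include <stddef.h>

/* Sum a float array, issuing one software prefetch per (assumed) 64-byte line. */
double sum_with_prefetch(const float *A, size_t n)
{
    const size_t elems_per_line = 64 / sizeof(A[0]);  /* 64-byte line is an assumption */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i % elems_per_line == 0)
            /* Prefetch ~4 lines ahead; prefetching past the end of the array is
               harmless in practice because prefetch instructions don't fault. */
            __builtin_prefetch(&A[i + 4 * elems_per_line], 0, 0);
        sum += A[i];
    }
    return sum;
}

With 128-byte lines the same loop would execute half as many prefetch instructions (and half as many line-boundary checks), which is the overhead reduction described above.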

Without any prefetching, a larger line/block size would mean more hits following every demand-miss. A single traversal of an array has perfect spatial locality and no temporal locality. (Actually not quite perfect spatial locality at the start/end, if the array isn't aligned to the start of a cache line, and/or ends in the middle of a line.)
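
To put a hedged number on that: with no prefetching, a linear traversal takes roughly one demand miss per cache line, so the miss rate is about element_size / line_size. Assuming 4-byte elements, for example:

 64-byte lines:  4 / 64  = 1 miss per 16 accesses  (~6.25% miss rate)
128-byte lines:  4 / 128 = 1 miss per 32 accesses  (~3.1% miss rate)

Every doubling of the line size halves the number of demand misses for this access pattern, which is where the bigger-block advantage comes from in the no-prefetch case.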

If a miss has to wait until the entire line is present in cache before the load that caused the miss can be satisfied, this slightly reduces the advantage of larger blocks. (But most of the latency of a cache miss is in the signalling and request overhead, not in waiting for the burst transfer to complete after it's already started.)

A larger block size means fewer requests in flight with the same bandwidth and latency, and limited concurrency is a real limiting factor in memory bandwidth in real CPUs. (See the latency-bound platforms part of this answer about x86 memory bandwidth: many-core Xeons with higher latency to L3 cache have lower single-threaded bandwidth than a dual or quad-core of the same clock speed. Each core only has 10 line-fill buffers to track outstanding L1 misses, and bandwidth = concurrency / latency.)
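
Plugging rough numbers into that formula (the 10 line-fill buffers per core is the figure cited above, the 64-byte line is standard on x86, and the latencies are round, illustrative values):

single-thread bandwidth ~ concurrency / latency = (10 LFBs * 64 B) / latency
  80 ns to DRAM:  640 B / 80 ns = 8.0 GB/s
  90 ns to DRAM (e.g. a bigger many-core uncore):  640 B / 90 ns ~ 7.1 GB/s

With the same number of outstanding misses, the higher-latency part gets less single-threaded bandwidth; a larger block size helps here simply because each outstanding miss moves more bytes.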

If your cache-miss handling has an early restart design, even that bit of extra latency can be avoided. (That's very common, but Paul says it's theoretically possible for a CPU design not to have it.) The load that caused the miss gets its data as soon as it arrives. The rest of the cache-line fill happens "in the background", and hopefully later loads can also be satisfied from the partially received cache line.

Critical word first is a related feature, where the needed word is sent first (for use with early restart), and the burst transfer then wraps around to transfer the earlier words of the block. In this case, the critical word will always be the first word, so no special hardware support is needed beyond early restart. (The U. Maine slides I linked above mention early restart / critical word first and point out that it decreases the miss penalty for large cache lines.)


An out-of-order execution CPU (or software pipelining on an in-order CPU) could give you the equivalent of HW prefetch by having multiple demand-misses outstanding at once. If the CPU "sees" the loads to another cache line while a miss to the current cache line is still outstanding, the demand-misses can be pipelined, again hiding some of the difference between larger or smaller lines.

If lines are too small, you'll run into a limit on how many outstanding misses for different lines your L1D can track. With larger lines or smaller out-of-order windows, you might have some "slack" when there's no outstanding request for the next cache line, so you're not maxing out the bandwidth. And you pay for it with bubbles in the pipeline when you get to the end of a cache line and the start of the next line hasn't arrived yet, because it started too late (while ALU execution units were using data from too close to the end of the current cache line.)


Related: these slides don't say much about the tradeoff of larger vs. smaller lines, but look pretty good.
