Cache friendly offline random read


Problem description

Consider this function in C++:

 void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
    while (b1 != b2) {
        // assert(0 <= *b1 && *b1 < a2 - a1)
        *o++ = a1[*b1++];
    }
}
 

Its purpose should be clear enough. Unfortunately, b1 contains random data and thrashes the cache, making foo the bottleneck of my program. Is there any way I can optimize it?

This is an SSCCE that should resemble my actual code:

 #include <iostream>
#include <chrono>
#include <algorithm>
#include <numeric>

namespace {
    void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
        while (b1 != b2) {
            // assert(0 <= *b1 && *b1 < a2 - a1)
            *o++ = a1[*b1++];
        }
    }

    constexpr unsigned max_n = 1 << 24, max_q = 1 << 24;
    uint32_t data[max_n], index[max_q], result[max_q];
}

int main() {
    uint32_t seed = 0;
    auto rng = [&seed]() { return seed = seed * 9301 + 49297; };
    std::generate_n(data, max_n, rng);
    std::generate_n(index, max_q, [rng]() { return rng() % max_n; });

    auto t1 = std::chrono::high_resolution_clock::now();
    foo(data, data + max_n, index, index + max_q, result);
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration<double>(t2 - t1).count() << std::endl;

    uint32_t hash = 0;
    for (unsigned i = 0; i < max_q; i++)
        hash += result[i] ^ (i << 8) ^ i;
    std::cout << hash << std::endl;
}
 

This is not Cache-friendly copying of an array with readjustment by known index, gather, scatter, which asks about random writes and assumes b is a permutation.

Solution

First, let's take a look at the actual performance of the code above:

$ sudo perf stat ./offline-read
0.123023
1451229184

 Performance counter stats for './offline-read':

        184.661547      task-clock (msec)         #    0.997 CPUs utilized          
                 3      context-switches          #    0.016 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               717      page-faults               #    0.004 M/sec                  
       623,638,834      cycles                    #    3.377 GHz                    
       419,309,952      instructions              #    0.67  insn per cycle         
        70,803,672      branches                  #  383.424 M/sec                  
            16,895      branch-misses             #    0.02% of all branches        

       0.185129552 seconds time elapsed

We are getting a low IPC of 0.67, probably caused almost entirely by load misses to DRAM5. Let's confirm:

sudo ../pmu-tools/ocperf.py stat -e cycles,LLC-load-misses,cycle_activity.stalls_l3_miss ./offline-read
perf stat -e cycles,LLC-load-misses,cpu/event=0xa3,umask=0x6,cmask=6,name=cycle_activity_stalls_l3_miss/ ./offline-read
0.123979
1451229184

 Performance counter stats for './offline-read':

       622,661,371      cycles                                                      
        16,114,063      LLC-load-misses                                             
       368,395,404      cycle_activity_stalls_l3_miss                                   

       0.184045411 seconds time elapsed

So ~370k cycles out of 620k are stalled outright on outstanding misses. In fact, the portion of cycles stalled this way in foo() is much higher, close to 90%, since perf is also measuring the init and accumulate code, which takes about a third of the runtime (but doesn't have significant L3 misses).

This is nothing unexpected, since we knew the random-read pattern a1[*b1++] was going to have essentially zero locality. In fact, the number of LLC-load-misses is 16 million1, corresponding almost exactly to the 16 million random reads of a1.2

If we just assume 100% of foo() is spent waiting on memory access, we can get an idea of the total cost of each miss: 0.123 sec / 16,114,063 misses == 7.63 ns/miss. On my box, memory latency is around 60 ns in the best case, so less than 8 ns per miss means we are already extracting a lot of memory-level parallelism (MLP): about 8 misses would have to be overlapped and in flight, on average, to achieve this (even totally ignoring the additional traffic from the streaming load of b1 and the streaming write of o).

So I don't think there are many tweaks you can apply to the simple loop to do much better. Still, two possibilities are:

  • Non-temporal stores for the writes to o, if your platform supports them. This would cut out the read implied by the RFO for normal stores. It should be a straight win, since o is never read again (inside the timed portion!).
  • Software prefetching. Carefully tuned prefetching of a1 or b1 could potentially help a bit. The impact is going to be fairly limited, however, since we are already approaching the limits of MLP as described above. Also, we expect the linear reads of b1 to be almost perfectly prefetched by the hardware prefetchers. The random reads of a1 seem like they could be amenable to prefetching, but in practice the ILP in the loop leads to enough MLP through out-of-order processing (at least on big OoO processors like recent x86).

    In the comments, user harold already mentioned that he tried prefetching, with only a small effect. A sketch combining both ideas is shown just below.
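
As a rough illustration of both suggestions (a sketch, not code from the question), here is what foo() might look like on x86 with non-temporal stores and a software prefetch. The name foo_nt and the prefetch distance of 16 are made-up; whether this helps at all would have to be measured:

#include <cstdint>
#include <immintrin.h>  // _mm_prefetch, _mm_stream_si32, _mm_sfence (x86 SSE/SSE2)

// Hypothetical variant of foo(): streaming (non-temporal) stores for o, plus a
// software prefetch a fixed distance ahead into the random reads of a1.
// The prefetch distance of 16 is a guess and would need tuning.
void foo_nt(const uint32_t *a1, const uint32_t *a2,
            const uint32_t *b1, const uint32_t *b2, uint32_t *o) {
    (void)a2;  // bounds are guaranteed by the caller, as in the original
    while (b1 != b2) {
        if (b2 - b1 > 16)
            _mm_prefetch(reinterpret_cast<const char *>(&a1[b1[16]]), _MM_HINT_T0);
        // MOVNTI: write o without first reading its cache line (no RFO).
        _mm_stream_si32(reinterpret_cast<int *>(o++), static_cast<int>(a1[*b1++]));
    }
    _mm_sfence();  // order the streaming stores before any later reads of o
}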

So since the simple tweaks aren't likely to bear much fruit, you are left with transforming the loop. One "obvious" transformation is to sort the indexes b1 (along with each index element's original position) and then do the reads from a1 in sorted order. This transforms the reads of a1 from completely random to almost3 linear, but now the writes are all random, which is no better.

Sort and then unsort

The key problem is that the reads of a1 under the control of b1 are random, and since a1 is large you get a miss to DRAM for essentially every read. We can fix that by sorting b1 and then reading a1 in order to get a permuted result. Now you need to "un-permute" that result to get the output in the final order, which is simply another sort, this time on the "output index".

Here's a worked example with the given input array a, index array b, and output array o, and i, which is the (implicit) position of each element:

  i =   0   1   2   3
  a = [00, 10, 20, 30]
  b = [ 3,  1,  0,  1]
  o = [30, 10, 00, 10] (desired result)

First, sort array b with the original array position i as secondary data (alternatively, you may see this as sorting tuples (b[0], 0), (b[1], 1), ...). This gives you the sorted b array b' and the sorted index list i' as shown:

  i' = [ 2, 1, 3, 0]
  b' = [ 0, 1, 1, 3]

Now you can read the permuted result array o' from a under the control of b'. This read is strictly increasing in order and should be able to operate at close to memcpy speeds. In fact, you may be able to take advantage of wide contiguous SIMD reads and some shuffles to do several reads at once and move the 4-byte elements into the right place (duplicating some elements and skipping others):

  a  = [00, 10, 20, 30]
  b' = [ 0,  1,  1,  3]
  o' = [00, 10, 10, 30]

Finally, you de-permute o' to get o, conceptually simply by sorting o' on the permuted indexes i':

  i' = [ 2,  1,  3,  0]
  o' = [00, 10, 10, 30]
  i  = [ 0,  1,  2,  3]
  o  = [30, 10, 00, 10]

Finished!
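
To make the three passes concrete, here is a minimal sketch of the whole transformation (not cache-optimized): std::sort stands in for the cache-aware radix sort discussed below, and foo_sorted is just an illustrative name.

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

void foo_sorted(const uint32_t *a1, const uint32_t *a2,
                const uint32_t *b1, const uint32_t *b2, uint32_t *o) {
    (void)a2;
    const size_t q = static_cast<size_t>(b2 - b1);

    // Pass 1: sort tuples (b[j], j), giving b' (.first) and i' (.second).
    std::vector<std::pair<uint32_t, uint32_t>> keyed(q);
    for (size_t j = 0; j < q; j++)
        keyed[j] = { b1[j], static_cast<uint32_t>(j) };
    std::sort(keyed.begin(), keyed.end());

    // Pass 2: read a1 under control of b' -- strictly non-decreasing addresses.
    std::vector<uint32_t> o_prime(q);
    for (size_t j = 0; j < q; j++)
        o_prime[j] = a1[keyed[j].first];

    // Pass 3: de-permute o' into o using i'. Done this naively the writes are
    // random again; the notes below on cache blocking and smart de-permutation
    // discuss how to make this step cheaper.
    for (size_t j = 0; j < q; j++)
        o[keyed[j].second] = o_prime[j];
}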

Now, this is the simplest idea of the technique and isn't particularly cache friendly (each pass conceptually iterates over one or more 2^26-byte arrays), but it at least fully uses every cache line it reads (unlike the original loop, which reads only a single element from each cache line; that is why you have 16 million misses even though the data only occupies 1 million cache lines!). All of the reads are more or less linear, so hardware prefetching will help a lot.

How much speedup you get depends largely on how you implement the sorts: they need to be fast and cache sensitive. Almost certainly some type of cache-aware radix sort will work best.

Here are some notes on ways to improve this further:

Optimize the amount of sorting

You don't actually need to fully sort b. You just want to sort it "enough" that the subsequent reads of a under the control of b' are more or less linear. For example, 16 elements fit in a cache line, so you don't need to sort based on the last 4 bits at all: the same linear sequence of cache lines will be read anyway. You could also sort on even fewer bits: e.g., if you ignored the 5 least-significant bits, you'd read cache lines in an "almost linear" way, sometimes swapping two cache lines from the perfectly linear pattern, like 0, 1, 3, 2, 5, 4, 6, 7. Here you'll still get the full benefit of the L1 cache (subsequent reads of a cache line will always hit), and I suspect such a pattern would still be prefetched well; if not, you can always help it with software prefetching.
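
For example, here is a sketch of a single counting-sort pass that orders the queries only by the top bits of each index, leaving the low bits untouched. The constants (24 index bits, 6 ignored bits) and the names are illustrative assumptions, not taken from the question:

#include <cstdint>
#include <vector>

constexpr unsigned kIndexBits   = 24;  // indices are < 2^24 (max_n in the SSCCE)
constexpr unsigned kIgnoredBits = 6;   // never sort on the low 6 bits
constexpr unsigned kBuckets     = 1u << (kIndexBits - kIgnoredBits);

// Groups the queries by index >> kIgnoredBits, producing a "sorted enough" b'
// plus the matching original positions i' (note that no [0, 1, 2, ...] array is
// ever materialized; see the de-permutation notes below).
void partial_sort_indices(const uint32_t *b, uint32_t q,
                          std::vector<uint32_t> &b_sorted,
                          std::vector<uint32_t> &i_sorted) {
    std::vector<uint32_t> offset(kBuckets + 1, 0);
    for (uint32_t j = 0; j < q; j++)
        offset[(b[j] >> kIgnoredBits) + 1]++;
    for (uint32_t k = 0; k < kBuckets; k++)
        offset[k + 1] += offset[k];        // prefix sums -> bucket start offsets

    b_sorted.resize(q);
    i_sorted.resize(q);
    for (uint32_t j = 0; j < q; j++) {
        uint32_t pos = offset[b[j] >> kIgnoredBits]++;
        b_sorted[pos] = b[j];              // index, grouped by its high bits
        i_sorted[pos] = j;                 // original position, for un-permuting
    }
}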

You can test on your system what the optimal number of ignored bits is. Ignoring bits has two benefits:

  • Less work to do in the radix sort, either because fewer passes are needed or because fewer buckets are needed in one or more passes (which helps caching).
  • Potentially less work to do to "undo" the permutation in the last step: if you undo it by examining the original index array b, ignoring bits means that you get the same savings when undoing the sort.

Cache block the work

The above description lays everything out in several sequential, disjoint passes that each work on the entire data set. In practice, you'd probably want to interleave them to get better caching behavior. For example, assuming you use an MSD radix-256 sort, you might do the first pass, sorting the data into 256 buckets of approximately 64K elements (256 KB) each.

Then, rather than doing the full second pass, you might finish sorting only the first (or first few) buckets and proceed to do the read of a based on the resulting block of b'. You are guaranteed that this block is contiguous (i.e., a contiguous run of the final sorted sequence), so you don't give up any locality in the read, and your reads will generally be cached. You may also do the first pass of de-permuting o', since the block of o' is also hot in the cache (and perhaps you can combine the latter two phases into one loop).
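
A sketch of that interleaving, assuming a first pass has already grouped the (index, original position) pairs into contiguous buckets described by a prefix-sum array bucket_begin (all names here are assumptions); std::sort again stands in for the remaining radix passes:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Finish one bucket at a time and gather from `a` immediately, while that
// slice of (b', i') is still hot in cache. Bucket k occupies
// keyed[bucket_begin[k] .. bucket_begin[k+1]); o_prime must already have
// keyed.size() elements.
void blocked_gather(const uint32_t *a,
                    std::vector<std::pair<uint32_t, uint32_t>> &keyed,
                    const std::vector<uint32_t> &bucket_begin,
                    std::vector<uint32_t> &o_prime) {
    for (size_t k = 0; k + 1 < bucket_begin.size(); k++) {
        auto lo = keyed.begin() + bucket_begin[k];
        auto hi = keyed.begin() + bucket_begin[k + 1];
        std::sort(lo, hi);  // finish sorting just this bucket (by index value)

        // The reads of `a` stay inside one contiguous region per bucket, and
        // this hot block of o' could start being de-permuted right here.
        for (auto it = lo; it != hi; ++it)
            o_prime[static_cast<size_t>(it - keyed.begin())] = a[it->first];
    }
}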

Smart de-permutation

One area for optimization is how exactly the de-permutation of o' is implemented. In the description above, we assume some index array i, initially with the values [0, 1, 2, ..., max_q], which is sorted along with b. That's conceptually how it works, but you may not need to actually materialize i right away and sort it as auxiliary data. In the first pass of the radix sort, for example, the value of i is implicitly known (since you are iterating through the data), so it can be calculated for free4 and written out during the first pass without ever having appeared in sorted order.

There may also be more efficient ways to do the "unsort" operation than maintaining the full index. For example, the original unsorted b array conceptually has all the information needed to do the unsort, but it is not clear to me how to use it to do so efficiently.

Will it be faster?

So will this actually be faster than the naive approach? It depends largely on implementation details, especially the efficiency of the implemented sort. On my hardware, the naive approach processes about 140 million elements per second. Online descriptions of cache-aware radix sorts seem to vary from perhaps 200 to 600 million elements/s, and since you need two of those, the opportunity for a big speedup would seem limited if you believe those numbers. On the other hand, those numbers are from older hardware and for slightly more general sorts (e.g., for all 32 bits of the key, while we may be able to use as few as 16 bits).

Only a careful implementation will determine whether it is feasible, and feasibility also depends on the hardware. For example, on hardware that can't sustain as much MLP, the sort-and-unsort approach becomes relatively more favorable.

The best approach also depends on the relative values of max_n and max_q. For example, if max_n >> max_q, then the reads will be "sparse" even with optimal sorting, so the naive approach would be better. On the other hand, if max_n << max_q, then the same index will usually be read many times, so the sorting approach will have good read locality, the sorting steps will themselves have better locality, and further optimizations that handle duplicate reads explicitly may be possible.

Multiple cores

It isn't clear from the question whether you are interested in parallelizing this. The naive solution for foo() already admits a "straightforward" parallelization where you simply partition the b and o arrays into equal-sized chunks, one for each thread, which would seem to provide a perfect speedup. Unfortunately, you'll probably find that you get much worse than linear scaling, because you'll be running into resource contention in the memory controller and the associated uncore/offcore resources, which are shared between all cores on a socket. So it isn't clear how much more throughput you'll get for a purely parallel random read load to memory as you add more cores6.
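
A sketch of that straightforward partitioning with std::thread (foo_parallel and the chunking are illustrative; a1 is shared read-only and only the b/o ranges are split):

#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

void foo_parallel(const uint32_t *a1, const uint32_t *b1, const uint32_t *b2,
                  uint32_t *o, unsigned num_threads) {
    const size_t q = static_cast<size_t>(b2 - b1);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; t++) {
        size_t begin = q * t / num_threads;        // contiguous chunk per thread
        size_t end   = q * (t + 1) / num_threads;
        workers.emplace_back([=] {
            for (size_t j = begin; j < end; j++)
                o[j] = a1[b1[j]];                  // same body as the naive foo()
        });
    }
    for (auto &w : workers)
        w.join();
}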

For the radix-sort version, most of the bottlenecks (store throughput, total instruction throughput) are in the core, so I expect it to scale reasonably with additional cores. As Peter mentioned in the comments, if you use hyperthreading, the sort may have the additional benefit of good locality in the core-local L1 and L2 caches, effectively letting each sibling thread use the entire cache rather than cutting the effective capacity in half. Of course, that involves carefully managing your thread affinity so that sibling threads actually use nearby data, rather than just letting the scheduler do whatever it does.


1 You might ask why LLC-load-misses isn't, say, 32 or 48 million, given that we also have to read all 16 million elements of b1 and then the accumulate() call reads all of result. The answer is that LLC-load-misses only counts demand misses that actually miss in the L3. The other read patterns mentioned are totally linear, so the prefetchers will always be bringing the line into the L3 before it is needed. These don't count as "LLC misses" by the definition perf uses.

2 You might want to know how I know that the load misses all come from the reads of a1 in foo: I simply used perf record and perf mem to confirm that the misses were coming from the expected assembly instruction.

3 Almost linear because b1 is not a permutation of all indexes, so in principle indexes can be skipped or duplicated. At the cache-line level, however, it is highly likely that every cache line will be read in order, since each element has a ~63% chance of being included and a cache line holds 16 4-byte elements, so there's only about a 1-in-10-million chance that any given cache line contains zero included elements. So prefetching, which works at the cache-line level, will work fine.
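
To spell out where those two figures come from (a back-of-the-envelope estimate assuming max_n = max_q = 2^24, as in the SSCCE):

$$P(\text{a given index appears in } b_1) = 1 - \left(1 - \tfrac{1}{n}\right)^{q} \approx 1 - e^{-1} \approx 63\%$$

$$P(\text{a 16-element cache line is never touched}) \approx \left(1 - \tfrac{16}{n}\right)^{q} \approx e^{-16} \approx 1.1 \times 10^{-7}$$

i.e. roughly 1 in 9 million, consistent with the "about 1 in 10 million" above.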

4 Here I mean that the calculation of the value comes for free, or nearly so, but of course the write still costs. This is still much better than the "up-front materialization" approach, however, which first creates the i array [0, 1, 2, ...], needing max_q writes, and then needs another max_q writes to sort it in the first radix-sort pass. The implicit materialization only incurs the second write.

5 In fact, the IPC of the actual timed section foo() is much lower: about 0.15 by my calculations. The reported IPC of the entire process is an average of the IPC of the timed section and that of the initialization and accumulation code before and after, which has a much higher IPC.

6 Notably, this is different from how a dependent-load, latency-bound workload scales: a load that is doing random reads but can only have one load in progress, because each load depends on the result of the last, scales very well to multiple cores because the serial nature of the loads doesn't use many downstream resources (but such loads can conceptually also be sped up even on a single core by changing the core loop to handle more than one dependent-load stream in parallel).

