Using nested vectors vs a flatten vector wrapper, strange behaviour

Question

For a long time I had the impression that using a nested std::vector<std::vector...> for simulating an N-dimensional array is in general bad, since the memory is not guaranteed to be contiguous and one may have cache misses. I thought it's better to use a flat vector and map from multiple dimensions to 1D and vice versa. So, I decided to test it (code listed at the end). It is pretty straightforward: I timed reading/writing to a nested 3D vector vs my own 3D wrapper around a 1D vector. I compiled the code with both g++ and clang++, with -O3 optimization turned on. For each run I changed the dimensions, so I can get a pretty good idea about the behaviour. To my surprise, these are the results I obtained on my machine (MacBook Pro (Retina, 13-inch, Late 2012), 2.5GHz i5, 8GB RAM, OS X 10.10.5):

g++ 5.2

dimensions       nested   flat
X   Y   Z        (ms)     (ms) 

100 100 100  ->  16       24
150 150 150  ->  58       98
200 200 200  ->  136     308
250 250 250  ->  264     746
300 300 300  ->  440    1537

clang++ (LLVM 7.0.0)

dimensions       nested   flat
X   Y   Z        (ms)     (ms) 

100 100 100  ->  16       18
150 150 150  ->  53       61
200 200 200  ->  135     137
250 250 250  ->  255     271
300 300 300  ->  423     477


As you can see, the "flatten" wrapper never beats the nested version. Moreover, g++'s libstdc++ implementation performs quite badly compared to the libc++ implementation; for example, for 300 x 300 x 300 the flatten version is almost 4 times slower than the nested version. With libc++ the two seem to perform about equally.

My questions:

  1. Why isn't the flattened version faster? Shouldn't it be? Am I missing something in the test code?
  2. Moreover, why does g++'s libstdc++ perform so badly when using the flat vector? Again, shouldn't it perform better?

The code I used:

#include <chrono>
#include <cstddef>
#include <iostream>
#include <memory>
#include <random>
#include <vector>

// Thin wrapper around a flat 1D vector
template<typename T>
class Array3D
{
    std::size_t _X, _Y, _Z;
    std::vector<T> _vec;
public:
    Array3D(std::size_t X, std::size_t Y, std::size_t Z):
        _X(X), _Y(Y), _Z(Z), _vec(_X * _Y * _Z) {}
    T& operator()(std::size_t x, std::size_t y, std::size_t z)
    {
        return _vec[z * (_X * _Y) + y * _X + x];
    }
    const T& operator()(std::size_t x, std::size_t y, std::size_t z) const
    {
        return _vec[z * (_X * _Y) + y * _X + x];
    }
};

int main(int argc, char** argv)
{
    std::random_device rd{};
    std::mt19937 rng{rd()};
    std::uniform_real_distribution<double> urd(-1, 1);

    const std::size_t X = std::stol(argv[1]);
    const std::size_t Y = std::stol(argv[2]);
    const std::size_t Z = std::stol(argv[3]);


    // Standard library nested vector
    std::vector<std::vector<std::vector<double>>>
        vec3D(X, std::vector<std::vector<double>>(Y, std::vector<double>(Z)));

    // 3D wrapper around a 1D flat vector
    Array3D<double> vec1D(X, Y, Z);

    // TIMING nested vectors
    std::cout << "Timing nested vectors...\n";
    auto start = std::chrono::steady_clock::now();
    volatile double tmp1 = 0;
    for (std::size_t x = 0 ; x < X; ++x)
    {
        for (std::size_t y = 0 ; y < Y; ++y)
        {
            for (std::size_t z = 0 ; z < Z; ++z)
            {
                vec3D[x][y][z] = urd(rng);
                tmp1 += vec3D[x][y][z];
            }
        }
    }
    std::cout << "\tSum: " << tmp1 << std::endl; // we make sure the loops are not optimized out
    auto end = std::chrono::steady_clock::now();
    std::cout << "Took: ";
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << ms << " milliseconds\n";

    // TIMING flatten vector
    std::cout << "Timing flatten vector...\n";
    start = std::chrono::steady_clock::now();
    volatile double tmp2 = 0;
    for (std::size_t x = 0 ; x < X; ++x)
    {
        for (std::size_t y = 0 ; y < Y; ++y)
        {
            for (std::size_t z = 0 ; z < Z; ++z)
            {
                vec1D(x, y, z) = urd(rng);
                tmp2 += vec1D(x, y, z);
            }
        }
    }
    std::cout << "\tSum: " << tmp2 << std::endl; // we make sure the loops are not optimized out
    end = std::chrono::steady_clock::now();
    std::cout << "Took: ";
    ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << ms << " milliseconds\n";
}

EDIT

Changing Array3D<T>::operator() to return

return _vec[(x * _Y + y) * _Z + z];

as per @1201ProgramAlarm's suggestion does indeed get rid of the "weird" behaviour of g++: the flat and nested versions now take roughly the same time. However, it's still intriguing. I thought the nested one would be much worse due to cache issues. Might I just be lucky and have all the memory contiguously allocated?
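To see why the original index order was the problem: with z * (_X * _Y) + y * _X + x, consecutive z values in the innermost loop land X*Y doubles apart, so the flat array was walked with a huge stride. A sketch of the fixed accessor, applying that one-line change to the class above:

T& operator()(std::size_t x, std::size_t y, std::size_t z)
{
    // z now varies fastest, matching the z-innermost loops, so
    // consecutive inner-loop iterations touch adjacent memory.
    return _vec[(x * _Y + y) * _Z + z];
}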

Answer

Why the nested vectors are about the same speed as flat in your microbenchmark, after fixing the indexing order: You'd expect the flat array to be faster (see Tobias's answer about potential locality problems, and my other answer for why nested vectors suck in general, but not too badly for sequential access). But your specific test is doing so many things that let out-of-order execution hide the overhead of using nested vectors, and/or that just slow things down so much that the extra overhead is lost in measurement noise.

I put your performance-bugfixed source code up on Godbolt so we can look at the asm of the inner loop as compiled by g++5.2, with -O3. (Apple's fork of clang might be similar to clang3.7, but I'll just look at the gcc version.) There's a lot of code from C++ functions, but you can right-click on a source line to scroll the asm windows to the code for that line. Also, mouseover a source line to bold the asm that implements that line, or vice versa.

gcc's inner two loops for the nested version are as follows (with some comments added by hand):

## outer-most loop not shown

.L213:  ## middle loop (over `y`)
    test    rbp, rbp        # Z
    je      .L127           # inner loop runs zero times if Z==0
    mov     rax, QWORD PTR [rsp+80]   # MEM[(struct vector * *)&vec3D], MEM[(struct vector * *)&vec3D]
    xor     r15d, r15d        # z = 0
    mov     rax, QWORD PTR [rax+r12]  # MEM[(struct vector * *)_195], MEM[(struct vector * *)_195]
    mov     rdx, QWORD PTR [rax+rbx]  # D.103857, MEM[(double * *)_38]

## Top of inner-most loop.
.L128:
    lea     rdi, [rsp+5328]   # tmp511,   ## function arg: pointer to the RNG object, which is a local on the stack.
    lea     r14, [rdx+r15*8]  # D.103851,  ## r14 = &(vec3D[x][y][z])
    call    double std::generate_canonical<double, 53ul, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul> >(std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&)  #
    addsd   xmm0, xmm0    # D.103853, D.103853  ## return val *= 2.0: [0.0, 2.0]
    mov     rdx, QWORD PTR [rsp+80]   # MEM[(struct vector * *)&vec3D], MEM[(struct vector * *)&vec3D]   ## redo the pointer-chasing from vec3D.data()
    mov     rdx, QWORD PTR [rdx+r12]  # MEM[(struct vector * *)_150], MEM[(struct vector * *)_150]
    subsd   xmm0, QWORD PTR .LC6[rip]     # D.103859, ## and subtract 1.0:  [-1.0, 1.0]
    mov     rdx, QWORD PTR [rdx+rbx]  # D.103857, MEM[(double * *)_27]
    movsd   QWORD PTR [r14], xmm0 # *_155, D.103859        # store into vec3D[x][y][z]
    movsd   xmm0, QWORD PTR [rsp+64]      # D.103853, tmp1  # reload volatile tmp1
    addsd   xmm0, QWORD PTR [rdx+r15*8]   # D.103853, *_62  # add the value just stored into the array (r14 = rdx+r15*8 because nothing else modifies the pointers in the outer vectors)
    add     r15, 1    # z,
    cmp     rbp, r15  # Z, z
    movsd   QWORD PTR [rsp+64], xmm0      # tmp1, D.103853  # spill tmp1
    jne     .L128     #,
 #End of inner-most loop

.L127:  ## middle-loop
    add     r13, 1    # y,
    add     rbx, 24           # sizeof(std::vector<> == 24) == the size of 3 pointers.
    cmp     QWORD PTR [rsp+8], r13    # %sfp, y
    jne     .L213     #,

 ## outer loop not shown.

And for the flat loop:

 ## outer not shown.
.L214:
    test    rbp, rbp        # Z
    je      .L135       #,
    mov     rax, QWORD PTR [rsp+280]  # D.103849, vec1D._Y
    mov     rdi, QWORD PTR [rsp+288]  # D.103849, vec1D._Z
    xor     r15d, r15d        # z
    mov     rsi, QWORD PTR [rsp+296]  # D.103857, MEM[(double * *)&vec1D + 24B]

.L136:  ## inner-most loop
    imul    rax, r12        # D.103849, x
    lea     rax, [rax+rbx]    # D.103849,
    imul    rax, rdi        # D.103849, D.103849
    lea     rdi, [rsp+5328]   # tmp520,
    add     rax, r15  # D.103849, z
    lea     r14, [rsi+rax*8]  # D.103851,       # &vec1D(x,y,z)
    call    double std::generate_canonical<double, 53ul, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul> >(std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&)  #
    mov     rax, QWORD PTR [rsp+280]  # D.103849, vec1D._Y
    addsd   xmm0, xmm0    # D.103853, D.103853
    mov     rdi, QWORD PTR [rsp+288]  # D.103849, vec1D._Z
    mov     rsi, QWORD PTR [rsp+296]  # D.103857, MEM[(double * *)&vec1D + 24B]
    mov     rdx, rax  # D.103849, D.103849
    imul    rdx, r12        # D.103849, x       # redo address calculation a 2nd time per iteration
    subsd   xmm0, QWORD PTR .LC6[rip]     # D.103859,
    add     rdx, rbx  # D.103849, y
    imul    rdx, rdi        # D.103849, D.103849
    movsd   QWORD PTR [r14], xmm0 # MEM[(double &)_181], D.103859  # store into the address calculated earlier
    movsd   xmm0, QWORD PTR [rsp+72]      # D.103853, tmp2
    add     rdx, r15  # tmp374, z
    add     r15, 1    # z,
    addsd   xmm0, QWORD PTR [rsi+rdx*8]   # D.103853, MEM[(double &)_170]   # tmp2 += vec1D(x,y,z).  rsi+rdx*8 == r14, so this is a reload of the store this iteration.
    cmp     rbp, r15  # Z, z
    movsd   QWORD PTR [rsp+72], xmm0      # tmp2, D.103853
    jne     .L136     #,

.L135:  ## middle loop: increment y
    add     rbx, 1    # y,
    cmp     r13, rbx  # Y, y
    jne     .L214     #,

 ## outer loop not shown.


Your MacBook Pro (Late 2012) has an Intel IvyBridge CPU, so I'm using numbers for that microarchitecture from Agner Fog's instruction tables and microarch guide. Things should be mostly the same on other Intel/AMD CPUs.

The only 2.5GHz mobile IvB i5 is the i5-3210M, so your CPU has 3MiB of L3 cache. This means even your smallest test case (100^3 * 8B per double ~= 7.63MiB) is larger than your last-level cache, so none of your test cases fit in cache at all. That's probably a good thing, because you allocate and default-initialize both nested and flat before testing either of them. However, you do test in the same order you allocate, so if the nested array is still in cache after zeroing the flat array, the flat array may still be hot in L3 cache after the timing loop over the nested array.

If you'd used a repeat-loop to loop over the same array multiple times, you could have got times large enough to measure for smaller array sizes.

You're doing several things here that are super-weird and make this so slow that out-of-order execution can hide the extra latency of changing y, even if your inner z vectors are not perfectly contiguous.

  1. You run a slow PRNG inside the timed loop. std::uniform_real_distribution<double> urd(-1, 1); is extra overhead on top of std::mt19937 rng{rd()};, which is already slow compared to FP-add latency (3 cycles), or compared to the L1D cache load throughput of 2 per cycle. All this extra time running the PRNG gives out-of-order execution a chance to run the array-indexing instructions so the final address is ready by the time the data is. Unless you have a lot of cache misses, you're mostly just measuring PRNG speed, because it produces results much slower than 1 per clock cycle.

g++5.2 doesn't fully inline the urd(rng) code, and the x86-64 System V calling convention has no call-preserved XMM registers. So tmp1/tmp2 have to be spilled/reloaded for every element, even if they weren't volatile.

It also loses its place in the Z vector, and has to redo the outer 2 levels of indirection before accessing the next z element. This is because it doesn't know about the internals of the function it's calling, and assumes that it might have a pointer to the outer vector<>'s memory. (The flat version does two multiplies for indexing in the inner loop, instead of a simple pointer-add.)

clang (with libc++) does fully inline the PRNG, so moving to the next z is just add reg, 8 to increment a pointer in both the flat and nested versions. You could get the same behaviour from gcc by getting an iterator outside the inner loop, or getting a reference to the inner vector, instead of redoing operator[] and hoping the compiler will hoist it for you.
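For example, a minimal sketch of hoisting the reference (the variable name zrow is illustrative), which should let gcc use a simple pointer + offset in the inner loop even without inlining the PRNG:

// Sketch: hoist the two outer dereferences out of the z loop.
for (std::size_t x = 0; x < X; ++x)
    for (std::size_t y = 0; y < Y; ++y)
    {
        std::vector<double>& zrow = vec3D[x][y];  // dereferenced once per (x,y)
        for (std::size_t z = 0; z < Z; ++z)
        {
            zrow[z] = urd(rng);   // now just base pointer + z*8
            tmp1 += zrow[z];
        }
    }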

Intel/AMD FP add/sub/mul throughput/latency is not data-dependent, except for denormals. (x87 also slows down for NaN and maybe infinity, but SSE doesn't. 64-bit code uses SSE even for scalar float/double.) So you could have just initialized your array with zeros, or with a PRNG outside the timing loop. (Or left them zeroed, since the vector<double> constructor does that for you, and it actually takes extra code to get it not to in cases where you're going to write something else.) Division and sqrt performance is data-dependent on some CPUs, and much slower than add/sub/mul.
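A minimal sketch of keeping the PRNG out of the timed region (vec here stands for a flat std::vector<double>; requires <algorithm> for std::generate). This changes what the benchmark measures to pure array traversal, which is presumably the intent:

// Untimed: fill with random data before starting the clock.
std::generate(vec.begin(), vec.end(), [&] { return urd(rng); });

auto start = std::chrono::steady_clock::now();
double sum = 0;                  // non-volatile accumulator
for (double d : vec) sum += d;   // timed: pure sequential reads
auto end = std::chrono::steady_clock::now();
std::cout << sum << '\n';        // use the result so the loop isn't dead code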

You write every element right before you read it, inside the inner loop. In the source, this looks like a store/reload. That's what gcc actually does, unfortunately, but clang with libc++ (which inlines the PRNG) transforms the loop body:

       // original
       vec3D[x][y][z] = urd(rng);
       tmp1 += vec3D[x][y][z];

       // what clang's asm really does
       double xmm7 = urd(rng);  
       vec3D[x][y][z] = xmm7;
       tmp1 += xmm7;

In clang's asm:

                               # do { ...

addsd   xmm7, xmm4             # last instruction of the PRNG
movsd   qword ptr [r8], xmm7   # store it into the Z vector
addsd   xmm7, qword ptr [rsp + 88]
add     r8, 8                  # pointer-increment to walk along the Z vector
dec     r13                    # i--
movsd   qword ptr [rsp + 88], xmm7
jne     .LBB0_74               # }while(i != 0);

It's allowed to do this because vec3D isn't volatile or atomic<>, so it would be undefined behaviour for any other thread to be writing this memory at the same time. This means it can optimize away store/reloads to objects in memory into just a store (and simply use the value it stored, without reloading). Or optimize the store away entirely if it can prove it's a dead store (a store that nothing can ever read, e.g. to an unused static variable).

In gcc's version, it does the indexing for the store before the PRNG call, and the indexing for the reload after. So I think gcc isn't sure that the function call doesn't modify a pointer, because pointers to the outer vectors have escaped the function. (And the PRNG doesn't inline).

However, even an actual store/reload in the asm is still less sensitive to cache misses than a simple load would be!

Store->load forwarding still works even if the store misses in cache. So a cache-miss in a Z vector doesn't directly delay the critical path. It only slows you down if out-of-order execution can't hide the latency of the cache miss. (A store can retire as soon as the data is written to the store-buffer and all previous instructions have retired. I'm not sure if a load can retire before the cache line even makes it to L1D, if it got its data from store-forwarding. It may be able to, because x86 does allow StoreLoad reordering: stores are allowed to become globally visible after loads.) In that case, a store/reload only adds 6 cycles of latency for the PRNG result (off the critical path from one PRNG state to the next PRNG state). And it's only a throughput bottleneck if it cache-misses so much that the store-buffer fills up and prevents new store uops from executing, which in turn eventually prevents new uops from issuing into the out-of-order core when the Reservation Station or ROB fills up with un-executed or un-retired (respectively) uops.

With reversed indexing (original version of the flat code), probably the main bottleneck was the scattered stores. IDK why clang did so much better than gcc there. Maybe clang managed to invert a loop and traverse memory in sequential order after all. (Since it fully inlined the PRNG, there were no function calls that would require the state of memory to match program-order.)

Traversing each Z vector in-order means that cache misses are relatively far apart (even if each Z vector is not contiguous with the previous one), giving lots of time for the stores to execute. Or even if a store-forwarded load can't actually retire until the L1D cache actually owns the cache line (in the Modified state of the MESI protocol), speculative execution has the correct data and didn't have to wait for the latency of the cache miss. The out-of-order instruction window is probably big enough to keep the critical path from stalling before the load can retire. (Cache-miss loads are normally really bad, because dependent instructions can't be executed without data for them to operate on, so they much more easily create bubbles in the pipeline. With a full cache miss to DRAM having a latency of over 300 cycles, and the out-of-order window being 168 uops on IvB, it can't hide all of the latency for code executing at even 1 uop (approximately 1 instruction) per clock.) For pure stores, the out-of-order window extends beyond the ROB size, because they don't need to commit to L1D to retire. In fact, they can't commit until after they retire, because that's the point at which they're known to be non-speculative. (So making them globally visible earlier than that would prevent roll-back on detection of an exception or mis-speculation.)

I don't have libc++ installed on my desktop, so I can't benchmark that version against g++. With g++5.4, I'm finding Nested: 225 milliseconds and Flat: 239 milliseconds. I suspect that the extra array-indexing multiplies are a problem, and compete with ALU instructions the PRNG uses. In contrast, the nested version redoing a bunch of pointer-chasing that hits in L1D cache can happen in parallel. My desktop is a Skylake i7-6700k at 4.4GHz. SKL has a ROB (ReOrder Buffer) size of 224 uops, and RS of 97 uops, so the out-of-order window is very large. It also has FP-add latency of 4 cycles (unlike previous uarches where it was 3).

volatile double tmp1 = 0; Your accumulator is volatile, which forces the compiler to store/reload it every iteration of the inner loop. The total latency of the loop-carried dependency chain in the inner loop is 9 cycles: 3 for the addsd, and 6 for store-forwarding from the movsd store to the movsd reload. (clang folds the reload into a memory operand with addsd xmm7, qword ptr [rsp + 88], but same difference. [rsp+88] is on the stack, where variables with automatic storage are stored if they need to be spilled from registers.)

As noted above, the non-inline function call for gcc will also force a spill/reload in the x86-64 System V calling convention (used by everything but Windows). But a smart compiler could have done 4 PRNG calls, for example, and then 4 array stores. (If you'd used an iterator to make sure gcc knew the vectors holding other vectors weren't changing.)
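A hypothetical sketch of that batching (the names and the non-volatile accumulator sum are mine, and it assumes Z is a multiple of 4 and the fixed indexing where z is contiguous); this amortizes the call overhead and the accumulator spill/reload over groups of 4 elements:

// Hypothetical: 4 PRNG calls, then 4 stores.
for (std::size_t z = 0; z < Z; z += 4)
{
    double r0 = urd(rng), r1 = urd(rng), r2 = urd(rng), r3 = urd(rng);
    double* p = &vec1D(x, y, z);   // index once per group of 4
    p[0] = r0; p[1] = r1; p[2] = r2; p[3] = r3;
    sum += (r0 + r1) + (r2 + r3);
}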

Using -ffast-math would have let the compiler auto-vectorize (if not for the PRNG and volatile). This would let you run through the arrays fast enough that lack of locality between different Z vectors could be a real problem. It would also let compilers unroll with multiple accumulators, to hide the FP-add latency. e.g. they could (and clang would) make asm equivalent to:

float t0=0, t1=0, t2=0, t3=0;
for () {
   t0 += a[i + 0];
   t1 += a[i + 1];
   t2 += a[i + 2];
   t3 += a[i + 3];
}
t0 = (t0 + t1) + (t2 + t3);

That has 4 separate dependency chains, so it can keep 4 FP adds in flight. Since IvB has 3-cycle latency and one-per-clock throughput for addsd, we only need to keep 4 in flight to saturate its throughput. (Skylake has 4c latency, 2 per clock throughput, same as mul or FMA, so you need 8 accumulators to avoid latency bottlenecks. Actually, even more is better. As testing by the asker of that question showed, Haswell did better with even more accumulators when coming close to maxing out load throughput.)

Something like that would be a much better test of how efficient it is to loop over an Array3D. If you want to stop the loop from being optimized away entirely, just use the result. Test your microbenchmark to make sure that increasing the problem size scales the time; if not then something got optimized away, or you're not testing what you think you're testing. Don't make the inner-loop temporary volatile!!
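Tying those suggestions together, a minimal sketch of a sounder timed region (the repeat count and names are illustrative, and the array is assumed to be filled outside the timed region):

// Sketch: repeat the traversal for measurable durations, keep the
// accumulator in a register, and consume the result at the end.
constexpr int reps = 10;
double sum = 0;                       // NOT volatile
auto t0 = std::chrono::steady_clock::now();
for (int r = 0; r < reps; ++r)
    for (std::size_t x = 0; x < X; ++x)
        for (std::size_t y = 0; y < Y; ++y)
            for (std::size_t z = 0; z < Z; ++z)
                sum += vec1D(x, y, z);
auto t1 = std::chrono::steady_clock::now();
std::cout << "Sum: " << sum << '\n';  // using sum keeps the loops alive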

Writing microbenchmarks is not easy. You have to understand enough to write one that tests what you think you're testing. :P This is a good example of how easy it is to go wrong.

May I just be lucky and have all the memory contiguously allocated?

Yes, that probably happens for many small allocations done in-order, when you haven't allocated and freed anything before doing this. If they were large enough (usually one 4kiB page or larger), glibc malloc would switch to using mmap(MAP_ANONYMOUS) and then the kernel would choose randomized virtual addresses (ASLR). So with larger Z, you might expect locality to get worse. But on the other hand, larger Z vectors mean you spend more of your time looping over one contiguous vector so a cache miss when changing y (and x) becomes relatively less important.

Looping sequentially over your data apparently doesn't expose this, because the extra pointer accesses hit in cache, so the pointer-chasing has low enough latency for OOO execution to hide it with your slow loop.

Prefetch has a really easy time keeping up here.

Different compilers / libraries can make a big difference with this weird test. On my system (Arch Linux, i7-6700k Skylake with 4.4GHz max turbo), the best of 4 runs at 300 300 300 for g++5.4 -O3 was:

Timing nested vectors...
        Sum: 579.78
Took: 225 milliseconds
Timing flatten vector...
        Sum: 579.78
Took: 239 milliseconds

 Performance counter stats for './array3D-gcc54 300 300 300':

    532.066374      task-clock (msec)         #    1.000 CPUs utilized          
             2      context-switches          #    0.004 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
        54,523      page-faults               #    0.102 M/sec                  
 2,330,334,633      cycles                    #    4.380 GHz                    
 7,162,855,480      instructions              #    3.07  insn per cycle         
   632,509,527      branches                  # 1188.779 M/sec                  
       756,486      branch-misses             #    0.12% of all branches        

   0.532233632 seconds time elapsed

vs. g++7.1 -O3 (which apparently decided to branch on something that g++5.4 didn't)

Timing nested vectors...
        Sum: 932.159
Took: 363 milliseconds
Timing flatten vector...
        Sum: 932.159
Took: 378 milliseconds

 Performance counter stats for './array3D-gcc71 300 300 300':

    810.911200      task-clock (msec)         #    1.000 CPUs utilized          
             0      context-switches          #    0.000 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
        54,523      page-faults               #    0.067 M/sec                  
 3,546,467,563      cycles                    #    4.373 GHz                    
 7,107,511,057      instructions              #    2.00  insn per cycle         
   794,124,850      branches                  #  979.299 M/sec                  
    55,074,134      branch-misses             #    6.94% of all branches        

   0.811067686 seconds time elapsed

vs. clang4.0 -O3 (with gcc's libstdc++, not libc++)

perf stat ./array3D-clang40-libstdc++ 300 300 300
Timing nested vectors...
        Sum: -349.786
Took: 1657 milliseconds
Timing flatten vector...
        Sum: -349.786
Took: 1631 milliseconds

 Performance counter stats for './array3D-clang40-libstdc++ 300 300 300':

   3358.297093      task-clock (msec)         #    1.000 CPUs utilized          
             9      context-switches          #    0.003 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
        54,521      page-faults               #    0.016 M/sec                  
14,679,919,916      cycles                    #    4.371 GHz                    
12,917,363,173      instructions              #    0.88  insn per cycle         
 1,658,618,144      branches                  #  493.887 M/sec                  
       916,195      branch-misses             #    0.06% of all branches        

   3.358518335 seconds time elapsed

I didn't dig into what clang did wrong, or try with -ffast-math and/or -march=native. (Those won't do much unless you remove volatile, though.)

perf stat -d doesn't show more cache misses (L1 or last-level) for clang than gcc. But it does show clang is doing more than twice as many L1D loads.

I did try with a non-square array. The times are almost exactly the same when keeping the total element count the same but changing the final dimension to 5 or 6.

Even a minor change to the source helps, and makes "flatten" faster than nested with gcc (from 240ms down to 220ms for 300^3, but barely making any difference for nested):

 // vec1D(x, y, z) = urd(rng);
    double res = urd(rng);
    vec1D(x, y, z) = res;   // indexing calculation only done once, after the function call
    tmp2 += vec1D(x, y, z);
    // using iterators would still avoid redoing it at all.
