Replacing a 32-bit loop count variable with 64-bit introduces crazy performance deviations


Question

I was looking for the fastest way to popcount large arrays of data. I encountered a very weird effect: Changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC.

#include <iostream>
#include <chrono>
#include <cstdint>   // uint64_t
#include <cstdlib>   // atol, rand
#include <x86intrin.h>

int main(int argc, char* argv[]) {

    using namespace std;
    if (argc != 2) {
       cerr << "usage: array_size in MB" << endl;
       return -1;
    }

    uint64_t size = atol(argv[1])<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with unsigned
            for (unsigned i=0; i<size/8; i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count=0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0;i<size/8;i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;   // allocated with new[], so use delete[], not free()
}

As you can see, we create a buffer of random data, with the size being x megabytes, where x is read from the command line. Afterwards, we iterate over the buffer and use an unrolled version of the x86 popcount intrinsic to perform the popcount. To get a more precise result, we do the popcount 10,000 times and measure the time it takes. In the upper case, the inner loop variable is unsigned; in the lower case, it is uint64_t. I thought that this should make no difference, but the opposite is the case.
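For reference, the GB/s figure printed by the code falls directly out of the units: each measurement touches 10000 × size bytes in duration nanoseconds, and one byte per nanosecond is 10^9 bytes per 10^9 ns, i.e. exactly 1 GB/s, so (10000.0*size)/duration is already in GB/s.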

I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

g++ -O3 -march=native -std=c++11 test.cpp -o test

Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running test 1 (so 1 MB random data):


  • unsigned 41959360000 0.401554 sec 26.113 GB/s
  • uint64_t 41959360000 0.759822 sec 13.8003 GB/s

As you can see, the throughput of the uint64_t version is only half that of the unsigned version! The problem seems to be that different assembly gets generated, but why? First, I suspected a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):

clang++ -O3 -march=native -std=c++11 test.cpp -o test

Results for test 1:


  • unsigned 41959360000 0.398293 sec 26.3267 GB/s
  • uint64_t 41959360000 0.680954 sec 15.3986 GB/s

So, it is almost the same result and is still strange. But now it gets super strange. I replace the buffer size that was read from input with a constant 1, so I change:

uint64_t size = atol(argv[1]) << 20;

to

uint64_t size = 1 << 20;

Thus, the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for g++:


  • unsigned 41959360000 0.509156 sec 20.5944 GB/s
  • uint64_t 41959360000 0.508673 sec 20.6139 GB/s

Now, both versions are equally fast. However, the unsigned version got even slower! It dropped from 26 to 20 GB/s, so replacing a non-constant by a constant value led to a deoptimization. Seriously, I have no clue what is going on here! But now to clang++ with the new version:


  • unsigned 41959360000 0.677009 sec 15.4884 GB/s
  • uint64_t 41959360000 0.676909 sec 15.4906 GB/s

Wait, what? Now, both versions dropped to the slow number of 15 GB/s. Thus, replacing a non-constant by a constant value even leads to slow code in both cases for Clang!

I asked a colleague with an Ivy Bridge CPU to compile my benchmark. He got similar results, so it does not seem to be specific to Haswell. Because two compilers produce strange results here, it also does not seem to be a compiler bug. We do not have an AMD CPU here, so we could only test with Intel.

Take the first example (the one with atol(argv[1])) and put a static before the variable, i.e.:

static uint64_t size=atol(argv[1])<<20;

Here are my results with g++:


  • unsigned 41959360000 0.396728 sec 26.4306 GB/s
  • uint64_t 41959360000 0.509484 sec 20.5811 GB/s

Yay, yet another alternative! We still have the fast 26 GB/s with u32, but we managed to get u64 at least from the 13 GB/s version up to the 20 GB/s one! On my colleague's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all. Sadly, this only works for g++; clang++ does not seem to care about static.

Can you explain these results? Especially:


  • How can there be such a difference between u32 and u64?
  • How can replacing a non-constant by a constant buffer size trigger less optimal code?
  • How can the insertion of the static keyword make the u64 loop faster? Even faster than the original code on my colleague's computer!

I know that optimization is tricky territory; however, I never thought that such small changes could lead to a 100% difference in execution time, and that small factors like a constant buffer size could completely mix up the results again. Of course, I always want to have the version that is able to popcount at 26 GB/s. The only reliable way I can think of is to copy-paste the assembly for this case and use inline assembly. This is the only way I can get rid of compilers that seem to go mad over small changes. What do you think? Is there another way to reliably get the code with the most performance?

Here is the disassembly for the various results:

26 GB/s version from g++ / u32 / non-const bufsize:

0x400af8:
lea 0x1(%rdx),%eax
popcnt (%rbx,%rax,8),%r9
lea 0x2(%rdx),%edi
popcnt (%rbx,%rcx,8),%rax
lea 0x3(%rdx),%esi
add %r9,%rax
popcnt (%rbx,%rdi,8),%rcx
add $0x4,%edx
add %rcx,%rax
popcnt (%rbx,%rsi,8),%rcx
add %rcx,%rax
mov %edx,%ecx
add %rax,%r14
cmp %rbp,%rcx
jb 0x400af8

13 GB/s version from g++ / u64 / non-const bufsize:

0x400c00:
popcnt 0x8(%rbx,%rdx,8),%rcx
popcnt (%rbx,%rdx,8),%rax
add %rcx,%rax
popcnt 0x10(%rbx,%rdx,8),%rcx
add %rcx,%rax
popcnt 0x18(%rbx,%rdx,8),%rcx
add $0x4,%rdx
add %rcx,%rax
add %rax,%r12
cmp %rbp,%rdx
jb 0x400c00

15 GB/s version from clang++ / u64 / non-const bufsize:

0x400e50:
popcnt (%r15,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r15,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r15,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r15,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp %rbp,%rcx
jb 0x400e50

20 GB/s version from g++ / u32 & u64 / const bufsize:

0x400a68:
popcnt (%rbx,%rdx,1),%rax
popcnt 0x8(%rbx,%rdx,1),%rcx
add %rax,%rcx
popcnt 0x10(%rbx,%rdx,1),%rax
add %rax,%rcx
popcnt 0x18(%rbx,%rdx,1),%rsi
add $0x20,%rdx
add %rsi,%rcx
add %rcx,%rbp
cmp $0x100000,%rdx
jne 0x400a68

15 GB/s version from clang++ / u32 & u64 / const bufsize:

0x400dd0:
popcnt (%r14,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r14,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r14,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r14,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp $0x20000,%rcx
jb 0x400dd0

Interestingly, the fastest (26 GB/s) version is also the longest! It seems to be the only solution that uses lea. Some versions use jb to jump, others use jne. But apart from that, all versions seem to be comparable. I don't see where a 100% performance gap could originate from, but I am not too adept at deciphering assembly. The slowest (13 GB/s) version even looks very short and clean. Can anyone explain this?

No matter what the answer to this question will be, I have learned that in really hot loops every detail can matter, even details that do not seem to have any association with the hot code. I have never thought about what type to use for a loop variable, but as you can see, such a minor change can make a 100% difference! Even the storage class of the size variable can make a huge difference, as we saw with the insertion of the static keyword in front of it! In the future, I will always test various alternatives on various compilers when writing really tight, hot loops that are crucial for system performance.

It is also interesting that the performance difference is still so high even though I have already unrolled the loop four times. So even if you unroll, you can still get hit by major performance deviations. Quite interesting.

Answer

Culprit: False Data Dependency (and the compiler isn't even aware of it)

On Sandy/Ivy Bridge and Haswell processors, the instruction:

popcnt src, dest

appears to have a false dependency on the destination register dest. Even though the instruction only writes to it, the instruction will wait until dest is ready before executing.
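
A minimal hand-written illustration of the effect (my own sketch, not taken from the compiler output; AT&T syntax as in the disassembly dumps above):

popcnt (%rdi), %rax       # writes %rax
popcnt 0x8(%rdi), %rax    # stalls: must wait until the previous popcnt has
                          # produced %rax, even though it never reads it
xor %rax, %rax            # zeroing idiom: breaks the (false) dependency on %rax
popcnt 0x10(%rdi), %rax   # can now issue independently of the first two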

This dependency doesn't just hold up the 4 popcnts from a single loop iteration. It can carry across loop iterations, making it impossible for the processor to parallelize different loop iterations.

The unsigned vs. uint64_t and other tweaks don't directly affect the problem. But they influence the register allocator, which assigns registers to the variables.

In your case, the speeds are a direct result of what is stuck to the (false) dependency chain depending on what the register allocator decided to do.


  • 13 GB/s has a chain: popcnt-add-popcnt-popcnt → next iteration
  • 15 GB/s has a chain: popcnt-add-popcnt-add → next iteration
  • 20 GB/s has a chain: popcnt-popcnt → next iteration
  • 26 GB/s has a chain: popcnt-popcnt → next iteration

The difference between 20 GB/s and 26 GB/s seems to be a minor artifact of the indirect addressing. Either way, the processor starts to hit other bottlenecks once you reach this speed.

To test this, I used inline assembly to bypass the compiler and get exactly the assembly I want. I also split up the count variable to break all other dependencies that might mess with the benchmarks.
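
The accumulator-splitting half of that idea can also be sketched in plain C++ (popcount_split is a hypothetical helper for illustration only; the compiler remains free to funnel all four popcnt results through a single scratch register, which is exactly why the measurements below use inline assembly instead):

#include <cstdint>
#include <x86intrin.h>

// Four independent accumulators, so no single register has to chain
// all popcnt results together. Assumes n is a multiple of 4.
uint64_t popcount_split(const uint64_t* buffer, uint64_t n) {
    uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (uint64_t i = 0; i < n; i += 4) {
        c0 += _mm_popcnt_u64(buffer[i + 0]);
        c1 += _mm_popcnt_u64(buffer[i + 1]);
        c2 += _mm_popcnt_u64(buffer[i + 2]);
        c3 += _mm_popcnt_u64(buffer[i + 3]);
    }
    return c0 + c1 + c2 + c3;
}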

Here are the results:

Sandy Bridge Xeon @ 3.5 GHz: (full test code can be found at the bottom)


  • GCC 4.6.3: g++ popcnt.cpp -std=c++0x -O3 -save-temps -march=native
  • Ubuntu 12

Different registers: 18.6195 GB/s

.L4:
    movq    (%rbx,%rax,8), %r8
    movq    8(%rbx,%rax,8), %r9
    movq    16(%rbx,%rax,8), %r10
    movq    24(%rbx,%rax,8), %r11
    addq    $4, %rax

    popcnt %r8, %r8
    add    %r8, %rdx
    popcnt %r9, %r9
    add    %r9, %rcx
    popcnt %r10, %r10
    add    %r10, %rdi
    popcnt %r11, %r11
    add    %r11, %rsi

    cmpq    $131072, %rax
    jne .L4

Same register: 8.49272 GB/s

.L9:
    movq    (%rbx,%rdx,8), %r9
    movq    8(%rbx,%rdx,8), %r10
    movq    16(%rbx,%rdx,8), %r11
    movq    24(%rbx,%rdx,8), %rbp
    addq    $4, %rdx

    # This time reuse "rax" for all the popcnts.
    popcnt %r9, %rax
    add    %rax, %rcx
    popcnt %r10, %rax
    add    %rax, %rsi
    popcnt %r11, %rax
    add    %rax, %r8
    popcnt %rbp, %rax
    add    %rax, %rdi

    cmpq    $131072, %rdx
    jne .L9

Same register with broken chain: 17.8869 GB/s

.L14:
    movq    (%rbx,%rdx,8), %r9
    movq    8(%rbx,%rdx,8), %r10
    movq    16(%rbx,%rdx,8), %r11
    movq    24(%rbx,%rdx,8), %rbp
    addq    $4, %rdx

    # Reuse "rax" for all the popcnts.
    xor    %rax, %rax    # Break the cross-iteration dependency by zeroing "rax".
    popcnt %r9, %rax
    add    %rax, %rcx
    popcnt %r10, %rax
    add    %rax, %rsi
    popcnt %r11, %rax
    add    %rax, %r8
    popcnt %rbp, %rax
    add    %rax, %rdi

    cmpq    $131072, %rdx
    jne .L14


So what went wrong with the compiler?

It seems that neither GCC nor Visual Studio is aware that popcnt has such a false dependency. Nevertheless, these false dependencies aren't uncommon. It's just a matter of whether the compiler is aware of them.

popcnt isn't exactly the most used instruction. So it's not really a surprise that a major compiler could miss something like this. There also appears to be no documentation anywhere that mentions this problem. If Intel doesn't disclose it, then nobody outside will know until someone runs into it by chance.

Why does the CPU have such a false dependency?

We can only speculate, but it's likely that Intel has the same handling for a lot of two-operand instructions. Common instructions like add and sub take two operands, both of which are inputs. So Intel probably shoved popcnt into the same category to keep the processor design simple.
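
The difference is easy to see side by side (an illustrative sketch, AT&T syntax):

add    %rbx, %rax     # rax = rax + rbx: rax really is an input, a true dependency
popcnt %rbx, %rax     # rax = popcount(rbx): rax is write-only, yet the
                      # scheduler appears to wait for it as if it were an input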

AMD processors do not appear to have this false dependency.

The full test code is below for reference:

#include <iostream>
#include <chrono>
#include <cstdint>   // uint64_t
#include <cstdlib>   // rand
#include <x86intrin.h>

int main(int argc, char* argv[]) {

   using namespace std;
   uint64_t size=1<<20;

   uint64_t* buffer = new uint64_t[size/8];
   char* charbuffer=reinterpret_cast<char*>(buffer);
   for (unsigned i=0;i<size;++i) charbuffer[i]=rand()%256;

   uint64_t count,duration;
   chrono::time_point<chrono::system_clock> startP,endP;
   {
      uint64_t c0 = 0;
      uint64_t c1 = 0;
      uint64_t c2 = 0;
      uint64_t c3 = 0;
      startP = chrono::system_clock::now();
      for( unsigned k = 0; k < 10000; k++){
         for (uint64_t i=0;i<size/8;i+=4) {
            uint64_t r0 = buffer[i + 0];
            uint64_t r1 = buffer[i + 1];
            uint64_t r2 = buffer[i + 2];
            uint64_t r3 = buffer[i + 3];
            __asm__(
                "popcnt %4, %4  \n\t"
                "add %4, %0     \n\t"
                "popcnt %5, %5  \n\t"
                "add %5, %1     \n\t"
                "popcnt %6, %6  \n\t"
                "add %6, %2     \n\t"
                "popcnt %7, %7  \n\t"
                "add %7, %3     \n\t"
                : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
            );
         }
      }
      count = c0 + c1 + c2 + c3;
      endP = chrono::system_clock::now();
      duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
      cout << "No Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
            << (10000.0*size)/(duration) << " GB/s" << endl;
   }
   {
      uint64_t c0 = 0;
      uint64_t c1 = 0;
      uint64_t c2 = 0;
      uint64_t c3 = 0;
      startP = chrono::system_clock::now();
      for( unsigned k = 0; k < 10000; k++){
         for (uint64_t i=0;i<size/8;i+=4) {
            uint64_t r0 = buffer[i + 0];
            uint64_t r1 = buffer[i + 1];
            uint64_t r2 = buffer[i + 2];
            uint64_t r3 = buffer[i + 3];
            __asm__(
                "popcnt %4, %%rax   \n\t"
                "add %%rax, %0      \n\t"
                "popcnt %5, %%rax   \n\t"
                "add %%rax, %1      \n\t"
                "popcnt %6, %%rax   \n\t"
                "add %%rax, %2      \n\t"
                "popcnt %7, %%rax   \n\t"
                "add %%rax, %3      \n\t"
                : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
                : "rax"
            );
         }
      }
      count = c0 + c1 + c2 + c3;
      endP = chrono::system_clock::now();
      duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
      cout << "Chain 4   \t"  << count << '\t' << (duration/1.0E9) << " sec \t"
            << (10000.0*size)/(duration) << " GB/s" << endl;
   }
   {
      uint64_t c0 = 0;
      uint64_t c1 = 0;
      uint64_t c2 = 0;
      uint64_t c3 = 0;
      startP = chrono::system_clock::now();
      for( unsigned k = 0; k < 10000; k++){
         for (uint64_t i=0;i<size/8;i+=4) {
            uint64_t r0 = buffer[i + 0];
            uint64_t r1 = buffer[i + 1];
            uint64_t r2 = buffer[i + 2];
            uint64_t r3 = buffer[i + 3];
            __asm__(
                "xor %%rax, %%rax   \n\t"   // <--- Break the chain.
                "popcnt %4, %%rax   \n\t"
                "add %%rax, %0      \n\t"
                "popcnt %5, %%rax   \n\t"
                "add %%rax, %1      \n\t"
                "popcnt %6, %%rax   \n\t"
                "add %%rax, %2      \n\t"
                "popcnt %7, %%rax   \n\t"
                "add %%rax, %3      \n\t"
                : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
                : "rax"
            );
         }
      }
      count = c0 + c1 + c2 + c3;
      endP = chrono::system_clock::now();
      duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
      cout << "Broken Chain\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
            << (10000.0*size)/(duration) << " GB/s" << endl;
   }

   delete[] buffer;   // allocated with new[], so use delete[], not free()
}


An equally interesting benchmark can be found here: http://pastebin.com/kbzgL8si. This benchmark varies the number of popcnts that are in the (false) dependency chain.

False Chain 0:  41959360000 0.57748 sec     18.1578 GB/s
False Chain 1:  41959360000 0.585398 sec    17.9122 GB/s
False Chain 2:  41959360000 0.645483 sec    16.2448 GB/s
False Chain 3:  41959360000 0.929718 sec    11.2784 GB/s
False Chain 4:  41959360000 1.23572 sec     8.48557 GB/s
