为什么处理未排序数组的速度与使用现代x86-64 clang处理已排序数组的速度相同? [英] Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?

查看:62
本文介绍了为什么处理未排序数组的速度与使用现代x86-64 clang处理已排序数组的速度相同?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我发现了这个流行的〜9岁的

I discovered this popular ~9-year-old SO question and decided to double-check its outcomes.

所以,我有AMD Ryzen 9 5950X,clang ++ 10和Linux,我从问题中复制粘贴了代码,这是我得到的:

So, I have AMD Ryzen 9 5950X, clang++ 10 and Linux, I copy-pasted code from the question and here is what I got:

排序-0.549702s :

~/d/so_sorting_faster$ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
    std::sort(data, data + arraySize);
0.549702
sum = 314931600000

未排序-0.546554秒:

~/d/so_sorting_faster $ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
    // std::sort(data, data + arraySize);
0.546554
sum = 314931600000

我很确定,未排序版本的速度提高了3ms只是一个噪音,但是看来它不再变得更慢了.

I am pretty sure that the fact that unsorted version turned out to be faster by 3ms is just noise, but it seems it is not slower anymore.

那么, CPU的体系结构发生了什么变化(这样就不会再降低一个数量级了)?

So, what has changed in the architecture of CPU (so that it is not an order of magnitude slower anymore)?

以下是多次运行的结果:

Here are results from multiple runs:

Unsorted: 0.543557 0.551147 0.541722 0.555599
Sorted:   0.542587 0.559719 0.53938  0.557909

以防万一,这是我的main.cpp:

Just in case, here is my main.cpp:

#include <algorithm>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    // std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
    return 0;
}

更新

元素数量较多(627680):

With larger number of elements (627680):

Unsorted
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
    // std::sort(data, data + arraySize);
10.3814

Sorted:
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
    std::sort(data, data + arraySize);
10.6885

我认为这个问题仍然有意义-几乎没有区别.

I think the question is still relevant - almost no difference.

推荐答案

您链接的问题中的几个答案都涉及将代码重写为无分支的,从而避免了任何分支预测问题.这就是您更新后的编译器正在执行的操作.

Several of the answers in the question you link talk about rewriting the code to be branchless and thus avoiding any branch prediction issues. That's what your updated compiler is doing.

具体来说,带有 -O3 的clang ++ 10 矢量化内部循环.参见Godbolt上的代码,程序集的第36-67行.代码有点复杂,但是您肯定看不到的是 data [c]> = 128 测试中的任何条件分支.取而代之的是,它使用向量比较指令( pcmpgtd ),其输出是一个掩码,其中1个用于匹配元素,0个用于不匹配.带有该掩码的后续 pand 将不匹配的元素替换为0,以便在无条件地将它们相加时不做任何贡献.

Specifically, clang++ 10 with -O3 vectorizes the inner loop. See the code on godbolt, lines 36-67 of the assembly. The code is a little bit complicated, but one thing you definitely don't see is any conditional branch on the data[c] >= 128 test. Instead it uses vector compare instructions (pcmpgtd) whose output is a mask with 1s for matching elements and 0s for non-matching. The subsequent pand with this mask replaces the non-matching elements by 0, so that they do not contribute anything when unconditionally added to the sum.

相当于C ++的大致水平

The rough C++ equivalent would be

sum += data[c] & -(data[c] >= 128);

对于数组的偶数和奇数元素,代码实际上保留了两个运行的64位 sum s,因此它们可以并行累加,然后在循环结束时加在一起

The code actually keeps two running 64-bit sums, for the even and odd elements of the array, so that they can be accumulated in parallel and then added together at the end of the loop.

一些额外的复杂性是要注意将32位 data 元素符号扩展为64位;这就是像 pxor xmm5,xmm5之类的序列;pcmpgtd xmm5,xmm4;punpckldq xmm4,xmm5 完成.打开 -mavx2 ,您会看到一个更简单的 vpmovsxdq ymm5,xmm5 .

Some of the extra complexity is to take care of sign-extending the 32-bit data elements to 64 bits; that's what sequences like pxor xmm5, xmm5 ; pcmpgtd xmm5, xmm4 ; punpckldq xmm4, xmm5 accomplish. Turn on -mavx2 and you'll see a simpler vpmovsxdq ymm5, xmm5 in its place.

代码看起来也很长,因为循环已经展开,每次迭代处理8个 data 元素.

The code also looks long because the loop has been unrolled, processing 8 elements of data per iteration.

这篇关于为什么处理未排序数组的速度与使用现代x86-64 clang处理已排序数组的速度相同?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆