Why is processing a sorted array faster than processing an unsorted array?

Problem description

Here is a piece of C++ code that shows some very peculiar behavior. For some strange reason, sorting the data (before the timed region) miraculously makes the loop almost six times faster.

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {   // Primary loop
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock()-start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << '\n';
    std::cout << "sum = " << sum << '\n';
}

  • Without std::sort(data, data + arraySize);, the code runs in 11.54 seconds.
  • With the sorted data, the code runs in 1.93 seconds.

(Sorting itself takes more time than this one pass over the array, so it's not actually worth doing if we needed to calculate this for an unknown array.)


Initially, I thought this might be just a language or compiler anomaly, so I tried Java:

import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // Generate data
        int arraySize = 32768;
        int data[] = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < 100000; ++i)
        {
            for (int c = 0; c < arraySize; ++c)
            {   // Primary loop
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}

The result was similar, but less extreme.


My first thought was that sorting brings the data into the cache, but then I thought how silly that was because the array was just generated.

  • What is going on?
  • Why is processing a sorted array faster than processing an unsorted array?

The code is summing up some independent terms, so the order should not matter.



Solution

You are a victim of branch prediction fail.


What is Branch Prediction?

Consider a railroad junction:

[Image: a railroad junction.] Image by Mecanismo, via Wikimedia Commons. Used under the CC-By-SA 3.0 license.

Now for the sake of argument, suppose this is back in the 1800s - before long-distance or radio communication.

You are the operator of a junction and you hear a train coming. You have no idea which way it is supposed to go. You stop the train to ask the driver which direction they want. And then you set the switch appropriately.

Trains are heavy and have a lot of inertia, so they take forever to start up and slow down.

Is there a better way? You guess which direction the train will go!

  • If you guessed right, it continues on.
  • If you guessed wrong, the driver will stop, back up, and yell at you to flip the switch. Then the train can restart down the other path.

If you guess right every time, the train will never have to stop.
If you guess wrong too often, the train will spend a lot of time stopping, backing up, and restarting.


Consider an if-statement. At the processor level, it is a branch instruction.

You are a processor and you see a branch. You have no idea which way it will go. What do you do? You halt execution and wait until the previous instructions are complete. Then you continue down the correct path.

Modern processors are complicated and have long pipelines. This means they take forever to "warm up" and "slow down".

Is there a better way? You guess which direction the branch will go!

  • If you guessed right, you continue executing.
  • If you guessed wrong, you need to flush the pipeline and roll back to the branch. Then you can restart down the other path.

If you guess right every time, the execution will never have to stop.
If you guess wrong too often, you spend a lot of time stalling, rolling back, and restarting.


This is branch prediction. I admit it's not the best analogy since the train could just signal the direction with a flag. But in computers, the processor doesn't know which direction a branch will go until the last moment.

How would you strategically guess to minimize the number of times that the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every three times, you guess the same...

In other words, you try to identify a pattern and follow it. This is more or less how branch predictors work.
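The "look at past history" idea can be sketched with a toy model. The function below is purely illustrative: its name and the two-outcome history length are assumptions made for this example, and real branch predictors are far more elaborate. It keeps one small saturating counter per recent-history pattern, so a branch that alternates, or goes one way every third time, settles into correct predictions after a brief warm-up.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy history-based predictor (hypothetical, for illustration only).
// It predicts each outcome from the previous two outcomes of the same
// branch, using a table of four 2-bit saturating counters.
// Counter values 0 and 1 predict "not taken"; 2 and 3 predict "taken".
std::size_t mispredictions_with_history(const std::vector<bool>& outcomes)
{
    std::array<int, 4> counters{};  // all start at 0 ("strongly not taken")
    unsigned history = 0;           // last two outcomes, newest in bit 0
    std::size_t misses = 0;

    for (bool taken : outcomes)
    {
        assert(history < counters.size());
        int& counter = counters[history];

        const bool predicted_taken = counter >= 2;
        if (predicted_taken != taken)
            ++misses;

        // Nudge the counter toward the actual outcome, saturating at 0 and 3.
        if (taken && counter < 3)
            ++counter;
        else if (!taken && counter > 0)
            --counter;

        // Shift the new outcome into the two-bit history.
        history = ((history << 1) | (taken ? 1u : 0u)) & 3u;
    }
    return misses;
}
```

Fed a strictly alternating sequence, or a "taken once every three times" sequence, this model mispredicts only a few times at the start and then follows the pattern. Random outcomes give it nothing to learn.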

Most applications have well-behaved branches. Therefore, modern branch predictors will typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.

Further reading: "Branch predictor" article on Wikipedia.


As hinted above, the culprit is this if-statement:

if (data[c] >= 128)
    sum += data[c];

Notice that the data is evenly distributed between 0 and 255. When the data is sorted, roughly the first half of the iterations will not enter the if-statement. After that, they will all enter the if-statement.

This is very friendly to the branch predictor since the branch consecutively goes the same direction many times. Even a simple saturating counter will correctly predict the branch except for the few iterations after it switches direction.

Quick visualization:

T = branch taken
N = branch not taken

data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N  N  N  N  N  ...   N    N    T    T    T  ...   T    T    T  ...

       = NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT  (easy to predict)

However, when the data is completely random, the branch predictor is rendered useless, because it can't predict random data. Thus there will probably be around 50% misprediction (no better than random guessing).

data[] = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118,  14, 150, 177, 182, ...
branch =   T,   T,   N,   T,   T,   T,   T,  N,   T,   N,   N,   T,   T,   T  ...

       = TTNTTTTNTNNTTT ...   (completely random - impossible to predict)
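To make the two visualizations concrete, here is a minimal sketch that simulates a single 2-bit saturating counter on the branch `value >= 128`. It is a simplified model, not a description of any real CPU. The names `count_mispredictions` and `make_data`, the fixed seed, and the use of `std::mt19937` instead of `std::rand` are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Counts how often a 2-bit saturating counter mispredicts the branch
// `value >= 128` during one pass over `data`.
// States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
std::size_t count_mispredictions(const std::vector<int>& data)
{
    int state = 0;  // start at "strongly not taken"
    std::size_t misses = 0;

    for (int value : data)
    {
        const bool taken = value >= 128;
        const bool predicted_taken = state >= 2;
        if (predicted_taken != taken)
            ++misses;

        // Move one step toward the actual outcome, saturating at 0 and 3.
        state = taken ? std::min(state + 1, 3) : std::max(state - 1, 0);
        assert(state >= 0 && state <= 3);
    }
    return misses;
}

// Hypothetical helper: 32768 pseudo-random values in [0, 255], fixed seed.
std::vector<int> make_data()
{
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, 255);
    std::vector<int> data(32768);
    for (int& v : data)
        v = dist(gen);
    return data;
}
```

Calling `count_mispredictions(make_data())` on the unsorted values should report misses on roughly half of the iterations. After `std::sort` on the same vector, the model misses only at the single N-to-T transition (twice, with the starting state used here).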


What can be done?

If the compiler isn't able to optimize the branch into a conditional move, you can try some hacks if you are willing to sacrifice readability for performance.

Replace:

if (data[c] >= 128)
    sum += data[c];

with:

int t = (data[c] - 128) >> 31;
sum += ~t & data[c];

This eliminates the branch and replaces it with some bitwise operations.

(Note that this hack is not strictly equivalent to the original if-statement. But in this case, it's valid for all the input values of data[].)
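The sketch below spells out the assumptions behind the hack and checks it against the plain if-statement. The function names are made up for this example. The hack assumes a 32-bit `int`, an arithmetic right shift for negative values (guaranteed since C++20, implementation-defined before that), and inputs for which `value - 128` cannot overflow. All of these hold for the 0 to 255 range used here.

```cpp
#include <cassert>
#include <climits>
#include <vector>

// Reference version: the original data-dependent branch.
long long sum_branching(const std::vector<int>& data)
{
    long long sum = 0;
    for (int value : data)
        if (value >= 128)
            sum += value;
    return sum;
}

// Branchless version using the bit hack from the answer.
long long sum_branchless(const std::vector<int>& data)
{
    static_assert(sizeof(int) * CHAR_BIT == 32,
                  "the shift by 31 assumes a 32-bit int");
    long long sum = 0;
    for (int value : data)
    {
        // Documented input range; this check is compiled out under NDEBUG.
        assert(value >= 0 && value <= 255);

        // t is -1 (all bits set) when value < 128, and 0 otherwise.
        int t = (value - 128) >> 31;

        // ~t is 0 for small values and all ones for values >= 128,
        // so the mask either discards or keeps `value`.
        sum += ~t & value;
    }
    return sum;
}
```

For every value from 0 to 255 the two functions agree. Outside that range the subtraction can overflow for values close to `INT_MIN`, which is one reason the hack is not a general replacement for the comparison.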

Benchmarks: Core i7 920 @ 3.5 GHz

C++ - Visual Studio 2010 - x64 Release

Scenario                    Time (seconds)
Branching - Random data     11.777
Branching - Sorted data     2.352
Branchless - Random data    2.564
Branchless - Sorted data    2.587

Java - NetBeans 7.1.1 JDK 7 - x64

Scenario                    Time (seconds)
Branching - Random data     10.93293813
Branching - Sorted data     5.643797077
Branchless - Random data    3.113581453
Branchless - Sorted data    3.186068823

Observations:

  • With the Branch: There is a huge difference between the sorted and unsorted data.
  • With the Hack: There is no difference between sorted and unsorted data.
  • In the C++ case, the hack is actually a tad slower than with the branch when the data is sorted.

A general rule of thumb is to avoid data-dependent branching in critical loops (such as in this example).


Update:

  • GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move, so there is no difference between the sorted and unsorted data - both are fast.

    (Or somewhat fast: for the already-sorted case, cmov can be slower, especially if GCC puts it on the critical path instead of just add, and especially on Intel before Broadwell, where cmov has 2-cycle latency. See the related question "gcc optimization flag -O3 makes code slower than -O2".)

  • VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.

  • Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. Not only is it immune to the mispredictions, it's also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...

  • If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).
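The loop interchange can be written out by hand to show why it sidesteps the mispredictions. This is a sketch of the idea only, not ICC's actual output, and the function names are invented for this example. Once the loops are swapped, the condition depends only on the outer loop's element, so it is tested once per element instead of once per inner iteration.

```cpp
#include <cassert>
#include <vector>

// Original order: the data-dependent branch sits in the inner loop and is
// evaluated data.size() * repeats times.
long long sum_original(const std::vector<int>& data, unsigned repeats)
{
    long long sum = 0;
    for (unsigned i = 0; i < repeats; ++i)
        for (int value : data)
            if (value >= 128)
                sum += value;
    return sum;
}

// Interchanged order: the branch is hoisted to the outer loop and is
// evaluated only data.size() times; the inner loop has no data-dependent
// branch at all.
long long sum_interchanged(const std::vector<int>& data, unsigned repeats)
{
    long long sum = 0;
    for (int value : data)
        if (value >= 128)
            for (unsigned i = 0; i < repeats; ++i)
                sum += value;
    return sum;
}

// Small sanity check that both loop orders compute the same total.
void check_orders_agree()
{
    const std::vector<int> sample = {226, 185, 125, 158, 198, 144, 217, 79};
    assert(sum_original(sample, 1000) == sum_interchanged(sample, 1000));
}
```

Both versions add the same terms, just in a different order, which is why a compiler is allowed to make this transformation for integer sums.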

This goes to show that even mature modern compilers can vary wildly in their ability to optimize code...
