Why is processing a sorted array faster than processing an unsorted array?

Problem description

Here is a piece of C++ code that shows some very peculiar behavior. For some strange reason, sorting the data miraculously makes the code almost six times faster:

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}

  • Without std::sort(data, data + arraySize);, the code runs in 11.54 seconds.
  • With the sorted data, the code runs in 1.93 seconds.

Initially, I thought this might be just a language or compiler anomaly, so I tried Java:

import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // Generate data
        int arraySize = 32768;
        int data[] = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i)
        {
            // Primary loop
            for (int c = 0; c < arraySize; ++c)
            {
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}

The result was similar but less extreme.


My first thought was that sorting brings the data into the cache, but then I thought how silly that was because the array was just generated.

  • What is going on?
  • Why is processing a sorted array faster than processing an unsorted array?

The code is summing up some independent terms, so the order should not matter.

Solution

You are a victim of branch prediction fail.


What is Branch Prediction?

Consider a railroad junction:

[Image: a railroad junction. Image by Mecanismo, via Wikimedia Commons. Used under the CC-By-SA 3.0 license.]

Now for the sake of argument, suppose this is back in the 1800s - before long distance or radio communication.

You are the operator of a junction and you hear a train coming. You have no idea which way it is supposed to go. You stop the train to ask the driver which direction they want. And then you set the switch appropriately.

Trains are heavy and have a lot of inertia. So they take forever to start up and slow down.

Is there a better way? You guess which direction the train will go!

  • If you guessed right, it continues on.
  • If you guessed wrong, the captain will stop, back up, and yell at you to flip the switch. Then it can restart down the other path.

If you guess right every time, the train will never have to stop.
If you guess wrong too often, the train will spend a lot of time stopping, backing up, and restarting.


Consider an if-statement. At the processor level, it is a branch instruction.

You are a processor and you see a branch. You have no idea which way it will go. What do you do? You halt execution and wait until the previous instructions are complete. Then you continue down the correct path.

Modern processors are complicated and have long pipelines. So they take forever to "warm up" and "slow down".

Is there a better way? You guess which direction the branch will go!

  • If you guessed right, you continue executing.
  • If you guessed wrong, you need to flush the pipeline and roll back to the branch. Then you can restart down the other path.

If you guess right every time, the execution will never have to stop.
If you guess wrong too often, you spend a lot of time stalling, rolling back, and restarting.


This is branch prediction. I admit it's not the best analogy since the train could just signal the direction with a flag. But in computers, the processor doesn't know which direction a branch will go until the last moment.

So how would you strategically guess to minimize the number of times that the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every three times, you guess the same...

In other words, you try to identify a pattern and follow it. This is more or less how branch predictors work.

Most applications have well-behaved branches. So modern branch predictors will typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.

Further reading: "Branch predictor" article on Wikipedia.


As hinted above, the culprit is this if-statement:

if (data[c] >= 128)
    sum += data[c];

Notice that the data is evenly distributed between 0 and 255. When the data is sorted, roughly the first half of the iterations will not enter the if-statement. After that, they will all enter the if-statement.

This is very friendly to the branch predictor since the branch consecutively goes the same direction many times. Even a simple saturating counter will correctly predict the branch except for the few iterations after it switches direction.

Quick visualization:

T = branch taken
N = branch not taken

data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N  N  N  N  N  ...   N    N    T    T    T  ...   T    T    T  ...

       = NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT  (easy to predict)

However, when the data is completely random, the branch predictor is rendered useless, because it can't predict random data. Thus there will probably be around 50% misprediction (no better than random guessing).

data[] = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118,  14, 150, 177, 182, 133, ...
branch =   T,   T,   N,   T,   T,   T,   T,  N,   T,   N,   N,   T,   T,   T,   N  ...

       = TTNTTTTNTNNTTTN ...   (completely random - hard to predict)


So what can be done?

If the compiler isn't able to optimize the branch into a conditional move, you can try some hacks if you are willing to sacrifice readability for performance.

Replace:

if (data[c] >= 128)
    sum += data[c];

with:

int t = (data[c] - 128) >> 31;
sum += ~t & data[c];

This eliminates the branch and replaces it with some bitwise operations.

(Note that this hack is not strictly equivalent to the original if-statement: it relies on a 32-bit int and on >> performing an arithmetic shift on negative values. But in this case, it's valid for all the input values of data[].)

Benchmarks: Core i7 920 @ 3.5 GHz

C++ - Visual Studio 2010 - x64 Release

//  Branch - Random
seconds = 11.777

//  Branch - Sorted
seconds = 2.352

//  Branchless - Random
seconds = 2.564

//  Branchless - Sorted
seconds = 2.587

Java - NetBeans 7.1.1 JDK 7 - x64

//  Branch - Random
seconds = 10.93293813

//  Branch - Sorted
seconds = 5.643797077

//  Branchless - Random
seconds = 3.113581453

//  Branchless - Sorted
seconds = 3.186068823

Observations:

  • With the Branch: There is a huge difference between the sorted and unsorted data.
  • With the Hack: There is no difference between sorted and unsorted data.
  • In the C++ case, the hack is actually a tad slower than with the branch when the data is sorted.

A general rule of thumb is to avoid data-dependent branching in critical loops (such as in this example).


Update:

  • GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.

    (Or somewhat fast: for the already-sorted case, cmov can be slower, particularly if GCC puts it on the critical path instead of just the add, and particularly on Intel CPUs before Broadwell, where cmov has 2-cycle latency. See "gcc optimization flag -O3 makes code slower than -O2".)

  • VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.

  • Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, it is also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...

  • If you give the Intel compiler the branchless code, it just out-right vectorizes it... and is just as fast as with the branch (with the loop interchange).

This goes to show that even mature modern compilers can vary wildly in their ability to optimize code...
