Unexplainable timings for different optimisations


The question



I'm writing some code which has to apply a different algorithm to a large data set, depending upon a setting. The data set is large, and real-world timings have indicated that we need to optimise this if possible.

The selected algorithm has to be run on many subsets of data from a large array. I therefore decided to try several different approaches:

  1. Initialise a Func<> delegate to reference the required algorithm. Call this delegate from within the main loop.
  2. Loop over the data and call the appropriate algorithm from within the main loop.
  3. Call a separate method which implements the main loop itself for each algorithm.

In my tests, I had each approach calling the same underlying method, calculate(). (Of course, the real code calls a different method for each algorithm, but here I am testing the quickest way of calling the algorithm, not the algorithm itself.)

Each of the tests calls the required algorithm in a loop ITERS times.

In this test code, DataReductionAlgorithm is just an enum that defines the various algorithms. It's not really used other than to simulate what would happen in real code.

Here is my test implementation for approach (1). It's very straightforward: assign the Func<> delegate a to the algorithm to be called, then call it from a loop:

private static void test1(int[] data, DataReductionAlgorithm algorithm)
{
    Func<int[], int, int, int> a;

    switch (algorithm)
    {
        case DataReductionAlgorithm.Max:
            a = calculate;
            break;

        case DataReductionAlgorithm.Mean:
            a = calculate;
            break;

        default:
            a = calculate;
            break;
    }

    for (int i = 0; i < ITERS; ++i)
        a(data, 0, data.Length);
}

Here's my test implementation for approach (2). It moves the if test for the choice of algorithm outside the loop. I was expecting this to be the fastest approach:

private static void test2(int[] data, DataReductionAlgorithm algorithm)
{
    switch (algorithm)
    {
        case DataReductionAlgorithm.Max:

            for (int i = 0; i < ITERS; ++i)
                calculate(data, 0, data.Length);

            break;

        case DataReductionAlgorithm.Mean:

            for (int i = 0; i < ITERS; ++i)
                calculate(data, 0, data.Length);

            break;

        default:

            for (int i = 0; i < ITERS; ++i)
                calculate(data, 0, data.Length);

            break;
    }
}

Here's the code for testing approach (3). It moves the if test for the choice of algorithm inside the loop. I was expecting this to be slower than approach (2) because the if test is going to be performed ITERS times, instead of just once:

private static void test3(int[] data, DataReductionAlgorithm algorithm)
{
    for (int i = 0; i < ITERS; ++i)
    {
        switch (algorithm)
        {
            case DataReductionAlgorithm.Max:

                calculate(data, 0, data.Length);
                break;

            case DataReductionAlgorithm.Mean:

                calculate(data, 0, data.Length);
                break;

            default:

                calculate(data, 0, data.Length);
                break;
        }
    }
}

Because of the strange timing results I was getting, I added a new test which is almost identical to test2() except that instead of looping within the switch cases, I call a method to do exactly the same loop.

I therefore expected this to take almost the same time as test2():

private static void test4(int[] data, DataReductionAlgorithm algorithm)
{
    switch (algorithm)
    {
        case DataReductionAlgorithm.Max:

            iterate(ITERS, data);
            break;

        case DataReductionAlgorithm.Mean:

            iterate(ITERS, data);
            break;

        default:

            iterate(ITERS, data);
            break;
    }
}

private static void iterate(int n, int[] data)
{
    for (int i = 0; i < n; ++i)
        calculate(data, 0, data.Length);
}

Here is the entire program, in case anyone wants to try it themselves:

using System;
using System.Diagnostics;
using System.Linq;

namespace Demo
{
    public enum DataReductionAlgorithm
    {
        Single,
        Max,
        Mean
    }

    internal class Program
    {
        private const int ITERS = 100000;

        private void run()
        {
            int[] data = Enumerable.Range(0, 10000).ToArray();

            Stopwatch sw = new Stopwatch();

            for (int trial = 0; trial < 4; ++trial)
            {
                sw.Restart();
                test1(data, DataReductionAlgorithm.Mean);
                Console.WriteLine("test1: " + sw.Elapsed);

                sw.Restart();
                test2(data, DataReductionAlgorithm.Mean);
                Console.WriteLine("test2: " + sw.Elapsed);

                sw.Restart();
                test3(data, DataReductionAlgorithm.Mean);
                Console.WriteLine("test3: " + sw.Elapsed);

                sw.Restart();
                test4(data, DataReductionAlgorithm.Mean);
                Console.WriteLine("test4: " + sw.Elapsed);

                Console.WriteLine();
            }
        }

        private static void test1(int[] data, DataReductionAlgorithm algorithm)
        {
            Func<int[], int, int, int> a;

            switch (algorithm)
            {
                case DataReductionAlgorithm.Max:
                    a = calculate;
                    break;

                case DataReductionAlgorithm.Mean:
                    a = calculate;
                    break;

                default:
                    a = calculate;
                    break;
            }

            for (int i = 0; i < ITERS; ++i)
                a(data, 0, data.Length);
        }

        private static void test2(int[] data, DataReductionAlgorithm algorithm)
        {
            switch (algorithm)
            {
                case DataReductionAlgorithm.Max:

                    for (int i = 0; i < ITERS; ++i)
                        calculate(data, 0, data.Length);

                    break;

                case DataReductionAlgorithm.Mean:

                    for (int i = 0; i < ITERS; ++i)
                        calculate(data, 0, data.Length);

                    break;

                default:

                    for (int i = 0; i < ITERS; ++i)
                        calculate(data, 0, data.Length);

                    break;
            }
        }

        private static void test3(int[] data, DataReductionAlgorithm algorithm)
        {
            for (int i = 0; i < ITERS; ++i)
            {
                switch (algorithm)
                {
                    case DataReductionAlgorithm.Max:

                        calculate(data, 0, data.Length);
                        break;

                    case DataReductionAlgorithm.Mean:

                        calculate(data, 0, data.Length);
                        break;

                    default:

                        calculate(data, 0, data.Length);
                        break;
                }
            }
        }

        private static void test4(int[] data, DataReductionAlgorithm algorithm)
        {
            switch (algorithm)
            {
                case DataReductionAlgorithm.Max:

                    iterate(ITERS, data);
                    break;

                case DataReductionAlgorithm.Mean:

                    iterate(ITERS, data);
                    break;

                default:

                    iterate(ITERS, data);
                    break;
            }
        }

        private static void iterate(int n, int[] data)
        {
            for (int i = 0; i < n; ++i)
                calculate(data, 0, data.Length);
        }

        private static int calculate(int[] data, int i1, int i2)
        {
            // Just a dummy implementation.
            // Using the same algorithm for each approach to avoid differences in timings.

            int result = 0;

            for (int i = i1; i < i2; ++i)
                result += data[i];

            return result;
        }

        private static void Main()
        {
            new Program().run();
        }
    }
}


The results

Firstly, note that these results are from a RELEASE BUILD run from outside the debugger. Running a debug build - or running a release build from the debugger - will give misleading results.

I'm testing a build with .Net 4.5.1 on Windows 8.1 with a quad-core Intel processor. (However, I got similar results with .Net 4.5 and .Net 4.)

I got different results depending on whether it was x64/AnyCPU or x86.
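
A minimal sanity check along these lines (my addition, not part of the original test program; ReportBenchmarkEnvironment is a hypothetical helper) can catch an accidental debug build, an attached debugger, or an unexpected bitness under AnyCPU before any timings are trusted. Debugger.IsAttached and Environment.Is64BitProcess are standard framework properties:

private static void ReportBenchmarkEnvironment()
{
#if DEBUG
    // Debug builds are compiled without optimisations, so any timings are misleading.
    Console.WriteLine("WARNING: this is a DEBUG build");
#endif
    if (System.Diagnostics.Debugger.IsAttached)
    {
        // With a debugger attached, the JIT typically generates unoptimised code.
        Console.WriteLine("WARNING: a debugger is attached");
    }

    // Under AnyCPU, this reports which bitness the process actually got.
    Console.WriteLine("64-bit process: " + Environment.Is64BitProcess);
}

Calling it at the top of run() makes it obvious when a run should be discarded.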

To recap: I was expecting test1() and test3() to be the slowest, and test2() to be the fastest, with test4() being almost the same speed as test2().

Here's the x86 results:

test1: 00:00:00.5892166
test2: 00:00:00.5848795
test3: 00:00:00.5866006
test4: 00:00:00.5867143

That's kind of what I expected, except that test1() was faster than I thought it would be (possibly indicating that calling a delegate is highly optimised).

Here's the x64 results:

test1: 00:00:00.8769743
test2: 00:00:00.8750667
test3: 00:00:00.5839475
test4: 00:00:00.5853400

Whoah!

What happened to test1() and test2()? I can't explain that. How can test2() be so much slower than test3()?

And why isn't test4() pretty much the same speed as test2()?

And why the huge difference between x86 and x64?

Can anyone shed any light on this? The difference in speed is not insignificant - it might make the difference between something taking 10 seconds and it taking 15 seconds.


ADDENDUM

I've accepted the answer below.

However, just to illustrate the fragility of JIT optimisations mentioned by @usr below, consider the following code:

using System;
using System.Diagnostics;

namespace Demo
{
    internal class Program
    {
        private const int ITERS = 10000;

        private void run()
        {
            Stopwatch sw = new Stopwatch();
            int[] data = new int[10000];

            for (int trial = 0; trial < 4; ++trial)
            {
                sw.Restart();
                test1(data, 0);
                var elapsed1 = sw.Elapsed;

                sw.Restart();
                test2(data, 0);
                var elapsed2 = sw.Elapsed;

                Console.WriteLine("Ratio = " + elapsed1.TotalMilliseconds / elapsed2.TotalMilliseconds);
            }

            Console.ReadLine();
        }

        private static void test1(int[] data, int x)
        {
            switch (x)
            {
                case 0:
                {
                    for (int i = 0; i < ITERS; ++i)
                        dummy(data);

                    break;
                }
            }
        }

        private static void test2(int[] data, int x)
        {
            switch (x)
            {
                case 0:
                {
                    loop(data);
                    break;
                }
            }
        }

        private static int dummy(int[] data)
        {
            int max = 0;

            // Also try with "int i = 1" in the loop below.

            for (int i = 0; i < data.Length; ++i)
                if (data[i] > max)
                    max = data[i];

            return max;
        }

        private static void loop(int[] data)
        {
            for (int i = 0; i < ITERS; ++i)
                dummy(data);
        }

        private static void Main()
        {
            new Program().run();
        }
    }
}

Note the line of code underneath the comment // Also try with "int i = 1" in the loop below.

With i = 0, I get the following results for a release x64 build:

Ratio = 1.52235829774506
Ratio = 1.50636405328076
Ratio = 1.52291602053827
Ratio = 1.52803278744701

Merely changing that to i = 1, I get these results:

Ratio = 1.16920209593233
Ratio = 0.990370350435142
Ratio = 0.991150637472754
Ratio = 0.999941245001628

Interesting! :)

The solution

I can reproduce the issue on x64, .NET 4.5, Release, no Debugger.

I have looked at the generated x64 code for test2 and test3. The hot inner loop consumes 99% of the time. Only this loop matters.

For test3, calculate is inlined and the loop bounds are equal to the array bounds. This allows the JIT to eliminate the range check. In test2 the range check could not be eliminated because the loop bounds are dynamic: they are given by int i1, int i2, which are not statically known to be valid array bounds. Only inlining can provide that information to the current JIT; inlining replaces these values with 0 and data.Length.
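
As an aside (a sketch of mine, not something from the original answer): if the dynamic bounds are what block the optimisation, one workaround is to give the JIT the canonical "0 <= i < data.Length" loop shape on the common whole-array path, which is the pattern it recognises for range-check elimination. Whether this actually pays off under the pre-RyuJIT x64 compiler would still need to be confirmed against the disassembly:

private static int calculate(int[] data, int i1, int i2)
{
    // Fast path: the caller asked for the whole array, so loop in the
    // canonical form that lets the JIT drop the per-element range check.
    if (i1 == 0 && i2 == data.Length)
    {
        int full = 0;
        for (int i = 0; i < data.Length; ++i)
            full += data[i];
        return full;
    }

    // General path: dynamic bounds, so the range check remains.
    int result = 0;
    for (int i = i1; i < i2; ++i)
        result += data[i];
    return result;
}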

This does not have to be so: the HotSpot JVM performs dynamic range-check elimination, but the .NET JIT is not sophisticated enough to do this.

test3 with inlining:

test2 with calculate not inlined:

Two branches instead of one: one is the loop test, the other is the range check.

I have no idea why the JIT inlined differently here. Inlining is driven by heuristics.
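
One way to take the heuristic out of the picture (my own experiment sketch, not something done in the original answer; the calculateInlined/calculateNotInlined names are hypothetical) is to pin the inlining decision with MethodImpl attributes and re-run the timings. AggressiveInlining, available from .Net 4.5, is only a hint to the JIT, whereas NoInlining is always honoured:

// Requires: using System.Runtime.CompilerServices;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static int calculateInlined(int[] data, int i1, int i2)
{
    int result = 0;
    for (int i = i1; i < i2; ++i)
        result += data[i];
    return result;
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static int calculateNotInlined(int[] data, int i1, int i2)
{
    int result = 0;
    for (int i = i1; i < i2; ++i)
        result += data[i];
    return result;
}

Substituting these for calculate in test2 and test3 separates the effect of the range check from the effect of the call itself.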
