Huge performance difference (26x faster) when compiling for 32 and 64 bits


Problem description


I was trying to measure the performance difference between using a for loop and a foreach loop when accessing lists of value types and reference types.

I used the following class to do the profiling.

public static class Benchmarker
{
    public static void Profile(string description, int iterations, Action func)
    {
        Console.Write(description);

        // Warm up
        func();

        Stopwatch watch = new Stopwatch();

        // Clean up
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        watch.Start();
        for (int i = 0; i < iterations; i++)
        {
            func();
        }
        watch.Stop();

        Console.WriteLine(" average time: {0} ms", watch.Elapsed.TotalMilliseconds / iterations);
    }
}

I used double as my value type, and I created this 'fake class' to test reference types:

class DoubleWrapper
{
    public double Value { get; set; }

    public DoubleWrapper(double value)
    {
        Value = value;
    }
}

Finally I ran this code and compared the time differences.

static void Main(string[] args)
{
    int size = 1000000;
    int iterationCount = 100;

    var valueList = new List<double>(size);
    for (int i = 0; i < size; i++) 
        valueList.Add(i);

    var refList = new List<DoubleWrapper>(size);
    for (int i = 0; i < size; i++) 
        refList.Add(new DoubleWrapper(i));

    double dummy;

    Benchmarker.Profile("valueList for: ", iterationCount, () =>
    {
        double result = 0;
        for (int i = 0; i < valueList.Count; i++)
        {
             unchecked
             {
                 var temp = valueList[i];
                 result *= temp;
                 result += temp;
                 result /= temp;
                 result -= temp;
             }
        }
        dummy = result;
    });

    Benchmarker.Profile("valueList foreach: ", iterationCount, () =>
    {
        double result = 0;
        foreach (var v in valueList)
        {
            var temp = v;
            result *= temp;
            result += temp;
            result /= temp;
            result -= temp;
        }
        dummy = result;
    });

    Benchmarker.Profile("refList for: ", iterationCount, () =>
    {
        double result = 0;
        for (int i = 0; i < refList.Count; i++)
        {
            unchecked
            {
                var temp = refList[i].Value;
                result *= temp;
                result += temp;
                result /= temp;
                result -= temp;
            }
        }
        dummy = result;
    });

    Benchmarker.Profile("refList foreach: ", iterationCount, () =>
    {
        double result = 0;
        foreach (var v in refList)
        {
            unchecked
            {
                var temp = v.Value;
                result *= temp;
                result += temp;
                result /= temp;
                result -= temp;
            }
        }

        dummy = result;
    });

    SafeExit();
}

I selected Release and Any CPU options, ran the program and got the following times:

valueList for:  average time: 483,967938 ms
valueList foreach:  average time: 477,873079 ms
refList for:  average time: 490,524197 ms
refList foreach:  average time: 485,659557 ms
Done!

Then I selected Release and x64 options, ran the program and got the following times:

valueList for:  average time: 16,720209 ms
valueList foreach:  average time: 15,953483 ms
refList for:  average time: 19,381077 ms
refList foreach:  average time: 18,636781 ms
Done!

Why is the x64 version so much faster? I expected some difference, but not one this big.

I do not have access to other computers. Could you please run this on your machines and tell me the results? I'm using Visual Studio 2015 and I have an Intel Core i7 930.

Here's the SafeExit() method, so you can compile/run by yourself:

private static void SafeExit()
{
    Console.WriteLine("Done!");
    Console.ReadLine();
    System.Environment.Exit(1);
}

As requested, here are the results using double? instead of my DoubleWrapper.
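The change itself is small; a minimal sketch, assuming only the element type of refList and the read through .Value change:

    // DoubleWrapper replaced by Nullable<double>:
    var refList = new List<double?>(size);
    for (int i = 0; i < size; i++)
        refList.Add(i);            // int -> double -> double? implicit conversion

    // Inside the profiled loops the read becomes:
    var temp = refList[i].Value;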

Any CPU

valueList for:  average time: 482,98116 ms
valueList foreach:  average time: 478,837701 ms
refList for:  average time: 491,075915 ms
refList foreach:  average time: 483,206072 ms
Done!

x64

valueList for:  average time: 16,393947 ms
valueList foreach:  average time: 15,87007 ms
refList for:  average time: 18,267736 ms
refList foreach:  average time: 16,496038 ms
Done!

Last but not least: creating an x86 profile gives me almost the same results as using Any CPU.
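If it helps when reproducing this, one way to double-check which bitness a given build actually runs with (Any CPU can end up running either way, depending on the project's "Prefer 32-bit" setting) is to print it at startup; Environment.Is64BitProcess is available since .NET 4.0:

    // Sanity check before profiling: is this process actually 32-bit or 64-bit?
    Console.WriteLine("64-bit process: {0}", Environment.Is64BitProcess);
    Console.WriteLine("64-bit OS:      {0}", Environment.Is64BitOperatingSystem);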

Solution

I can reproduce this on .NET 4.5.2. No RyuJIT here. Both the x86 and x64 disassemblies look reasonable. Range checks and so on are the same. The same basic structure. No loop unrolling.

x86 uses a different set of float instructions. The performance of these instructions seems to be comparable with the x64 instructions except for the division:

  1. The 32 bit x87 float instructions use 80-bit (10 byte) extended precision internally.
  2. Extended precision division is super slow.

The division operation makes the 32 bit version extremely slow. Commenting out the division equalizes performance to a large degree (the 32 bit time drops from 430 ms to 3.25 ms).
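For reference, that comparison simply drops the division from the measured body, shown here for the "valueList for" case (the other three loops are changed the same way):

    unchecked
    {
        var temp = valueList[i];
        result *= temp;
        result += temp;
        // result /= temp;   // division removed for the comparison
        result -= temp;
    }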

Peter Cordes points out that the instruction latencies of the two floating point units are not that dissimilar. Maybe some of the intermediate results are denormalized numbers or NaN. These might trigger a slow path in one of the units. Or, maybe the values diverge between the two implementations because of 10 byte vs. 8 byte float precision.
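One way to test that hypothesis is to classify the intermediate results as they are produced. A rough sketch (the Classify helper is only for illustration; the threshold is the smallest normal double):

    static string Classify(double x)
    {
        const double smallestNormal = 2.2250738585072014E-308; // 2^-1022
        if (double.IsNaN(x)) return "NaN";
        if (x != 0 && Math.Abs(x) < smallestNormal) return "denormal";
        return "normal";
    }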

Peter Cordes also points out that all intermediate results are NaN: the list starts at 0, so the very first result /= temp computes 0/0, and the resulting NaN then propagates through every later operation. Removing this problem (valueList.Add(i + 1) so that no divisor is zero) mostly equalizes the results. Apparently, the 32 bit code does not like NaN operands at all. Let's print some intermediate values: if (i % 1000 == 0) Console.WriteLine(result);. This confirms that the data is now sane.
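Concretely, the two changes mentioned above look like this (the + 1 keeps every element non-zero; the periodic print goes inside the measured loop body and shows the running value stays finite):

    // Fill the list with 1..size instead of 0..size-1 so no divisor is zero.
    for (int i = 0; i < size; i++)
        valueList.Add(i + 1);

    // Diagnostic inside the loop body: dump every 1000th intermediate result.
    if (i % 1000 == 0)
        Console.WriteLine(result);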

When benchmarking, you need to use a realistic workload. But who would have thought that an innocent division could mess up your benchmark?!

Try simply summing the numbers to get a better benchmark.
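For example, a sum-only variant of one of the profiled loops could look like this (a sketch; the "valueList sum" label is made up, and the rest of the harness is the code from the question):

    Benchmarker.Profile("valueList sum: ", iterationCount, () =>
    {
        double result = 0;
        for (int i = 0; i < valueList.Count; i++)
            result += valueList[i];
        dummy = result;
    });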

Division and modulo are always very slow. If you modify the BCL Dictionary code to simply not use the modulo operator to compute the bucket index, performance measurably improves. This is how slow division is.
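To illustrate the general idea (this is not the actual BCL source), here are two ways to map a hash code to a bucket index; the modulo version needs an integer division, while the mask version avoids it but requires a power-of-two bucket count:

    static int BucketByModulo(int hashCode, int bucketCount)
    {
        return (hashCode & 0x7FFFFFFF) % bucketCount;                   // integer division under the hood
    }

    static int BucketByMask(int hashCode, int bucketCountPowerOfTwo)
    {
        return (hashCode & 0x7FFFFFFF) & (bucketCountPowerOfTwo - 1);   // no division; count must be a power of two
    }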

Here's the 32 bit code:

Here's the 64 bit code (same structure, fast division):

This is not vectorized despite SSE instructions being used.
