Huge performance difference (26x faster) when compiling for 32 and 64 bits


Question

I was trying to measure the difference of using a for and a foreach when accessing lists of value types and reference types.

I used the following class to do the profiling.

public static class Benchmarker
{
    public static void Profile(string description, int iterations, Action func)
    {
        Console.Write(description);

        // Warm up
        func();

        Stopwatch watch = new Stopwatch();

        // Clean up
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        watch.Start();
        for (int i = 0; i < iterations; i++)
        {
            func();
        }
        watch.Stop();

        Console.WriteLine(" average time: {0} ms", watch.Elapsed.TotalMilliseconds / iterations);
    }
}

I used double for my value type. And I created this 'fake class' to test reference types:

class DoubleWrapper
{
    public double Value { get; set; }

    public DoubleWrapper(double value)
    {
        Value = value;
    }
}

Finally I ran this code and compared the time differences.

static void Main(string[] args)
{
    int size = 1000000;
    int iterationCount = 100;

    var valueList = new List<double>(size);
    for (int i = 0; i < size; i++) 
        valueList.Add(i);

    var refList = new List<DoubleWrapper>(size);
    for (int i = 0; i < size; i++) 
        refList.Add(new DoubleWrapper(i));

    double dummy;

    Benchmarker.Profile("valueList for: ", iterationCount, () =>
    {
        double result = 0;
        for (int i = 0; i < valueList.Count; i++)
        {
             unchecked
             {
                 var temp = valueList[i];
                 result *= temp;
                 result += temp;
                 result /= temp;
                 result -= temp;
             }
        }
        dummy = result;
    });

    Benchmarker.Profile("valueList foreach: ", iterationCount, () =>
    {
        double result = 0;
        foreach (var v in valueList)
        {
            var temp = v;
            result *= temp;
            result += temp;
            result /= temp;
            result -= temp;
        }
        dummy = result;
    });

    Benchmarker.Profile("refList for: ", iterationCount, () =>
    {
        double result = 0;
        for (int i = 0; i < refList.Count; i++)
        {
            unchecked
            {
                var temp = refList[i].Value;
                result *= temp;
                result += temp;
                result /= temp;
                result -= temp;
            }
        }
        dummy = result;
    });

    Benchmarker.Profile("refList foreach: ", iterationCount, () =>
    {
        double result = 0;
        foreach (var v in refList)
        {
            unchecked
            {
                var temp = v.Value;
                result *= temp;
                result += temp;
                result /= temp;
                result -= temp;
            }
        }

        dummy = result;
    });

    SafeExit();
}

I selected Release and Any CPU options, ran the program and got the following times:

valueList for:  average time: 483,967938 ms
valueList foreach:  average time: 477,873079 ms
refList for:  average time: 490,524197 ms
refList foreach:  average time: 485,659557 ms
Done!

Then I selected Release and x64 options, ran the program and got the following times:

valueList for:  average time: 16,720209 ms
valueList foreach:  average time: 15,953483 ms
refList for:  average time: 19,381077 ms
refList foreach:  average time: 18,636781 ms
Done!

Why is the x64 version so much faster? I expected some difference, but nothing this big.

I do not have access to other computers. Could you please run this on your machines and tell me the results? I'm using Visual Studio 2015 and I have an Intel Core i7 930.

Here's the SafeExit() method, so you can compile/run by yourself:

private static void SafeExit()
{
    Console.WriteLine("Done!");
    Console.ReadLine();
    System.Environment.Exit(1);
}

As requested, using double? instead of my DoubleWrapper:

Any CPU

valueList for:  average time: 482,98116 ms
valueList foreach:  average time: 478,837701 ms
refList for:  average time: 491,075915 ms
refList foreach:  average time: 483,206072 ms
Done!

x64

valueList for:  average time: 16,393947 ms
valueList foreach:  average time: 15,87007 ms
refList for:  average time: 18,267736 ms
refList foreach:  average time: 16,496038 ms
Done!

Last but not least: creating an x86 profile gives me almost the same results as Any CPU.

Solution

I can reproduce this on 4.5.2. No RyuJIT here. Both x86 and x64 disassemblies look reasonable. Range checks and so on are the same. The same basic structure. No loop unrolling.

x86 uses a different set of float instructions. The performance of these instructions seems to be comparable with the x64 instructions except for the division:

  1. The 32 bit x87 float instructions use 10 byte precision internally.
  2. Extended precision division is super slow.

The division operation makes the 32 bit version extremely slow. Commenting out the division largely equalizes performance (32 bit drops from 430ms to 3.25ms).
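Roughly, the change looks like this (a sketch of the question's valueList for benchmark with only the division line disabled):

Benchmarker.Profile("valueList for, no division: ", iterationCount, () =>
{
    double result = 0;
    for (int i = 0; i < valueList.Count; i++)
    {
        unchecked
        {
            var temp = valueList[i];
            result *= temp;
            result += temp;
            // result /= temp;   // this single line accounts for almost all of the 32 bit time
            result -= temp;
        }
    }
    dummy = result;
});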

Peter Cordes points out that the instruction latencies of the two floating point units are not that dissimilar. Maybe some of the intermediate results are denormalized numbers or NaN. These might trigger a slow path in one of the units. Or, maybe the values diverge between the two implementations because of 10 byte vs. 8 byte float precision.

Peter Cordes also points out that all intermediate results are NaN: result starts at 0 and valueList[0] is 0, so the very first result /= temp computes 0/0 = NaN, and the NaN then propagates through every subsequent operation. Removing this problem (valueList.Add(i + 1) so that no divisor is zero) mostly equalizes the results. Apparently, the 32 bit code does not like NaN operands at all. Let's print some intermediate values: if (i % 1000 == 0) Console.WriteLine(result);. This confirms that the data is now sane.
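Put together, the two changes amount to something like this (a sketch; the print is only there to check the data and should be removed before timing):

var valueList = new List<double>(size);
for (int i = 0; i < size; i++)
    valueList.Add(i + 1);                       // i + 1: no element is zero, so 0/0 = NaN never occurs

Benchmarker.Profile("valueList for, no NaN: ", iterationCount, () =>
{
    double result = 0;
    for (int i = 0; i < valueList.Count; i++)
    {
        unchecked
        {
            var temp = valueList[i];
            result *= temp;
            result += temp;
            result /= temp;
            result -= temp;
        }
        if (i % 1000 == 0) Console.WriteLine(result);   // diagnostic only: confirms the values stay sane
    }
    dummy = result;
});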

When benchmarking, you need a realistic workload. But who would have thought that an innocent division could mess up your benchmark?!

Try simply summing the numbers to get a better benchmark.
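For example, something along these lines (a sketch reusing the question's Benchmarker, valueList, iterationCount and dummy):

Benchmarker.Profile("valueList sum: ", iterationCount, () =>
{
    double result = 0;
    for (int i = 0; i < valueList.Count; i++)
    {
        result += valueList[i];     // addition only: no division, no NaN propagation
    }
    dummy = result;
});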

Division and modulo are always very slow. If you modify the BCL Dictionary code to simply not use the modulo operator to compute the bucket index, performance measurably improves. That is how slow division is.
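To give a feel for the trick (an illustrative sketch, not the actual BCL change; the helper names are made up): Dictionary<TKey, TValue> computes the bucket as hashCode % buckets.Length, and with a power-of-two table size that modulo can be replaced by a bitwise AND.

// Illustrative only. .NET Framework's Dictionary masks off the sign bit and then takes a modulo:
static int BucketByModulo(int hashCode, int bucketCount)
{
    return (hashCode & 0x7FFFFFFF) % bucketCount;               // involves an integer division
}

// With a power-of-two bucket count, the same job can be done without any division:
static int BucketByMask(int hashCode, int powerOfTwoBucketCount)
{
    return hashCode & (powerOfTwoBucketCount - 1);              // single bitwise AND
}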

(The 32 bit and 64 bit disassembly listings are not reproduced here; both have the same structure, but the 64 bit version uses fast division.)

The 64 bit code is not vectorized, even though SSE instructions are being used: they operate on one scalar double at a time.
