Strange performance behaviour for 64 bit modulo operation


Question

The last three of these method calls take approximately twice as long as the first four.

The only difference is that their arguments no longer fit in an int. But should this matter? The parameter is declared as long, so the calculation should use long anyway. Does the modulo operation use a different algorithm for numbers greater than int.MaxValue?

I am using an AMD Athlon 64 3200+, Windows XP SP3 and Visual Studio 2008.

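    using System;
    using System.Diagnostics;

    class Program
    {
        static void Main()
        {
            Stopwatch sw = new Stopwatch();
            // The first four arguments fit in an int; the last three do not.
            TestLong(sw, int.MaxValue - 3L);
            TestLong(sw, int.MaxValue - 2L);
            TestLong(sw, int.MaxValue - 1L);
            TestLong(sw, int.MaxValue);
            TestLong(sw, int.MaxValue + 1L);
            TestLong(sw, int.MaxValue + 2L);
            TestLong(sw, int.MaxValue + 3L);
            Console.ReadLine();
        }

        // Times roughly 20 million 64-bit remainder operations with num as the dividend.
        static void TestLong(Stopwatch sw, long num)
        {
            long n = 0;
            sw.Reset();
            sw.Start();
            for (long i = 3; i < 20000000; i++)
            {
                n += num % i;
            }
            sw.Stop();
            Console.WriteLine(sw.Elapsed);
        }
    }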

I have now tried the same thing in C, and the issue does not occur there: all modulo operations take the same time, in both release and debug mode, with and without optimizations turned on:

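/* In a Visual Studio project that uses precompiled headers,
   keep #include "stdafx.h" as the first line. */
#include <stdio.h>
#include <time.h>
#include <limits.h>

/* Times roughly two billion 64-bit remainder operations with num as the dividend. */
static void TestLong(long long num)
{
    long long n = 0;

    clock_t t = clock();
    for (long long i = 3; i < 20000000LL*100; i++)
    {
        n += num % i;
    }

    /* clock_t is not guaranteed to be int, so cast before printing. */
    printf("%ld - %lld\n", (long)(clock()-t), n);
}

int main()
{
    /* sizeof yields size_t; cast to int to match the %i conversions. */
    printf("%i %i %i %i\n\n", (int)sizeof(int), (int)sizeof(long),
           (int)sizeof(long long), (int)sizeof(void*));

    TestLong(3);
    TestLong(10);
    TestLong(131);
    TestLong(INT_MAX - 1L);
    TestLong(UINT_MAX + 1LL);
    TestLong(INT_MAX + 1LL);
    TestLong(LLONG_MAX - 1LL);

    getchar();
    return 0;
}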

Thanks for the great suggestions. I found that neither .NET nor C (in debug as well as in release mode) uses a single CPU instruction to calculate the remainder; instead, both call a function that does the work.

In the C program I could get the name of that function, which is "_allrem". The debugger also displayed the full source comments for this file, so I learned that this algorithm special-cases 32-bit divisors, whereas in the .NET application it was the dividend that was special-cased.
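
To illustrate what special-casing a 32-bit divisor can look like, here is a minimal sketch in C. It is not the "_allrem" source: the names rem_u64_by_u32 and rem_u64 are hypothetical, and the sketch handles only unsigned values, whereas a real runtime routine also has to deal with signs. The idea is that when the divisor fits in 32 bits, the remainder can be computed with two narrower division steps, whatever the size of the dividend.

```c
#include <assert.h>
#include <stdint.h>

/* Remainder of a 64-bit unsigned dividend by a divisor that fits in 32 bits.
 * This is schoolbook long division in base 2^32 (illustrative only). */
static uint32_t rem_u64_by_u32(uint64_t dividend, uint32_t divisor)
{
    assert(divisor != 0);

    uint32_t hi = (uint32_t)(dividend >> 32);
    uint32_t lo = (uint32_t)dividend;

    /* Step 1: reduce the high word. The result r is strictly less than divisor. */
    uint32_t r = hi % divisor;

    /* Step 2: divide (r * 2^32 + lo) by divisor. Because r < divisor, the
     * quotient fits in 32 bits; on 32-bit x86 this corresponds to one div
     * instruction. A 64-bit type is used here only to stay portable C. */
    uint64_t partial = ((uint64_t)r << 32) | lo;
    return (uint32_t)(partial % divisor);
}

/* Hypothetical dispatcher: take the cheap path when the divisor is small. */
static uint64_t rem_u64(uint64_t dividend, uint64_t divisor)
{
    assert(divisor != 0);

    if (divisor <= UINT32_MAX)
        return rem_u64_by_u32(dividend, (uint32_t)divisor);

    /* Stand-in for the slower, general 64-bit by 64-bit algorithm. */
    return dividend % divisor;
}
```

With a dispatcher of this shape, which path runs is decided by the divisor alone, which would be consistent with the C timings not changing with the dividend.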

I also found that the performance of the C program is affected only by the value of the divisor, not by the value of the dividend. Another test showed that the performance of the remainder function in the .NET program depends on both the dividend and the divisor.
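
One way a remainder routine could end up depending on both operands is a fast path that is taken only when the dividend and the divisor both fit in 32 bits. The sketch below is purely hypothetical: the .NET helper's source is not shown here, and the names fits_in_int32 and rem_i64_with_fast_path are invented for illustration. It only demonstrates that such a check would make a dividend just above int.MaxValue fall onto the slower path.

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero if v can be represented as a 32-bit signed integer. */
static int fits_in_int32(int64_t v)
{
    return v >= INT32_MIN && v <= INT32_MAX;
}

/* Hypothetical 64-bit signed remainder with a 32-bit fast path. */
static int64_t rem_i64_with_fast_path(int64_t dividend, int64_t divisor)
{
    assert(divisor != 0);

    /* x % -1 is always 0; handling it first avoids the overflow that
     * INT32_MIN % -1 and INT64_MIN % -1 would otherwise cause. */
    if (divisor == -1)
        return 0;

    /* Fast path: both operands are small, so a 32-bit remainder suffices. */
    if (fits_in_int32(dividend) && fits_in_int32(divisor))
        return (int32_t)dividend % (int32_t)divisor;

    /* Slow path: stand-in for a full 64-bit remainder routine. */
    return dividend % divisor;
}
```

Under this assumption, a call with 2147483647 as the dividend takes the fast path, while 2147483648 does not, even though the divisor is the same in both cases.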

BTW: Even simple additions of long long values are performed by consecutive add and adc instructions. So even if my processor calls itself 64-bit, it really isn't :(
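
For readers unfamiliar with the add/adc pair, the following C sketch mimics what those two instructions do when a 64-bit value is held as two 32-bit halves. The type u64_halves and the function names are hypothetical; the final assertion simply cross-checks the result against the compiler's native 64-bit addition.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value stored as two 32-bit words (hypothetical helper type). */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_halves;

static u64_halves add_halves(u64_halves a, u64_halves b)
{
    u64_halves sum;

    /* "add": add the low words; the result wraps modulo 2^32. */
    sum.lo = a.lo + b.lo;

    /* A wrapped result is smaller than either input, which signals a carry. */
    uint32_t carry = sum.lo < a.lo;

    /* "adc": add the high words plus the carry from the low words. */
    sum.hi = a.hi + b.hi + carry;
    return sum;
}

static uint64_t add_u64_via_halves(uint64_t a, uint64_t b)
{
    u64_halves x = { (uint32_t)a, (uint32_t)(a >> 32) };
    u64_halves y = { (uint32_t)b, (uint32_t)(b >> 32) };
    u64_halves s = add_halves(x, y);
    uint64_t result = ((uint64_t)s.hi << 32) | s.lo;

    /* Cross-check against native 64-bit (wrapping) addition. */
    assert(result == a + b);
    return result;
}
```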

I have now run the C app on Windows 7 x64 edition, compiled with Visual Studio 2010. The funny thing is that the performance behaviour stays the same, although true 64-bit instructions are now used (I checked the assembly source).

Recommended answer

What a curious observation. Here's something you can do to investigate this further: add a "pause" to the program, such as a Console.ReadLine, placed AFTER the first call to your method. Then build the program in "release" mode and start it outside the debugger. When it reaches the pause, attach the debugger, step through it, and take a look at the code jitted for the method in question. It should be pretty easy to find the loop body.

It would be interesting to know how the generated loop body differs from that in your C program.

The reason for jumping through all those hoops is that the jitter changes the code it generates when jitting a "debug" assembly or when jitting a program that already has a debugger attached; in those cases it jits code that is easier to understand in a debugger. It is more interesting to see what the jitter considers the "best" code for this case, so you have to attach the debugger late, after the jitter has run.
