Is inline assembly language slower than native C++ code?

Question

I tried to compare the performance of inline assembly language and C++ code, so I wrote a function that adds two arrays of size 2000 together 100000 times. Here's the code:

#include <cstdio>
#include <ctime>
#include <iostream>
using namespace std;

#define TIMES 100000
void calcuC(int *x,int *y,int length)
{
    for(int i = 0; i < TIMES; i++)
    {
        for(int j = 0; j < length; j++)
            x[j] += y[j];
    }
}


void calcuAsm(int *x,int *y,int lengthOfArray)
{
    __asm
    {
        mov edi,TIMES
        start:
        mov esi,0
        mov ecx,lengthOfArray
        label:
        mov edx,x
        push edx
        mov eax,DWORD PTR [edx + esi*4]
        mov edx,y
        mov ebx,DWORD PTR [edx + esi*4]
        add eax,ebx
        pop edx
        mov [edx + esi*4],eax
        inc esi
        loop label
        dec edi
        cmp edi,0
        jnz start
    };
}

Here's main():

int main() {
    bool errorOccured = false;
    setbuf(stdout,NULL);
    int *xC,*xAsm,*yC,*yAsm;
    xC = new int[2000];
    xAsm = new int[2000];
    yC = new int[2000];
    yAsm = new int[2000];
    for(int i = 0; i < 2000; i++)
    {
        xC[i] = 0;
        xAsm[i] = 0;
        yC[i] = i;
        yAsm[i] = i;
    }
    clock_t start = clock();
    calcuC(xC,yC,2000);

    //    calcuAsm(xAsm,yAsm,2000);
    //    for(int i = 0; i < 2000; i++)
    //    {
    //        if(xC[i] != xAsm[i])
    //        {
    //            cout<<"xC["<<i<<"]="<<xC[i]<<" "<<"xAsm["<<i<<"]="<<xAsm[i]<<endl;
    //            errorOccured = true;
    //            break;
    //        }
    //    }
    //    if(errorOccured)
    //        cout<<"Error occurs!"<<endl;
    //    else
    //        cout<<"Works fine!"<<endl;

    clock_t end = clock();

    //    cout<<"time = "<<(float)(end - start) / CLOCKS_PER_SEC<<"\n";

    cout<<"time = "<<end - start<<endl;
    return 0;
}

Then I ran the program five times to get the processor clock ticks, which can be treated as time. Each run calls only one of the functions mentioned above.

Here are the results.

Assembly version of the function:

Debug   Release
---------------
732     668
733     680
659     672
667     675
684     694
Average (Release): 677

C++ version of the function:

Debug   Release
---------------
1068    168
 999    166
1072    231
1002    166
1114    183
Average (Release): 182

In release mode, the C++ code is almost 3.7 times faster than the assembly code. Why?

I guess that the assembly code I wrote is not as efficient as the code generated by GCC. It's hard for an ordinary programmer like me to write code that is faster than what a compiler generates. Does that mean I should not trust the performance of assembly language written by hand, and should instead focus on C++ and forget about assembly language?

Answer

Yes, most of the time.

First of all, you start from the wrong assumption that a low-level language (assembly in this case) will always produce faster code than a high-level language (C++ and C in this case). That's not true. Is C code always faster than Java code? No, because there is another variable: the programmer. The way you write code and your knowledge of architecture details greatly influence performance (as you saw in this case).

You can always produce an example where handmade assembly code is better than compiled code, but usually it's a fictional example or a single routine, not a real program of 200,000+ lines of C++ code. I think compilers will produce better assembly code 95% of the time (and let's not forget that an assembler is a compiler too, and it may do a few optimizations). Only on rare occasions will you need to write assembly code: for a few short, heavily used, performance critical routines (see http://www.douglocke.com/Downloads/Performance-Critical%20Systems%20White%20Paper.pdf), or when you have to access features your favorite high-level language does not expose. Why is that?

First of all, because compilers can do optimizations that we can't even imagine (see this short list), and they will do them in seconds (when we may need days).

When you code in assembly you have to write well-defined functions with a well-defined call interface. Compilers, however, can take into account whole-program optimization and inter-procedural optimization, as well as register allocation, constant propagation, common subexpression elimination, instruction scheduling and other complex, non-obvious optimizations (the polytope model, for example). On RISC architectures people stopped worrying about this many years ago (instruction scheduling, for example, is very hard to tune by hand), and modern CISC CPUs have very long pipelines too.
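To make two of the named optimizations concrete, here is a minimal sketch of what constant propagation and common subexpression elimination amount to. Both function names are hypothetical, and the second function is only a hand-written approximation of what an optimizer might derive from the first, not the output of any particular compiler.

```cpp
// As a programmer might write it: (a * b) appears twice and
// `scale` is a constant known at compile time.
int scaledSumAsWritten(int a, int b)
{
    const int scale = 4;
    return (a * b) * scale + (a * b);
}

// Roughly what the optimizer can reduce it to: the constant is
// propagated, the product is computed once, and
// t * 4 + t is folded into t * 5.
int scaledSumAfterCse(int a, int b)
{
    int t = a * b;   // common subexpression, evaluated once
    return t * 5;
}
```

Both functions return the same value for any inputs that do not overflow. The point is that the compiler finds such rewrites mechanically across a whole program, while in assembly every one of them is manual work.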

For some complex microcontrollers even the system libraries are written in C instead of assembly, because their compilers produce better (and easier to maintain) final code.

Compilers can sometimes use MMX/SIMD instructions automatically, and if you don't use them you simply can't compete (other answers have already reviewed your assembly code very well). For loops alone, this is a short list of the loop optimizations a compiler commonly checks for (do you think you could do all of it yourself when your schedule has been decided for a C# program?). If you write something in assembly, I think you have to consider at least some simple optimizations. The textbook example for arrays is to unroll the loop (its size is known at compile time). Do it and run your test again.
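As an illustration of the unrolling suggestion, here is a minimal sketch of the inner loop of calcuC unrolled by a factor of four. The function name addArraysUnrolled4 and the unroll factor are assumptions made for this example and are not taken from the original post; the same idea carries over to the assembly version.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: adds y into x element by element,
// processing four elements per loop iteration.
void addArraysUnrolled4(int *x, const int *y, std::size_t length)
{
    assert(x != nullptr && y != nullptr);

    std::size_t j = 0;
    // Main body: four independent additions per loop test and branch.
    for (; j + 4 <= length; j += 4)
    {
        x[j]     += y[j];
        x[j + 1] += y[j + 1];
        x[j + 2] += y[j + 2];
        x[j + 3] += y[j + 3];
    }
    // Remainder: covers lengths that are not a multiple of four.
    for (; j < length; ++j)
        x[j] += y[j];
}
```

Calling addArraysUnrolled4(xC, yC, 2000) in place of the inner loop reduces the number of loop-control instructions per element. An optimizing compiler may well apply this transformation (or vectorize the loop) on its own, which is part of why the release build is hard to beat.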

These days it's also really uncommon to need assembly language for another reason: the plethora of different CPUs. Do you want to support them all? Each has a specific microarchitecture and some specific instruction sets. They have different numbers of functional units, and assembly instructions should be arranged to keep them all busy. If you write in C you may use PGO, but in assembly you will need deep knowledge of that specific architecture (and you will have to rethink and redo everything for another architecture). For small tasks the compiler usually does better, and for complex tasks the work usually doesn't pay off (and the compiler may do better anyway).

If you sit down and take a look at your code, you'll probably see that you gain more by redesigning your algorithm than by translating it to assembly (read this great post here on SO: http://stackoverflow.com/questions/926266/performance-optimization-strategies-of-last-resort/927773#927773). There are high-level optimizations (and hints to the compiler) that you can apply effectively before you need to resort to assembly language.
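For the benchmark in the question, one possible algorithmic redesign is sketched below: adding y[j] to x[j] TIMES times gives the same result as adding TIMES * y[j] once. The function calcuOnce and its times parameter are hypothetical names for this example, and the sketch assumes that the product and the resulting sum fit in an int (true for the question's data, where y[j] is at most 1999).

```cpp
#include <cassert>

// Hypothetical replacement for calcuC: a single pass over the arrays
// instead of `times` passes.
// Assumption: times * y[j] and the updated x[j] do not overflow int.
void calcuOnce(int *x, const int *y, int length, int times)
{
    assert(length >= 0 && times >= 0);

    for (int j = 0; j < length; j++)
        x[j] += times * y[j];
}
```

A call such as calcuOnce(xC, yC, 2000, TIMES) does 2000 multiply-adds instead of 200 million additions, a gain that no amount of hand-tuned assembly applied to the original loops would match.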

All this said, even when you can produce assembly code that is 5~10 times faster, you should ask your customers whether they prefer to pay for one week of your time or to buy a $50 faster CPU. More often than not (and especially in LOB applications), extreme optimization is simply not required of most of us.
