More efficient assembly code?

Question

I've recently begun studying assembly. I'm wondering why this assembly is written the way it is instead of the alternative "My Assembly" I list below, which cuts out one instruction. Any ideas? Is the case where this works too rare? It just seems wasteful to me to move the value of 3 to eax first.

C code:

#include<stdio.h>

int main()
{
   int a = 1;
   int b = 3;
   a = a+b;
   return a;
}

Assembly:

Dump of assembler code for function main:
0x080483dc <+0>:    push   ebp
0x080483dd <+1>:    mov    ebp,esp
0x080483df <+3>:    sub    esp,0x10
0x080483e2 <+6>:    mov    DWORD PTR [ebp-0x4],0x1
0x080483e9 <+13>:   mov    DWORD PTR [ebp-0x8],0x3
0x080483f0 <+20>:   mov    eax,DWORD PTR [ebp-0x8]
0x080483f3 <+23>:   add    DWORD PTR [ebp-0x4],eax
0x080483f6 <+26>:   mov    eax,DWORD PTR [ebp-0x4]
0x080483f9 <+29>:   leave  
0x080483fa <+30>:   ret   

"My Assembly":

Dump of assembler code for function main:
0x080483dc <+0>:    push   ebp
0x080483dd <+1>:    mov    ebp,esp
0x080483df <+3>:    sub    esp,0x10
0x080483e2 <+6>:    mov    DWORD PTR [ebp-0x4],0x1
0x080483e9 <+13>:   mov    DWORD PTR [ebp-0x8],0x3
0x080483f0 <+20>:   add    DWORD PTR [ebp-0x4],DWORD PTR [ebp-0x8]
0x080483f3 <+23>:   mov    eax,DWORD PTR [ebp-0x4]
0x080483f6 <+26>:   leave
0x080483f9 <+29>:   ret   

Recommended answer

As Michael Petch has already said in a comment, the real answer is that you're looking at unoptimized code. Compilers do all sorts of…well, inefficient things in unoptimized code. Sometimes they do it for compilation speed: optimizing takes longer than blindly translating the C code into assembly instructions, so when you want the fastest possible build, you turn off the optimizer and use only the compiler proper, a relatively simple-minded instruction translator.

Another reason that compilers do inefficient things in unoptimized code is to make debugging easier. For example, your IDE probably lets you set a breakpoint on each individual line of your C/C++ code. If the optimizer had turned multiple C/C++ lines into a single assembly instruction, it would be much more difficult, if not impossible, for you to set the breakpoints you wanted. This is why debugging optimized code is much harder, and often requires dropping down to the raw assembly and doing address-level debugging.
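To see the difference yourself, one option is to compile the question's code with and without optimization and compare the two listings. The sketch below moves the body of main into a function called add_locals (a name chosen here for illustration, not taken from the question). The commands in the comment assume GCC with 32-bit support installed, and the file name is hypothetical.

```c
/* The question's code, wrapped in an ordinary function so that it can
 * be compiled on its own. Assuming GCC with 32-bit support, commands
 * along these lines should produce two listings to compare:
 *
 *     gcc -m32 -O0 -S -masm=intel add_locals.c
 *     gcc -m32 -O2 -S -masm=intel add_locals.c
 */
int add_locals(void)
{
    int a = 1;
    int b = 3;
    a = a + b;
    return a;
}
```

At -O0 the output should resemble the question's listing, with both locals living on the stack. With optimization enabled the locals normally disappear, although the exact instructions depend on the compiler and its version.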

There are two dead giveaways here that tell you this is unoptimized code:

  1. The use of the leave instruction, which is essentially a historical relic of the x86's CISC days. The philosophy used to be to have a bunch of instructions that did complicated things, so the enter instruction was used at the beginning of a function to set up the stack frame, and the leave instruction brought up the rear, tearing down the stack frame. This made the programmer's job easier in block-structured languages, because you only needed to write a single instruction to accomplish multiple tasks. The problem is that since at least the 386, possibly the 286, the enter instruction has been substantially slower than doing the same thing with simpler, separate instructions. leave is also slower on the 386 and later, and is only useful when you're optimizing for size over speed (since it is smaller and not quite as slow as enter).

  2. The fact that a stack frame is being set up at all! With optimization enabled at any level, a 32-bit x86 compiler typically won't bother to generate prologue code that sets up a stack frame. That is, it won't save the original value of the EBP register and it won't set the EBP register to the location of the stack pointer (ESP) at function entry. Instead, it will perform the "frame pointer omission" optimization (the EBP register is called the "frame pointer"): rather than using EBP-relative offsets to access the stack, it will just use ESP-relative offsets. This wasn't possible in 16-bit x86 code, but it works fine in 32-bit code; it just takes more bookkeeping, since the stack pointer is subject to change while the frame pointer can be held constant. Such bookkeeping isn't nearly the problem for a compiler that it would be for a human, so this is an obvious optimization.
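One way to watch frame pointer omission happen, assuming GCC or Clang: both accept the -fomit-frame-pointer and -fno-omit-frame-pointer options, so the same source can be compiled both ways. The function sum_buffer below is a hypothetical example, not part of the question. It is written so that it really does need stack storage, which makes the prologue worth looking at.

```c
/* Hypothetical example: a function that really needs stack storage.
 * The volatile qualifier forces every element access to go through
 * memory, so the buffer cannot be kept purely in registers. */
int sum_buffer(void)
{
    volatile int buf[4] = {1, 2, 3, 4};
    int total = 0;
    for (int i = 0; i < 4; i++) {
        total += buf[i];   /* each read is a real load from the stack */
    }
    return total;
}
```

Compiling this at -O2 once with each option and checking whether the prologue touches EBP should show the effect. The exact output varies with the compiler, its version, and the target.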

Another problem with "your" assembly is that you've used an invalid instruction. There is no instruction* in the x86 architecture that accepts two explicit memory operands. At most, one of the operands can be a memory location. The other operand must be either a register or an immediate.
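The same restriction can be modeled in C. In the sketch below, the function and variable names are illustrative, and the local tmp plays the role of the register. The assembly in the comments shows the conceptual mapping only; it is not guaranteed compiler output.

```c
/* Models "add [dst], [src]", which x86 cannot encode directly.
 * 'tmp' stands in for a register such as EAX. */
void add_mem_to_mem(int *dst, const int *src)
{
    int tmp = *src;   /* mov eax, [src] : load one operand into a register */
    *dst += tmp;      /* add [dst], eax : memory destination, register source */
}
```

This is the shape of the compiler's original output in the question: one load into EAX, followed by an ADD whose other operand is in memory.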

The first-blush "optimized" version of this code would be something like:

; Allocate 8 bytes of space on the stack for our local variables, 'a' and 'b'.
sub  esp, 8

; Store the initial values of 'a' and 'b' into the allocated locations.
; (Note the use of ESP-relative offsets, rather than EBP-relative offsets.)
mov  DWORD PTR [esp],     1
mov  DWORD PTR [esp + 4], 3

; Load the value of 'a' into a register (EAX), and add 'b' to it.
; (Necessary because we can't do an ADD with two memory operands.)
mov  eax, DWORD PTR [esp]
add  eax, DWORD PTR [esp + 4]

; The result is now in EAX, which is exactly where we want it to be.
; (All x86 calling conventions return integer-sized values in EAX.)

; Clean up the stack, and return.
add  esp, 8
ret

We've "optimized" the stack initialization sequence, and lost a lot of the fluff. Things look pretty good now. In fact, this is essentially the code that a compiler will generate if you were to declare the a and b variables volatile. However, they're not actually volatile in the original code, which means that we can keep them entirely in registers. This frees us from having to do any costly memory stores/loads, and means we don't have to allocate or restore stack space at all!

; Load the 'a' and 'b' values into the EAX and EDX registers, respectively.
mov  eax, 1
mov  edx, 3

; Add 'b' to 'a' in a single operation, since ADD works fine with
; two register operands.
add  eax, edx

; Return, with result in EAX.
ret

Neat, right? This not only simplifies the code, but is actually a big performance win, since we're keeping everything in registers and never have to touch slow memory. Well, what else can we do? Remember that the ADD instruction allows us to use a register as the destination operand and an immediate as the source operand. That means we could skip a MOV and just do:

mov  eax, 1
add  eax, 3
ret

This is similar to what you would expect to see if you were, say, adding a constant 3 to a value already in memory:

add  DWORD PTR [esp + 4], 3

But in this case, an optimizing compiler would never do it that way. It will actually outsmart you, realizing that you're doing an addition of compile-time constants, and go ahead and do the addition at compile time. Thus, the actual output of a compiler—and indeed, the most efficient way to write this code—would be simply:

mov  eax, 4
ret

How anti-climactic. :-) The fastest code is always the code that doesn't have to execute.
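Both ends of this progression can be reproduced from C. In the sketch below (the names sum_volatile and FOLDED_SUM are illustrative), the volatile qualifier keeps the locals in memory, which should give roughly the stack-based version shown earlier. The enumerator shows the other end: 1 + 3 is a constant expression that the compiler evaluates at compile time.

```c
/* volatile forces real stores and loads, so even an optimizing
 * compiler has to keep 'a' and 'b' in memory. */
int sum_volatile(void)
{
    volatile int a = 1;
    volatile int b = 3;
    a = a + b;
    return a;
}

/* 1 + 3 is an integer constant expression, so it can initialize an
 * enumerator, whose value must be known at compile time. */
enum { FOLDED_SUM = 1 + 3 };
```

Dropping the two volatile qualifiers gives the compiler the freedom to reduce the whole function to a single constant load.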

*At least, not that I can think of among the ordinary arithmetic and move instructions. The x86 ISA is colossal, so there are dark corners where this statement is false: the string instructions MOVS and CMPS, for example, operate on two memory locations addressed implicitly through ESI and EDI. But it's true enough that you can rely on it as an axiom.
