Intel x86 Assembly: How to tell how many bits wide an argument is?


Question

In the following assembly:

mov     dx, word ptr [ebp+arg_0]
mov     [ebp+var_8], dx

Thinking of this as an assembled C function, how many bits wide is (the argument to the C function) arg_0? How many bits wide is (the local C variable) var_8? That is to say, is it a short, an int, etc.

From this, it appears that var_8 is 16 bits, since dx is a 16-bit register. But I'm not sure about arg_0.

If the assembly also contains this line:

mov     ecx, [ebp+arg_0]

Would that imply that arg_0 is a 32-bit value?

Recommended answer

There are three principles to understand in order to tackle this question.

  1. The assembler must be able to infer the correct length.
    Although Intel syntax does not use a size suffix as AT&T syntax does, the assembler still needs a way to find the size of the operands.
     
    The ambiguous instruction mov [var], 1 is written as movl $1, var in AT&T syntax if the store is 32 bits wide (note the suffix l), so it is easy to tell the size of the immediate operand.
    An assembler that accepts Intel syntax needs a way to infer this size. There are three widely used options:

  • It is inferred from the other operand.
    This is the case when a register is involved, for example.
    E.g. mov [var], dx is a 16-bit store.
  • It is stated explicitly.
    mov WORD [var], dx
    MASM-syntax assemblers need a PTR after the size, because their size specifiers are only allowed on memory operands, not immediates or anywhere else.
    This is the form I prefer because it is clear, it stands out and it is a bit less error-prone (mov WORD [var], edx is invalid).
  • It is inferred from the context.

 var db 0

 mov [var], 1   ; MASM/TASM only: these assemblers associate sizes with labels

MASM-syntax assemblers can infer that, since var is declared with db, its size is 8 bits and so is the store (by default).
I don't like this form because it makes the code harder to read (one good thing about assembly is the "locality" of the semantics of the instructions) and mixes high-level concepts like types with low-level concepts like store sizes. That's why NASM's syntax doesn't support magical / non-local size association.

 
 
In short, the assembler must have a way to tell the size; otherwise it will reject the code. (Some low-quality assemblers, such as emu8086, instead fall back to a default operand size in ambiguous cases.)


If you are looking at disassembled code, disassemblers usually play it safe and always state the size explicitly.
If yours does not, you must resort to manual inspection of the opcodes. If the disassembler won't show the opcodes, it is time to switch to another one.
The disassembler has no trouble finding the size of an operand: the binary code it is disassembling is the same code the CPU executes, and the instruction opcodes encode the operand size.
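As a rough illustration of how the encoding carries the operand size, here is a minimal C sketch that reports the width of a few mov encodings as decoded in 32-bit mode. The function name mov_operand_bits is made up for this example, and it only understands opcodes 88, 89, 8A and 8B with an optional 66 prefix; a real disassembler handles far more.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: operand width in bits of a register/memory MOV
 * (opcodes 88, 89, 8A, 8B) as decoded in 32-bit mode.
 * Returns 0 for anything this sketch does not recognise. */
int mov_operand_bits(const unsigned char *code, size_t len)
{
    size_t i = 0;
    int has_66 = 0;

    assert(code != NULL || len == 0);

    /* 0x66 is the operand-size override prefix: in 32-bit mode it
     * switches the default 32-bit operand size to 16-bit. */
    if (i < len && code[i] == 0x66) {
        has_66 = 1;
        i++;
    }
    if (i >= len)
        return 0;

    switch (code[i]) {
    case 0x88: /* mov r/m8, r8 */
    case 0x8A: /* mov r8, r/m8 */
        return 8;
    case 0x89: /* mov r/m16/32, r16/32 */
    case 0x8B: /* mov r16/32, r/m16/32 */
        return has_66 ? 16 : 32;
    default:
        return 0;
    }
}
```

Assuming arg_0 sits at offset 8 from ebp, mov dx, [ebp+8] encodes as 66 8B 55 08 and mov ecx, [ebp+8] as 8B 4D 08, so the sketch reports 16 and 32 bits respectively.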
 

  2. The C language is deliberately lax about how C types map to numbers of bits
 
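Trying to infer the type of a variable from the disassembly is not futile, but you must take the platform into account too, not just the architecture.
The main data models are summarized here: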

Datatype    LP64    ILP64   LLP64   ILP32   LP32
char        8       8       8       8       8
short       16      16      16      16      16
_int32      -       32      -       -       -
int         32      64      32      32      16
long        64      64      32      32      32
long long   -       -       64      [64]    -
pointer     64      64      64      32      32

Windows on x86_64 uses LLP64. Other OSes on x86-64 typically use the x86-64 System V ABI, an LP64 model.
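To see which row of the table a given toolchain follows, a small C sketch like the one below can help. The enum and function names are invented for this example, and the classification only looks at int, long and pointer widths, assuming 8-bit bytes.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

static_assert(CHAR_BIT == 8, "the table above assumes 8-bit bytes");

/* Hypothetical names for the data models listed in the table. */
enum data_model { DM_OTHER, DM_LP64, DM_ILP64, DM_LLP64, DM_ILP32, DM_LP32 };

/* Widths in bits of the three types that distinguish the models,
 * as seen by the compiler building this file. */
size_t int_bits(void)     { return sizeof(int) * CHAR_BIT; }
size_t long_bits(void)    { return sizeof(long) * CHAR_BIT; }
size_t pointer_bits(void) { return sizeof(void *) * CHAR_BIT; }

/* Match the widths of int, long and pointer against the table. */
enum data_model classify_data_model(size_t i, size_t l, size_t p)
{
    if (i == 32 && l == 64 && p == 64) return DM_LP64;
    if (i == 64 && l == 64 && p == 64) return DM_ILP64;
    if (i == 32 && l == 32 && p == 64) return DM_LLP64;
    if (i == 32 && l == 32 && p == 32) return DM_ILP32;
    if (i == 16 && l == 32 && p == 32) return DM_LP32;
    return DM_OTHER;
}
```

Calling classify_data_model(int_bits(), long_bits(), pointer_bits()) tells you the model of the compiler you built with, which is not necessarily the model of the binary you are reverse engineering.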


  3. Assembly doesn't have types, and programmers can exploit that
 
Even compilers can exploit that.
 
In the linked example, a variable bar of type long long (64-bit) is ORed with 1, and clang saves a REX prefix by ORing only the low byte. This causes a store-forwarding stall if the variable is reloaded right away with two dword loads or one qword load, so it's probably not a good choice, especially in 32-bit mode, where or dword [bar], 1 is the same size and the variable is likely to be reloaded as two 32-bit halves.
Someone looking at the disassembled code incautiously could infer that bar is 8 bits wide.
Tricks of this kind, where a variable or an object is accessed only partially, are common.
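The effect of such a partial access can be reproduced in C. The sketch below is only an illustration with made-up function names: it sets bit 0 of a 64-bit variable once with a full-width OR and once by touching only its first byte, which gives the same result on a little-endian machine such as x86.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when the least significant byte is stored first (as on x86). */
int is_little_endian(void)
{
    const unsigned int probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* What the C source asks for: a full-width OR on a 64-bit variable. */
void set_low_bit_full(long long *bar)
{
    assert(bar != NULL);
    *bar |= 1LL;
}

/* The same effect obtained by touching only one byte, in the spirit of
 * the byte-sized OR described above.
 * Assumes a little-endian machine, where byte 0 is the low byte. */
void set_low_bit_partial(long long *bar)
{
    assert(bar != NULL);
    unsigned char *bytes = (unsigned char *)bar;
    bytes[0] |= 1u;
}
```

A disassembly of the second function shows an 8-bit access to a variable that is really 64 bits wide, which is exactly the trap for the careless reader.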
 
Correctly guessing the size of a variable takes a bit of expertise.
For example, structure members are usually padded, so there is unused space between them that may fool an inexperienced reader into thinking that a member is bigger than it is.
The stack also has precise alignment requirements that may widen the space a parameter occupies.
 
The rule of thumb is that compilers generally prefer to keep the stack 16-byte aligned, and naturally-align all variables. Multiple narrow variables are packed into a single dword. When passing function args via the stack, each one is padded to 32 or 64-bit, but that doesn't apply to the layout of locals on the stack.
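Padding is easy to observe from the C side. The following sketch uses a hypothetical struct, invented for this example, and measures the gap between a char member and the int that follows it.

```c
#include <assert.h>
#include <stddef.h>

/* A hypothetical struct used only for illustration. */
struct sample {
    char  tag;    /* 1 byte of data */
    int   value;  /* placed at the next offset suitable for an int */
    short flags;
};

static_assert(offsetof(struct sample, tag) == 0,
              "the first member always starts at offset 0");

/* Number of unused padding bytes between tag and value. */
size_t padding_after_tag(void)
{
    return offsetof(struct sample, value) - sizeof(char);
}

/* Bytes the whole struct occupies, including any tail padding. */
size_t sample_size(void)
{
    return sizeof(struct sample);
}

/* Bytes that actually hold member data. */
size_t sample_payload(void)
{
    return sizeof(char) + sizeof(int) + sizeof(short);
}
```

On common x86 ABIs, where int is 4 bytes with 4-byte alignment, this typically reports 3 padding bytes after tag and a 12-byte struct holding only 7 bytes of data. In a disassembly, the offsets alone could suggest that tag is 4 bytes wide.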

Finally, to answer your question


Yes, from the first snippet of code you can assume that the value of arg_0 is 16 bits wide.
Note that since it's a function arg passed on the stack, it actually occupies a 32-bit slot, but the upper 16 bits are not used.
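A small C model of that stack slot may make this concrete. The function names are made up for this sketch; each one mimics one of the two loads from the question.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uint32_t) == 2 * sizeof(uint16_t),
              "a 32-bit stack slot holds two 16-bit halves");

/* Models the 16-bit load `mov dx, word ptr [ebp+arg_0]`:
 * only the low 16 bits of the 32-bit stack slot are read. */
uint16_t load_low_word(uint32_t stack_slot)
{
    return (uint16_t)(stack_slot & 0xFFFFu);
}

/* Models a full 32-bit load such as `mov ecx, [ebp+arg_0]`. */
uint32_t load_dword(uint32_t stack_slot)
{
    return stack_slot;
}
```

Whatever the upper half of the slot contains, load_low_word ignores it, so a 16-bit argument and the low half of a 32-bit argument look the same to the first snippet.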

If a mov ecx, [ebp+arg_0] appeared later in the code, then you would have to revisit your guess about the size of the value of arg_0: it is certainly at least 32 bits wide.
It is unlikely to be 64-bit (64-bit types are rare in 32-bit code, so we can make this bet), so we can conclude it is 32-bit.
In that case, the first snippet was evidently one of those tricks that use only part of a variable.

That's how you reverse engineer the size of a variable: you make a guess, verify that it is consistent with the rest of the code, revisit it if it is not, and repeat.
With time you'll make mostly good guesses that need no revision at all.
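The guess-and-revise loop can be boiled down to a very small C sketch. The helper below is hypothetical: given the width of every access seen to one stack location, it returns the widest one, which is the smallest size consistent with all of them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given the widths (in bits) of every access seen
 * to one stack location, return the widest access, i.e. the smallest
 * variable size consistent with all of them.
 * Returns 0 when there are no accesses. */
int guess_width_bits(const int *access_bits, size_t count)
{
    assert(access_bits != NULL || count == 0);

    int guess = 0;
    for (size_t i = 0; i < count; i++) {
        if (access_bits[i] > guess)
            guess = access_bits[i];  /* a wider access forces a revision */
    }
    return guess;
}
```

For the accesses in the question, {16} gives a first guess of 16 bits and {16, 32} revises it to 32. This is only a lower bound: partial accesses and padding mean the real type can still be wider, so the result remains a guess to check against the rest of the code.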
