Distinguishing memory from constant in GNU as .intel_syntax

Problem description

I have an instruction written in Intel syntax (using gas as my assembler) that looks like this:

mov rdx, msg_size
...
msg: .ascii "Hello, world!\n"
     .set msg_size, . - msg

but that mov instruction is being assembled to mov 0xe,%rdx, rather than mov $0xe,%rdx, as I would expect. How should I write the first instruction (or the definition of msg_size) to get the expected behavior?
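
As a quick check of where the value 0xe comes from, here is a minimal Python sketch. It assumes the string ends with a single newline byte, as in the listing above, and simply counts the bytes that `. - msg` would measure.

```python
# The same bytes the .ascii directive emits (13 characters plus a newline).
msg = b"Hello, world!\n"

# `. - msg` placed right after the string is the distance from msg to the
# current location, i.e. the length of the string in bytes.
msg_size = len(msg)

print(hex(msg_size))
```

So the operand in the disassembly is the intended constant; only its interpretation (memory address versus immediate) is wrong.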

Recommended answer

Use mov edx, OFFSET symbol to get the symbol "address" as an immediate, rather than loading from it as an address. This works for actual label addresses as well as symbols you set to an integer with .set.

For the msg address (not the msg_size assemble-time constant) in 64-bit code, you may want lea rdx, [RIP+msg] for a PIE executable where static addresses don't fit in 32 bits. See: How to load address of function or label into register

In GAS .intel_syntax noprefix mode:

  • OFFSET symbol works like AT&T $symbol. This is somewhat like MASM.
  • symbol works like AT&T symbol (i.e. a dereference) for unknown symbols.
  • [symbol] is always an effective-address, never an immediate, in GAS and NASM/YASM. LEA doesn't load from the address but it still uses the memory-operand machine encoding. (That's why lea uses the same syntax).

GAS is a one-pass assembler (which goes back and fills in symbol values once they're known).

It decides on the opcode and encoding for mov rdx, symbol when it first encounters that line. An earlier msize= . - msg or .equ / .set will make it choose mov reg, imm32, but a later directive won't be visible yet.

The default assumption for not-yet-defined symbols is that symbol is an address in some section (like you get from defining it with a label like symbol:, or from .set symbol, .). And because GAS .intel_syntax is like MASM not NASM, a bare symbol is treated like [symbol] - a memory operand.
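
The ordering rule described above can be modelled with a toy sketch in Python. This is not how GAS is implemented. The function name classify_mov_operands and the simplified input format (only `.set name, <integer>` and `mov reg, operand` lines) are assumptions made for illustration. It only mimics the decision taken the first time each line is seen.

```python
def classify_mov_operands(lines):
    """Toy one-pass model of how a bare symbol in `mov reg, operand` is read.

    Assumed, simplified input: each line is either `.set name, <integer>`
    or `mov reg, operand`. Real GAS handles far more than this.
    """
    numeric_symbols = set()  # symbols already defined as plain numbers
    decisions = []
    for line in lines:
        text = line.strip()
        if text.startswith(".set "):
            name = text[len(".set "):].split(",", 1)[0].strip()
            numeric_symbols.add(name)
        elif text.startswith("mov "):
            operand = text.split(",", 1)[1].strip()
            if operand.startswith("OFFSET "):
                kind = "immediate"   # explicit: always an immediate
            elif operand.startswith("["):
                kind = "memory"      # brackets: always an effective address
            elif operand[0].isdigit() or operand in numeric_symbols:
                kind = "immediate"   # literal, or symbol already known to be a number
            else:
                kind = "memory"      # not yet defined: assumed to be an address
            decisions.append((text, kind))
    return decisions


source = [
    "mov rdx, msg_size",          # used before the definition
    ".set msg_size, 14",
    "mov rdx, msg_size",          # used after the definition
    "mov edx, OFFSET later",      # OFFSET works regardless of order
    ".set later, 3",
]
for text, kind in classify_mov_operands(source):
    print(f"{kind:9}  {text}")
```

In this model the same source line `mov rdx, msg_size` is classified differently depending only on whether the `.set` came before it, which is the behaviour the question ran into.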

If you put a .set or msg_length=msg_end - msg directive at the top of your file, before the instructions that reference it, they would assemble to mov reg, imm32 mov-immediate. (Unlike in AT&T syntax where you always need a $ for an immediate even for numeric literals like 1234.)

For example: source and disassembly interleaved with objdump -dS:
Assembled with gcc -g -c foo.s and disassembled with objdump -drwC -S -Mintel foo.o (with as --version = GNU assembler (GNU Binutils) 2.34). We get this:

0000000000000000 <l1>:
.intel_syntax noprefix

l1:     
mov eax, OFFSET equsym
   0:   b8 01 00 00 00          mov    eax,0x1
mov eax, equsym            #### treated as a load
   5:   8b 04 25 01 00 00 00    mov    eax,DWORD PTR ds:0x1
mov rax, big               #### 32-bit sign-extended absolute load address, even though the constant was unsigned positive
   c:   48 8b 04 25 aa aa aa aa         mov    rax,QWORD PTR ds:0xffffffffaaaaaaaa
mov rdi, OFFSET label
  14:   48 c7 c7 00 00 00 00    mov    rdi,0x0  17: R_X86_64_32S        .text+0x1b

000000000000001b <label>:

label:
nop
  1b:   90                      nop

.equ equsym, . - label            # equsym = 1
big = 0xaaaaaaaa

mov eax, OFFSET equsym
  1c:   b8 01 00 00 00          mov    eax,0x1
mov eax, equsym           #### treated as an immediate
  21:   b8 01 00 00 00          mov    eax,0x1
mov rax, big              #### constant doesn't fit in 32-bit sign extended, assembler can see it when picking encoding so it picks movabs imm64
  26:   48 b8 aa aa aa aa 00 00 00 00   movabs rax,0xaaaaaaaa

It's always safe to use mov edx, OFFSET msg_size to treat any symbol (or even a numeric literal) as an immediate regardless of how it was defined. So it's exactly like AT&T $ except that it's optional when GAS already knows the symbol value is just a number, not an address in some section. For consistency it's probably a good idea to always use OFFSET msg_size so your code doesn't change meaning if some future programmer moves code around so the data section and related directives are no longer first. (Including future you who's forgotten these strange details that are unlike most assemblers.)

BTW, .set is a synonym for .equ, and there's also symbol=value syntax for setting a value which is also synonymous to .set.

mov rdx, OFFSET symbol will assemble to mov r/m64, sign_extended_imm32. You don't want that for a small length (vastly less than 4GiB) unless it's a negative constant, not an address. You also don't want movabs r64, imm64 for addresses; that's inefficient.
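
The sign-extension behaviour seen with `big = 0xaaaaaaaa` in the listing above can be reproduced with a little arithmetic. The following is a minimal Python sketch; the helper names sign_extend_32 and fits_sign_extended_imm32 are made up for this illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def sign_extend_32(value):
    """Treat the low 32 bits of value as a signed field and widen to 64 bits."""
    low = value & 0xFFFFFFFF
    signed = low - (1 << 32) if low & 0x80000000 else low
    return signed & MASK64

def fits_sign_extended_imm32(value):
    """True if value is unchanged after passing through a sign-extended 32-bit field."""
    return sign_extend_32(value) == value & MASK64

# As a 32-bit absolute address, 0xaaaaaaaa has its top bit set and widens
# to a high address, matching the ds:0xffffffffaaaaaaaa in the disassembly.
print(hex(sign_extend_32(0xAAAAAAAA)))

# As a 64-bit constant it does not fit, so a 64-bit immediate is needed.
print(fits_sign_extended_imm32(0xAAAAAAAA))

# A small length such as 14 fits either way, and so does a negative constant.
print(fits_sign_extended_imm32(14))
print(fits_sign_extended_imm32(-1))
```

For comparison, the encodings in the listing above are 5 bytes for mov eax, imm32, 7 bytes for mov rdi, sign-extended imm32, and 10 bytes for movabs rax, imm64, which is why the 32-bit form is preferred whenever the value allows it.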

It's safe under GNU/Linux to write mov edx, OFFSET symbol in a position-dependent executable, and in fact you should always do that or use lea rdx, [rip + symbol], never a sign-extended 32-bit immediate, unless you're writing code that will be loaded into the high 2GB of virtual address space (e.g. a kernel). See: How to load address of function or label into register

See also 32-bit absolute addresses no longer allowed in x86-64 Linux? for more about PIE executables being the default in modern distros.

Tip: if you know the AT&T or NASM syntax for something, use that to produce the encoding you want and then disassemble with objdump -Mintel to find out the right syntax for .intel_syntax noprefix.

But that doesn't help here because disassembly will just show the numeric literal like mov edx, 123, not mov edx, OFFSET name_not_in_object_file. Looking at gcc -masm=intel compiler output can also help, but again compilers do their own constant-propagation instead of using symbols for assemble-time constants.

BTW, no open-source projects that I'm aware of contain GAS intel_syntax source code. If they use gas, they use AT&T syntax. Otherwise they use NASM/YASM. (You sometimes also see MSVC inline asm in open source projects).

This is a lot more artificial since you wouldn't normally do this with an integer constant that wasn't an address. I include it here just to show another facet of GAS's behaviour depending on a symbol being defined or not at a point during its 1 pass.

How do RIP-relative variable references like "[RIP + _a]" in x86-64 GAS Intel-syntax work? - [RIP + symbol] is interpreted as using relative addressing to reach symbol, not actually adding two addresses. But [RIP + 4] is taken literally, as an offset relative to the end of this instruction.

So again, it matters what GAS knows about a symbol when it reaches an instruction that references it, because it's 1-pass. If undefined, it assumes it's a normal symbol. If defined as a numeric value with no section associated, it works like a literal number.

_start:
foo=4
jmpq *foo(%rip)
jmpq *bar(%rip)
bar=4

That assembles with the first jump being the same as jmp *4(%rip), loading a pointer from 4 bytes past the end of the current instruction. But the 2nd jump uses a symbol relocation for bar, using a RIP-relative addressing mode to reach the absolute address of the symbol bar, whatever that may turn out to be.

0000000000000000 <.text>:
   0:   ff 25 04 00 00 00       jmp    QWORD PTR [rip+0x4]        # a <.text+0xa>
   6:   ff 25 00 00 00 00       jmp    QWORD PTR [rip+0x0]        # c <bar+0x8> 8: R_X86_64_PC32        *ABS*

After linking with ld foo.o, the executable has:

  401000:       ff 25 04 00 00 00       jmp    *0x4(%rip)        # 40100a <bar+0x401006>
  401006:       ff 25 f8 ef bf ff       jmp    *-0x401008(%rip)        # 4 <bar>
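
The two targets in the linked output can be checked by hand: a RIP-relative operand is the address of the next instruction plus a signed 32-bit little-endian displacement. Here is a minimal Python sketch of that arithmetic; the function name rip_relative_target is an assumption for illustration, and the byte values are copied from the disassembly above.

```python
def rip_relative_target(insn_addr, insn_len, disp32_bytes):
    """Address referenced by a RIP-relative operand.

    RIP points at the end of the current instruction, so the target is
    insn_addr + insn_len + displacement (signed, little-endian, 32 bits).
    """
    disp = int.from_bytes(disp32_bytes, "little", signed=True)
    return insn_addr + insn_len + disp

# First jump: the displacement is the literal 4 taken from foo=4.
first = rip_relative_target(0x401000, 6, bytes.fromhex("04000000"))

# Second jump: the linker chose a displacement that reaches absolute address 4.
second = rip_relative_target(0x401006, 6, bytes.fromhex("f8efbfff"))

print(hex(first), hex(second))
```

The first result is 4 bytes past the end of the first jump, while the second is the absolute value of bar itself, which illustrates the literal-offset versus reach-the-symbol distinction.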
