What was the original reason for the design of AT&T assembly syntax?


Question

When writing assembly for x86 or amd64, a programmer can use either "Intel" syntax (e.g. the nasm assembler) or "AT&T" syntax (e.g. the gas assembler). "Intel" syntax is more popular on Windows, while "AT&T" syntax is more popular on UNIX(-like) systems.

But the Intel and AMD manuals, that is, the manuals written by the creators of the chips, both use the "Intel" syntax.

I'm wondering, what was the original idea behind the design of the "AT&T" syntax? What was the benefit of moving away from the notation used by the creators of the processor?

Answer

UNIX was developed for a long time on the PDP-11, a 16-bit computer from DEC with a fairly simple instruction set. Nearly every instruction has two operands, each of which can use one of the following eight addressing modes, shown here in the MACRO-11 assembly language:

0n  Rn        register
1n  (Rn)      deferred
2n  (Rn)+     autoincrement
3n  @(Rn)+    autoincrement deferred
4n  -(Rn)     autodecrement
5n  @-(Rn)    autodecrement deferred
6n  X(Rn)     index
7n  @X(Rn)    index deferred

Immediates and direct addresses can be encoded by cleverly re-using some addressing modes on R7, the program counter:

27  #imm      immediate
37  @#imm     absolute
67  addr      relative
77  @addr     relative deferred

As the UNIX tty driver used @ and # as control characters, $ was substituted for # and * for @.

The first operand in a PDP-11 instruction word refers to the source operand, while the second operand refers to the destination. This is reflected in the assembly language's operand order, which is source, then destination. For example, the opcode

011203

refers to the instruction

mov (R2),R3

which moves the word pointed to by R2 to R3.
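The mode tables above can be turned into a small decoder. The following is a minimal sketch in Python, assuming the standard PDP-11 double-operand layout (one byte/word bit, a three-bit opcode where 1 means MOV, then a six-bit source field and a six-bit destination field). The function names `operand` and `decode_mov` are hypothetical and only MOV/MOVB are handled.

```python
# Sketch: decode a PDP-11 MOV/MOVB instruction word into assembler syntax.
# Function names are illustrative, not part of any real tool.

def operand(spec):
    """Render a 6-bit operand specifier (mode digit, then register digit)."""
    mode, reg = spec >> 3, spec & 7
    if reg == 7:
        # Modes on R7 (the program counter) double as immediates/addresses.
        special = {2: "#imm", 3: "@#imm", 6: "addr", 7: "@addr"}
        if mode in special:
            return special[mode]
    r = f"R{reg}"
    forms = [r, f"({r})", f"({r})+", f"@({r})+",
             f"-({r})", f"@-({r})", f"X({r})", f"@X({r})"]
    return forms[mode]

def decode_mov(word):
    """Decode a 16-bit MOV or MOVB instruction word."""
    byte_flag = word >> 15        # top bit selects the byte variant
    opcode = (word >> 12) & 7     # 1 means MOV
    if opcode != 1:
        raise ValueError("not a MOV/MOVB instruction")
    src = (word >> 6) & 0o77      # source specifier comes first
    dst = word & 0o77             # destination specifier comes second
    name = "movb" if byte_flag else "mov"
    return f"{name} {operand(src)},{operand(dst)}"

print(decode_mov(0o010001))  # mov R0,R1
print(decode_mov(0o012700))  # mov #imm,R0
```

Reading the two specifier fields left to right gives the source, then destination order directly, which is the point the answer makes about the assembler syntax following the encoding.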

This syntax was adapted to the 8086 CPU and its addressing modes:

mr0 X(bx,si)  bx + si indexed
mr1 X(bx,di)  bx + di indexed
mr2 X(bp,si)  bp + si indexed
mr3 X(bp,di)  bp + di indexed
mr4 X(si)     si indexed
mr5 X(di)     di indexed
mr6 X(bp)     bp indexed
mr7 X(bx)     bx indexed
3rR R         register
0r6 addr      direct

Here m is 0 if there is no index, 1 if there is a one-byte index, 2 if there is a two-byte index, and 3 if a register is used instead of a memory operand. If the instruction has two operands, the other operand is always a register and is encoded in the r digit. Otherwise, r encodes another three bits of the opcode.
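As a rough illustration of the table, the sketch below splits a ModRM byte into its m, r and register/memory octal digits and renders the operand in the notation used above. It assumes word-sized register operands and the standard 8086 register numbering (ax, cx, dx, bx, sp, bp, si, di). The name `decode_modrm` is hypothetical.

```python
# Sketch: decode an 8086 ModRM byte using the m/r/rm octal digits.
# Assumes 16-bit (word) register operands; names are illustrative.

RM_FORMS = ["X(bx,si)", "X(bx,di)", "X(bp,si)", "X(bp,di)",
            "X(si)", "X(di)", "X(bp)", "X(bx)"]
WORD_REGS = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]

def decode_modrm(byte):
    """Return (m, r, operand text, number of index bytes that follow)."""
    m, r, rm = byte >> 6, (byte >> 3) & 7, byte & 7
    if m == 3:
        # Register operand instead of memory.
        return m, r, WORD_REGS[rm], 0
    if m == 0 and rm == 6:
        # The slot that would be "(bp)" without index means a direct address.
        return m, r, "addr", 2
    form = RM_FORMS[rm]
    # With m == 0 there is no index, so drop the leading X.
    return m, r, (form[1:] if m == 0 else form), m

print(decode_modrm(0o330))  # (3, 3, 'ax', 0)
print(decode_modrm(0o106))  # (1, 0, 'X(bp)', 1)
print(decode_modrm(0o006))  # (0, 0, 'addr', 2)
```

Writing the byte in octal makes the three fields line up with the digits of the table, which is why the table uses that notation.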

Immediates aren't possible in this addressing scheme; every instruction that takes an immediate encodes that fact in its opcode. Immediates are spelled $imm, just like in the PDP-11 syntax.

While Intel always used a dst, src operand ordering for its assembler, there was no particularly compelling reason to adopt this convention, so the UNIX assembler was written to use the src, dst operand ordering known from the PDP-11.

The implementation of the 8087 floating-point instructions introduced some inconsistencies with this ordering, possibly because Intel gave the two possible directions of non-commutative floating-point instructions different mnemonics, which do not match the operand ordering used by AT&T syntax.

The PDP-11 instructions jmp (jump) and jsr (jump to subroutine) jump to the address of their operand rather than to its contents, similar to how lea works on the 8086. Thus, jmp foo jumps to foo, and jmp *foo jumps to the address stored in the variable foo.

The syntax for the x86's jmp and call instructions was designed as if these instructions worked as they do on the PDP-11, which is why jmp foo jumps to foo and jmp *foo jumps to the value stored at address foo, even though the 8086 doesn't actually have deferred addressing. This has the convenience of syntactically distinguishing direct jumps from indirect jumps without requiring a $ prefix on every direct jump target, but it doesn't make a lot of sense logically.

The syntax was expanded to specify segment prefixes using a colon:

seg:addr

When the 80386 was introduced, this scheme was adapted to its new SIB addressing modes using a four-part generic addressing mode:

disp(base,index,scale)

where disp is a displacement, base is a base register, index is an index register, and scale is 1, 2, 4, or 8, the factor by which the index register is multiplied. This is equivalent to the Intel syntax:

[disp+base+index*scale]
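The correspondence between the two notations is mechanical, as the following Python sketch suggests. It is only an illustration under simplifying assumptions: it handles a single memory operand of the form disp(base,index,scale), strips an optional % register prefix, and ignores segment overrides. The name `att_to_intel` is hypothetical. It emits the displacement last, which denotes the same address since the terms are simply added.

```python
import re

# Sketch: rewrite an AT&T memory operand disp(base,index,scale) in Intel
# bracket notation. Illustrative only; segment prefixes are not handled.
_ATT = re.compile(
    r"^(?P<disp>[^()]*)"
    r"\((?P<base>[^,()]*)"
    r"(?:,(?P<index>[^,()]*)(?:,(?P<scale>[1248]))?)?\)$"
)

def _reg(name):
    """Drop the optional % prefix; a missing field becomes an empty string."""
    return (name or "").lstrip("%")

def att_to_intel(operand):
    m = _ATT.match(operand.replace(" ", ""))
    if m is None:
        raise ValueError(f"not a disp(base,index,scale) operand: {operand!r}")
    parts = []
    base, index = _reg(m["base"]), _reg(m["index"])
    if base:
        parts.append(base)
    if index:
        scale = m["scale"] or "1"   # an omitted scale means 1
        parts.append(index if scale == "1" else f"{index}*{scale}")
    if m["disp"]:
        parts.append(m["disp"])
    # Avoid "+-4" when the displacement is negative.
    return "[" + "+".join(parts).replace("+-", "-") + "]"

print(att_to_intel("8(%ebx,%esi,4)"))   # [ebx+esi*4+8]
print(att_to_intel("-4(%ebp)"))         # [ebp-4]
print(att_to_intel("table(,%ecx,8)"))   # [ecx*8+table]
```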

Another remarkable feature of the PDP-11 is that most instructions are available in a byte and a word variant. Which one you use is indicated by a b or w suffix to the opcode, which directly toggles the first bit of the opcode:

 010001   movw r0,r1
 110001   movb r0,r1

This was also adapted for AT&T syntax, as most 8086 instructions are indeed available in both a byte mode and a word mode. Later, the 80386 introduced 32-bit instructions (suffixed l for long) and the AMD K8 introduced 64-bit instructions (suffixed q for quad).
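Both halves of this convention are easy to sketch in Python: on the PDP-11, the byte variant differs from the word variant only in the top bit of the 16-bit instruction word, and in AT&T syntax the operand size is appended to the mnemonic as a one-letter suffix. The helper names below are hypothetical.

```python
# Sketch: the PDP-11 byte/word bit and the AT&T size suffixes.
# Helper names are illustrative only.

BYTE_BIT = 0o100000  # top bit of a 16-bit PDP-11 instruction word

def to_byte_variant(word_opcode):
    """Turn the word form of a PDP-11 instruction into its byte form."""
    return word_opcode | BYTE_BIT

# Operand size in bits mapped to the AT&T mnemonic suffix.
SUFFIXES = {8: "b", 16: "w", 32: "l", 64: "q"}

def sized_mnemonic(base, bits):
    """Append the AT&T size suffix to a base mnemonic such as 'mov'."""
    return base + SUFFIXES[bits]

print(oct(to_byte_variant(0o010001)))  # 0o110001
print(sized_mnemonic("mov", 32))       # movl
```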

Last but not least, originally the convention was to prefix C language symbols with an underscore (as is still done on Windows) so you can distinguish a C function named ax from the register ax. When Unix System Laboratories developed the ELF binary format, they decided to get rid of this decoration. As there is no way to distinguish a direct address from a register otherwise, a % prefix was added to every register:

mov direct,%eax # move memory at direct to %eax

And that's how we got today's AT&T syntax.
