What was the original reason for the design of AT&T assembly syntax?
Problem description
When using assembly instructions on x86 or amd64, a programmer can use either "Intel" syntax (e.g. the nasm assembler) or "AT&T" syntax (e.g. the gas assembler). "Intel" syntax is more popular on Windows, while "AT&T" syntax is more popular on UNIX(-like) systems.
But both the Intel and the AMD manuals, i.e. the manuals written by the creators of the chips, use the "Intel" syntax.
I'm wondering, what was the original idea behind the design of the "AT&T" syntax? What was the benefit of moving away from the notation used by the creators of the processor?
UNIX was for a long time developed on the PDP-11, a 16-bit computer from DEC with a fairly simple instruction set. Nearly every instruction has two operands, each of which can use one of the following eight addressing modes, shown here in the MACRO-11 assembly language:
0n Rn register
1n (Rn) deferred
2n (Rn)+ autoincrement
3n @(Rn)+ autoincrement deferred
4n -(Rn) autodecrement
5n @-(Rn) autodecrement deferred
6n X(Rn) index
7n @X(Rn) index deferred
Immediates and direct addresses can be encoded by cleverly re-using some addressing modes on R7, the program counter:
27 #imm immediate
37 @#imm absolute
67 addr relative
77 @addr relative deferred
As the UNIX tty driver used @ and # as control characters, $ was substituted for # and * for @.
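The character substitution can be sketched as a simple translation table. This is only an illustration of the spelling change; the function name dec_to_unix is hypothetical and not part of any assembler.

```python
# DEC notation -> UNIX assembler notation: '#' becomes '$', '@' becomes '*'.
DEC_TO_UNIX = str.maketrans({"#": "$", "@": "*"})

def dec_to_unix(operand):
    """Rewrite a DEC-style PDP-11 operand in the UNIX spelling (hypothetical helper)."""
    return operand.translate(DEC_TO_UNIX)

print(dec_to_unix("#42"))     # immediate
print(dec_to_unix("@(R3)+"))  # autoincrement deferred
print(dec_to_unix("@#1000"))  # absolute
```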
The first operand in a PDP11 instruction word refers to the source operand while the second operand refers to the destination. This is reflected in the assembly language's operand order, which is source, then destination. For example, the opcode
011203
refers to the instruction
mov (R2),R3
which moves the word pointed to by R2 to R3.
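To make the source-then-destination layout concrete, here is a minimal sketch of an encoder for double-operand PDP-11 instructions. It assumes the usual layout of a 4-bit opcode followed by two 6-bit mode/register fields, and it only handles the register-based modes 0 to 5 from the table above (no index words or PC-relative forms). The names OPCODES, encode_operand and encode are made up for this illustration. Under this layout, mov (R2),R3 splits into the octal fields 01 (mov), 12 (deferred, R2) and 03 (register, R3).

```python
import re

# Opcode field (top four bits) of a few double-operand instructions.
# The byte variant sets the most significant bit: mov = 01, movb = 11 (octal).
OPCODES = {"mov": 0o01, "movb": 0o11, "cmp": 0o02}

# (pattern, mode number) for the addressing modes that need no extra word.
MODES = [
    (r"R([0-7])", 0),          # Rn      register
    (r"\(R([0-7])\)", 1),      # (Rn)    deferred
    (r"\(R([0-7])\)\+", 2),    # (Rn)+   autoincrement
    (r"@\(R([0-7])\)\+", 3),   # @(Rn)+  autoincrement deferred
    (r"-\(R([0-7])\)", 4),     # -(Rn)   autodecrement
    (r"@-\(R([0-7])\)", 5),    # @-(Rn)  autodecrement deferred
]

def encode_operand(text):
    """Return the 6-bit field: three bits of mode, three bits of register."""
    for pattern, mode in MODES:
        match = re.fullmatch(pattern, text.strip(), flags=re.IGNORECASE)
        if match:
            return (mode << 3) | int(match.group(1))
    raise ValueError(f"unsupported operand: {text!r}")

def encode(mnemonic, src, dst):
    """Assemble one instruction word and return it as six octal digits."""
    word = (OPCODES[mnemonic] << 12) | (encode_operand(src) << 6) | encode_operand(dst)
    return f"{word:06o}"

print(encode("mov", "(R2)", "R3"))
print(encode("movb", "r0", "r1"))
```

Because each field is a multiple of three bits wide, the octal digits of the result line up directly with the mode and register numbers, which is why PDP-11 opcodes are traditionally written in octal.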
This syntax was adapted to the 8086 CPU and its addressing modes:
mr0 X(bx,si) bx + si indexed
mr1 X(bx,di) bx + di indexed
mr2 X(bp,si) bp + si indexed
mr3 X(bp,di) bp + di indexed
mr4 X(si) si indexed
mr5 X(di) di indexed
mr6 X(bp) bp indexed
mr7 X(bx) bx indexed
3rR R register
0r6 addr direct
Where m is 0 if there is no displacement X, 1 if there is a one-byte displacement, 2 if there is a two-byte displacement, and 3 if a register is used instead of a memory operand. If two operands exist, the other operand is always a register and is encoded in the r digit. Otherwise, r encodes another three bits of the opcode.
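As a rough sketch of how the three octal digits map onto an operand, the following decoder splits a 16-bit-mode addressing byte into m, r and the register/memory digit and prints the operand in the spelling used in the table above (without the % prefix that was added later). The function name decode_modrm and its disp parameter are assumptions made for this example; a real disassembler would read the displacement bytes from the instruction stream.

```python
# Last octal digit -> registers used for a memory operand (16-bit addressing).
RM_BASES = ["bx,si", "bx,di", "bp,si", "bp,di", "si", "di", "bp", "bx"]
# Last octal digit -> register, when m is 3.
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]

def decode_modrm(byte, disp=0):
    """Return (operand text, r digit) for one addressing byte.

    disp is the displacement or direct address that would follow the byte.
    """
    m = (byte >> 6) & 3   # first octal digit
    r = (byte >> 3) & 7   # second octal digit: other register or opcode bits
    rm = byte & 7         # third octal digit
    if m == 3:
        operand = REG16[rm]               # 3rR: register operand
    elif m == 0 and rm == 6:
        operand = f"{disp:#x}"            # 0r6: direct address
    else:
        prefix = "" if m == 0 else f"{disp:#x}"
        operand = f"{prefix}({RM_BASES[rm]})"
    return operand, r

print(decode_modrm(0o302))          # register operand dx
print(decode_modrm(0o106, 4))       # one-byte displacement off bp
print(decode_modrm(0o006, 0x1234))  # direct address
```

Note how the 0r6 slot, which would otherwise mean (bp) with no displacement, is the one repurposed for direct addresses.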
Immediates aren't possible in this addressing scheme; all instructions that take immediates encode that fact in their opcode. Immediates are spelled $imm, just like in the PDP-11 syntax.
While Intel always used a dst, src operand ordering for its assembler, there was no particularly compelling reason to adopt this convention, and the UNIX assembler was written to use the src, dst operand ordering known from the PDP11.
They introduced some inconsistencies with this ordering in their implementation of the 8087 floating point instructions, possibly because Intel gave the two possible directions of non-commutative floating point instructions different mnemonics, which do not match the operand ordering used by AT&T's syntax.
The PDP11 instructions jmp (jump) and jsr (jump to subroutine) jump to the address of their operand rather than to its contents, similar to how lea works on the 8086. Thus, jmp foo would jump to foo and jmp *foo would jump to the address stored in the variable foo.
The syntax for the x86's jmp and call instructions was designed as if these instructions worked like on the PDP11, which is why jmp foo jumps to foo and jmp *foo jumps to the value at address foo, even though the 8086 doesn't actually have deferred addressing. This has the advantage and convenience of syntactically distinguishing direct jumps from indirect jumps without requiring a $ prefix for every direct jump target, but doesn't make a lot of sense logically.
The syntax was expanded to specify segment prefixes using a colon:
seg:addr
When the 80386 was introduced, this scheme was adapted to its new SIB addressing modes using a four-part generic addressing mode:
disp(base,index,scale)
where disp is a displacement, base is a base register, index is an index register, and scale is 1, 2, 4, or 8 and scales the index register by that amount. This is equivalent to the Intel syntax:
[disp+base+index*scale]
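The correspondence between the two notations can be sketched as a small converter. This is a simplified illustration, not how any real assembler parses operands: it ignores segment prefixes, and the names ATT_OPERAND and att_to_intel are invented for the example.

```python
import re

# disp(base,index,scale), where every part except the parentheses is optional.
ATT_OPERAND = re.compile(
    r"(?P<disp>[^()]*)"              # optional displacement
    r"\((?P<base>[^,()]*)"           # optional base register
    r"(?:,(?P<index>[^,()]*)"        # optional index register
    r"(?:,(?P<scale>[1248]))?)?\)"   # optional scale factor
)

def att_to_intel(operand):
    """Rewrite an AT&T memory operand as [disp+base+index*scale]."""
    match = ATT_OPERAND.fullmatch(operand.strip())
    if match is None:
        raise ValueError(f"not an AT&T memory operand: {operand!r}")
    disp = match["disp"]
    base = match["base"].lstrip("%")
    index = (match["index"] or "").lstrip("%")
    scale = match["scale"] or "1"
    terms = []
    if disp:
        terms.append(disp)
    if base:
        terms.append(base)
    if index:
        terms.append(index if scale == "1" else f"{index}*{scale}")
    return "[" + "+".join(terms) + "]"

print(att_to_intel("8(%ebx,%esi,4)"))
print(att_to_intel("(,%ecx,8)"))
```

The sketch keeps the terms in the same order as the AT&T operand, so the output matches the [disp+base+index*scale] form given above rather than the base-first order many Intel-syntax listings prefer.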
Another remarkable feature of the PDP-11 is that most instructions are available in a byte and a word variant. Which one you use is indicated by a b or w suffix to the opcode, which directly toggles the first bit of the opcode:
010001 movw r0,r1
110001 movb r0,r1
This was also adapted for AT&T syntax, as most 8086 instructions are indeed also available in a byte mode and a word mode. Later, the 80386 and the AMD K8 introduced 32-bit instructions (suffixed l for long) and 64-bit instructions (suffixed q for quad), respectively.
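A minimal sketch of how the size suffixes could be interpreted is shown below. The set BASE_MNEMONICS is a tiny hand-picked sample, and split_suffix is a hypothetical helper; the point is only that a suffix cannot be stripped blindly, because mnemonics such as call or mul already end in one of the suffix letters.

```python
# Operand size in bits for each AT&T size suffix.
SUFFIX_BITS = {"b": 8, "w": 16, "l": 32, "q": 64}

# A small sample of unsuffixed mnemonics (illustrative, far from complete).
BASE_MNEMONICS = {"mov", "add", "sub", "cmp", "push", "pop", "call", "mul"}

def split_suffix(mnemonic):
    """Return (base mnemonic, operand size in bits or None if unsuffixed)."""
    if mnemonic in BASE_MNEMONICS:
        return mnemonic, None
    base, suffix = mnemonic[:-1], mnemonic[-1:]
    if base in BASE_MNEMONICS and suffix in SUFFIX_BITS:
        return base, SUFFIX_BITS[suffix]
    raise ValueError(f"unknown mnemonic: {mnemonic!r}")

print(split_suffix("movl"))   # 32-bit move
print(split_suffix("mul"))    # no suffix, even though it ends in 'l'
print(split_suffix("callq"))  # 64-bit call
```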
Last but not least, the original convention was to prefix C language symbols with an underscore (as is still done on Windows) so that you can distinguish a C function named ax from the register ax. When Unix System Laboratories developed the ELF binary format, they decided to get rid of this decoration. As there is otherwise no way to distinguish a direct address from a register, a % prefix was added to every register:
mov direct,%eax # move memory at direct to %eax
And that's how we got today's AT&T syntax.