Printing an integer as a string with AT&T syntax, with Linux system calls instead of printf


Problem description


I have written an assembly program in AT&T syntax to display the factorial of a number, but it's not working. Here is my code:

.text 

.globl _start

_start:
movq $5,%rcx
movq $5,%rax


Repeat:                     #function to calculate factorial
   decq %rcx
   cmp $0,%rcx
   je print
   imul %rcx,%rax
   cmp $1,%rcx
   jne Repeat
# Now result of factorial stored in rax
print:
     xorq %rsi, %rsi

  # function to print integer result digit by digit by pushing in 
       #stack
  loop:
    movq $0, %rdx
    movq $10, %rbx
    divq %rbx
    addq $48, %rdx
    pushq %rdx
    incq %rsi
    cmpq $0, %rax
    jz   next
    jmp loop

  next:
    cmpq $0, %rsi
    jz   bye
    popq %rcx
    decq %rsi
    movq $4, %rax
    movq $1, %rbx
    movq $1, %rdx
    int  $0x80
    addq $4, %rsp
    jmp  next
bye:
movq $1,%rax
movq $0, %rbx
int  $0x80


.data
   num : .byte 5

This program prints nothing. I also stepped through it in gdb: it works fine up to the loop label, but once it reaches next, seemingly random values start showing up in various registers. Please help me debug it so that it prints the factorial.

Solution

As @ped7g points out, you're doing several things wrong: using the int 0x80 32-bit ABI in 64-bit code, and passing character values instead of pointers to the write() system call.
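
To make the second mistake concrete, here is a rough C sketch (not part of the original answer). write() expects the address of the bytes to print, so even a single character has to sit in memory first, and you pass a pointer to it. The helper name put_char_fd is hypothetical.

```c
#include <unistd.h>

/* Hypothetical helper: print one character to fd.
 * The second argument of write() is a pointer to the data,
 * not the character value itself. */
ssize_t put_char_fd(int fd, char c)
{
    return write(fd, &c, 1);   /* &c: address of the byte, length 1 */
}
```

The question's code instead pops the digit into a register and hands that value to the kernel as if it were an address, which is why nothing useful gets printed.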

Here's how to print an integer in x86-64 Linux, the simple and somewhat-efficient way (see footnote 1), using the same repeated division / modulo by 10.

System calls are expensive (probably thousands of cycles for write(1, buf, 1)), and doing a syscall inside the loop steps on registers so it's inconvenient and clunky as well as inefficient. We should write the characters into a small buffer, in printing order (most-significant digit at the lowest address), and make a single write() system call on that.

But then we need a buffer. A 64-bit unsigned integer is at most 20 decimal digits long, so we can just use some stack space. In x86-64 Linux, we can use stack space below RSP (up to 128B) without "reserving" it by modifying RSP. This is called the red zone. If you wanted to pass the buffer to another function instead of a syscall, you would have to reserve space with sub $24, %rsp or something.
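
As a rough illustration of the buffer idea in C (a sketch only, with an ordinary local array standing in for the red zone): the digits are stored backwards from the end of a 21-byte buffer (20 digits plus the newline), and the result goes out with a single write(). The names format_uint64_backward and print_uint64_c are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: store '\n' and then the decimal digits of value,
 * working backwards from buf_end. Returns a pointer to the first digit. */
char *format_uint64_backward(uint64_t value, char *buf_end)
{
    char *p = buf_end;
    *--p = '\n';
    do {                                   /* runs at least once, so 0 prints as "0" */
        *--p = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);
    return p;
}

/* Print value and a newline with one write() call on stdout. */
ssize_t print_uint64_c(uint64_t value)
{
    char buf[21];                          /* 20 digits max + newline */
    char *end = buf + sizeof buf;
    char *start = format_uint64_backward(value, end);
    return write(1, start, (size_t)(end - start));
}
```

Because the least-significant digit is produced first, filling from the end leaves the string in printing order without a reversal pass.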

Instead of hard-coding system-call numbers, using GAS makes it easy to use the constants defined in .h files. Note the mov $__NR_write, %eax near the end of the function. The x86-64 SystemV ABI passes system-call arguments in similar registers to the function-calling convention. (So it's totally different from the 32-bit int 0x80 ABI, which you shouldn't use in 64-bit code.)

// building with  gcc foo.S  will use CPP before GAS so we can use headers
#include <asm/unistd.h>    // This is a standard Linux / glibc header file
      // includes unistd_64.h or unistd_32.h depending on current mode
      // Contains only #define constants (no C prototypes) so we can include it from asm without syntax errors.

.p2align 4
.globl print_uint64             #void print_uint64(uint64_t value)
print_uint64:
    lea   -1(%rsp), %rsi        # We use the 128B red-zone as a buffer to hold the string
                                # a 64-bit integer is at most 20 digits long in base 10, so it fits.

    movb  $'\n', (%rsi)         # store the trailing newline byte.  (Right below the return address).
    # If you need a null-terminated string, leave an extra byte of room and store '\n\0'.  Or  push $'\n'

    mov    $10, %ecx            # same as  mov $10, %rcx  but 2 bytes shorter
    # note that newline (\n) has ASCII code 10, so we could actually have stored the newline with  movb %cl, (%rsi) to save code size.

    mov    %rdi, %rax           # function arg arrives in RDI; we need it in RAX for div
.Ltoascii_digit:                # do{
    xor    %edx, %edx
    div    %rcx                  #  rax = rdx:rax / 10.  rdx = remainder

                                 # store digits in MSD-first printing order, working backwards from the end of the string
    add    $'0', %edx            # integer to ASCII.  %dl would work, too, since we know this is 0-9
    dec    %rsi
    mov    %dl, (%rsi)           # *--p = (value%10) + '0';

    test   %rax, %rax
    jnz  .Ltoascii_digit        # } while(value != 0)
    # If we used a loop-counter to print a fixed number of digits, we would get leading zeros
    # The do{}while() loop structure means the loop runs at least once, so we get "0\n" for input=0

    # Then print the whole string with one system call
    mov   $__NR_write, %eax     # call number from asm/unistd_64.h
    mov   $1, %edi              # fd=1
    # %rsi = start of the buffer
    mov   %rsp, %rdx
    sub   %rsi, %rdx            # length = one_past_end - start
    syscall                     # write(fd=1 /*rdi*/, buf /*rsi*/, length /*rdx*/); 64-bit ABI
    # rax = return value (or -errno)
    # rcx and r11 = garbage (destroyed by syscall/sysret)
    # all other registers = unmodified (saved/restored by the kernel)

    # we don't need to restore any registers, and we didn't modify RSP.
    ret
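
For comparison, the same "named syscall number, arguments in registers" idea can be sketched from C. This assumes Linux with glibc, whose syscall() wrapper and SYS_write constant (from <sys/syscall.h>) are used here; raw_write is a hypothetical name.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_write, defined in terms of __NR_write */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: invoke the write system call by number.
 * glibc's syscall() loads the number and arguments into the registers
 * the kernel expects, and returns -1 with errno set on failure. */
long raw_write(int fd, const void *buf, unsigned long len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

Unlike the raw syscall instruction, which returns -errno in RAX, the glibc wrapper converts errors into a -1 return value.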

To test this function, I put this in the same file to call it and exit:

.p2align 4
.globl _start
_start:
    mov    $10120123425329922, %rdi
#    mov    $0, %edi    # Yes, it does work with input = 0
    call   print_uint64

    xor    %edi, %edi
    mov    $__NR_exit, %eax
    syscall                             # sys_exit(0)

I built this into a static binary (with no libc):

$ gcc -Wall -static -nostdlib print-integer.S && ./a.out 
10120123425329922
$ strace ./a.out  > /dev/null
execve("./a.out", ["./a.out"], 0x7fffcb097340 /* 51 vars */) = 0
write(1, "10120123425329922\n", 18)     = 18
exit(0)                                 = ?
+++ exited with 0 +++
$ file ./a.out 
./a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=69b865d1e535d5b174004ce08736e78fade37d84, not stripped


Footnote 1: See Why does GCC use multiplication by a strange number in implementing integer division? for avoiding div r64 for division by 10, because that's very slow (21 to 83 cycles on Intel Skylake). A multiplicative inverse would make this function actually efficient, not just "somewhat". (But of course there'd still be room for optimizations...)
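
A minimal sketch of the multiplicative-inverse trick in C, assuming a compiler that supports the unsigned __int128 extension (GCC and Clang do on x86-64). The function name div10_mul is hypothetical. The constant is ceil(2^67 / 10), so taking the high 64 bits of the product and shifting right by 3 gives the quotient.

```c
#include <stdint.h>

/* Hypothetical helper: value / 10 without a div instruction.
 * high64(value * ceil(2^67 / 10)) >> 3 equals value / 10
 * for every 64-bit unsigned value. */
uint64_t div10_mul(uint64_t value)
{
    const uint64_t magic = UINT64_C(0xCCCCCCCCCCCCCCCD);   /* ceil(2^67 / 10) */
    unsigned __int128 product = (unsigned __int128)value * magic;
    return (uint64_t)(product >> 64) >> 3;
}
```

The remainder then comes from value - 10 * quotient, which costs a multiply and a subtract instead of a second division.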



Related: Linux x86-32 extended-precision loop that prints 9 decimal digits from each 32-bit "limb": see .toascii_digit: in my Extreme Fibonacci code-golf answer. It's optimized for code-size (even at the expense of speed), but well-commented.

It uses div like you do, because that's smaller than using a fast multiplicative inverse. It uses loop for the outer loop (over multiple integers for extended precision), again for code-size at the cost of speed.

It uses the 32-bit int 0x80 ABI, and prints into a buffer that was holding the "old" Fibonacci value, not the current.


Another way to get efficient asm is from a C compiler. For just the loop over digits, look at what gcc or clang produce for this C source (which is basically what the asm is doing). The Godbolt Compiler explorer makes it easy to try with different options and different compiler versions.

See gcc7.2 -O3 asm output which is nearly a drop-in replacement for the loop in print_uint64 (because I chose the args to go in the same registers):

void itoa_end(unsigned long val, char *p_end) {
  const unsigned base = 10;
  do {
    *--p_end = (val % base) + '0';
    val /= base;
  } while(val);

  // write(1, p_end, orig-current);
}

I tested performance on a Skylake i7-6700k by commenting out the syscall instruction and putting a repeat loop around the function call. The version with mul %rcx / shr $3, %rdx is about 5 times faster than the version with div %rcx for storing a long number-string (10120123425329922) into a buffer. The div version ran at 0.25 instructions per clock, while the mul version ran at 2.65 instructions per clock (although requiring many more instructions).

It might be worth unrolling by 2, and doing a divide by 100 and splitting up the remainder of that into 2 digits. That would give a lot better instruction-level parallelism, in case the simpler version bottlenecks on mul + shr latency. The chain of multiply/shift operations that brings val to zero would be half as long, with more work in each short independent dependency chain to handle a 0-99 remainder.
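
One way the unroll-by-2 idea might look in C, as an untested-for-speed sketch rather than a measured implementation: each iteration divides by 100 and turns the 0-99 remainder into two digits. The name itoa_end_by100 is hypothetical, and it returns the start pointer, unlike itoa_end above.

```c
#include <stdint.h>

/* Hypothetical variant of itoa_end: two digits per loop iteration.
 * Stores digits backwards from p_end and returns a pointer to the first one. */
char *itoa_end_by100(uint64_t val, char *p_end)
{
    while (val >= 100) {
        unsigned rem = (unsigned)(val % 100);   /* 0..99 */
        val /= 100;
        *--p_end = (char)('0' + rem % 10);      /* low digit of the pair */
        *--p_end = (char)('0' + rem / 10);      /* high digit of the pair */
    }
    /* val is now 0..99: emit one or two digits, with no leading zero */
    if (val >= 10) {
        *--p_end = (char)('0' + val % 10);
        val /= 10;
    }
    *--p_end = (char)('0' + val);
    return p_end;
}
```

The split of rem into two digits does not depend on the next val / 100, so those short chains can overlap with the long one that drives val to zero.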


