Hello, world in assembly language with Linux system calls?


Question


  1. I know that int 0x80 generates an interrupt in Linux, but I don't understand how this code works. Does it return something?

  2. What does $ - msg stand for?

global _start

section .data
    msg db "Hello, world!", 0x0a
    len equ $ - msg

section .text
_start:
    mov eax, 4
    mov ebx, 1
    mov ecx, msg
    mov edx, len
    int 0x80 ;What is this?
    mov eax, 1
    mov ebx, 0
    int 0x80 ;and what is this?

Answer

The Q&A "How does $ work in NASM, exactly?" explains how $ - msg gets NASM to calculate the string length as an assemble-time constant for you, instead of hard-coding it.
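As a rough analogy in C (not NASM, and not part of the original answer), sizeof applied to a character array gives the same kind of build-time length. The names msg, MSG_LEN and msg_len below are illustrative assumptions that mirror the question's code.

```c
#include <stddef.h>

/* Same bytes as the NASM line:  msg db "Hello, world!", 0x0a */
const char msg[] = "Hello, world!\n";

/* sizeof counts the terminating 0 byte that C appends, so subtract 1.
 * Like `len equ $ - msg`, this is a constant fixed at build time;
 * no strlen() call runs when the program executes. */
enum { MSG_LEN = sizeof msg - 1 };

/* Hypothetical helper so other code can read the constant. */
size_t msg_len(void)
{
    return MSG_LEN;
}
```

If the string is edited, the length follows automatically, which is the same benefit $ - msg gives in the asm source.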


I originally wrote the rest of this for SO Docs (topic ID: 1164, example ID: 19078), rewriting a basic less-well-commented example by @runner. This looks like a better place to put it than as part of my answer to another question where I had previously moved it after the SO docs experiment ended.


Making a system call is done by putting arguments into registers, then running int 0x80 (32-bit mode) or syscall (64-bit mode). See "What are the calling conventions for UNIX & Linux system calls on i386 and x86-64" and "The Definitive Guide to Linux System Calls".

Think of int 0x80 as a way to "call" into the kernel, across the user/kernel privilege boundary. The kernel does stuff according to the values that were in registers when int 0x80 executed, then eventually returns. The return value is in EAX.

When execution reaches the kernel's entry point, it looks at EAX and dispatches to the right system call based on the call number in EAX. Values from other registers are passed as function args to the kernel's handler for that system call. (e.g. eax=4 / int 0x80 will get the kernel to call its sys_write kernel function, implementing the POSIX write system call.)

See also "What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?". That answer includes a look at the asm in the kernel entry point that is "called" by int 0x80. (This also applies to 32-bit user-space, not just to 64-bit code, where you shouldn't use int 0x80.)


If you don't already know low-level Unix systems programming, you might want to just write functions in asm that take args and return a value (or update arrays via a pointer arg) and call them from C or C++ programs. Then you can just worry about learning how to handle registers and memory, without also learning the POSIX system-call API and the ABI for using it. That also makes it very easy to compare your code with compiler output for a C implementation. Compilers usually do a pretty good job at making efficient code, but are rarely perfect.

libc provides wrapper functions for system calls, so compiler-generated code would call write rather than invoking it directly with int 0x80 (or if you care about performance, sysenter). (In x86-64 code, use syscall for the 64-bit ABI.) See also syscalls(2).

System calls are documented in section 2 manual pages, like write(2). See the NOTES section for differences between the libc wrapper function and the underlying Linux system call. Note that the wrapper for sys_exit is _exit(2), not the exit(3) ISO C function, which flushes stdio buffers and does other cleanup first. There's also an exit_group system call that ends all threads. exit(3) actually uses that, because there's no downside in a single-threaded process.
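The difference between _exit(2) and exit(3) can be observed from C. The sketch below is an illustration under assumptions (Linux or another POSIX system with fork and pipe; bytes_seen_after is a hypothetical helper name): a child process writes 8 bytes with printf into a pipe and then leaves either way, and the parent counts what actually arrived.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo helper: fork a child that printf()s 8 bytes into a pipe,
 * then leaves via _exit() (use_raw_exit != 0) or exit() (use_raw_exit == 0).
 * Returns how many bytes reached the pipe, or -1 if the setup failed. */
long bytes_seen_after(int use_raw_exit)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    fflush(stdout);                  /* don't let the child inherit pending output */
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {                  /* child */
        close(fds[0]);
        if (fds[1] != STDOUT_FILENO) {
            dup2(fds[1], STDOUT_FILENO);
            close(fds[1]);
        }
        printf("buffered");          /* no newline: stays in the stdio buffer */
        if (use_raw_exit)
            _exit(0);                /* leaves without flushing, buffer is lost */
        exit(0);                     /* flushes stdio first, then exits */
    }

    close(fds[1]);                   /* parent keeps only the read end */
    long total = 0;
    char buf[64];
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof buf)) > 0)
        total += n;
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return total;
}
```

With exit() the parent should see all 8 bytes; with _exit() it should see none, because the data never left the child's stdio buffer. The hand-written asm program has no stdio buffers, so the raw exit call is all it needs.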

This code makes 2 system calls: write(1, msg, len) to print the string, then _exit(0) to end the process.

I commented it heavily (to the point where it's starting to obscure the actual code without color syntax highlighting). This is an attempt to point things out to total beginners, not how you should comment your code normally.

section .text             ; Executable code goes in the .text section
global _start             ; The linker looks for this symbol to set the process entry point, so execution starts here
;;;a name followed by a colon defines a symbol.  The global _start directive modifies it so it's a global symbol, not just one that we can CALL or JMP to from inside the asm.
;;; note that _start isn't really a "function".  You can't return from it, and the kernel passes argc, argv, and env differently than main() would expect.
 _start:
    ;;; write(1, msg, len);
    ; Start by moving the arguments into registers, where the kernel will look for them
    mov     edx,len       ; 3rd arg goes in edx: buffer length
    mov     ecx,msg       ; 2nd arg goes in ecx: pointer to the buffer
    ;Set output to stdout (goes to your terminal, or wherever you redirect or pipe)
    mov     ebx,1         ; 1st arg goes in ebx: Unix file descriptor. 1 = stdout, which is normally connected to the terminal.

    mov     eax,4         ; system call number (from SYS_write / __NR_write from unistd_32.h).
    int     0x80          ; generate an interrupt, activating the kernel's system-call handling code.  64-bit code uses a different instruction, different registers, and different call numbers.
    ;; eax = return value, all other registers unchanged.

    ;;;Second, exit the process.  There's nothing to return to, so we can't use a ret instruction (like we could if this was main() or any function with a caller)
    ;;; If we don't exit, execution continues into whatever bytes are next in the memory page,
    ;;; typically leading to a segmentation fault because the padding 00 00 decodes to  add [eax],al.

    ;;; _exit(0);
    xor     ebx,ebx       ; first arg = exit status = 0.  (will be truncated to 8 bits).  Zeroing registers is a special case on x86, and mov ebx,0 would be less efficient.
                      ;; leaving out the zeroing of ebx would mean we exit(1), i.e. with an error status, since ebx still holds 1 from earlier.
    mov     eax,1         ; put __NR_exit into eax
    int     0x80          ;Execute the Linux function

section     .rodata       ; Section for read-only constants

             ;; msg is a label, and in this context doesn't need to be msg:.  It could be on a separate line.
             ;; db = Data Bytes: assemble some literal bytes into the output file.
msg     db  'Hello, world!',0xa     ; ASCII string constant plus a newline (0xa = decimal 10)

             ;;  No terminating zero byte is needed, because we're using write(), which takes a buffer + length instead of an implicit-length string.
             ;; To make this a C string that we could pass to puts or strlen, we'd need a terminating 0 byte. (e.g. "...", 0xa, 0)

len     equ $ - msg       ; Define an assemble-time constant (not stored by itself in the output file, but will appear as an immediate operand in insns that use it)
                          ; Calculate len = string length.  subtract the address of the start
                          ; of the string from the current position ($)
  ;; equivalently, we could have put a str_end: label after the string and done   len equ str_end - msg

Notice that we don't store the string length in data memory anywhere. It's an assemble-time constant, so it's more efficient to have it as an immediate operand than a load. We could also have pushed the string data onto the stack with four push imm32 instructions (14 bytes of data at 4 bytes per push), but bloating the code-size too much isn't a good thing.


On Linux, you can save this file as Hello.asm and build a 32-bit executable from it with these commands:

nasm -felf32 Hello.asm                  # assemble as 32-bit code.  Add -Worphan-labels -g -Fdwarf  for debug symbols and warnings
gcc -static -nostdlib -m32 Hello.o -o Hello     # link without CRT startup code or libc, making a static binary

See this answer for more details on building assembly into 32 or 64-bit static or dynamically linked Linux executables, for NASM/YASM syntax or GNU AT&T syntax with GNU as directives. (Key point: make sure to use -m32 or equivalent when building 32-bit code on a 64-bit host, or you will have confusing problems at run-time.)


You can trace its execution with strace to see the system calls it makes:

$ strace ./Hello 
execve("./Hello", ["./Hello"], [/* 72 vars */]) = 0
[ Process PID=4019 runs in 32 bit mode. ]
write(1, "Hello, world!
", 14Hello, world!
)         = 14
_exit(0)                                = ?
+++ exited with 0 +++

Compare this with the trace for a dynamically linked process (like gcc makes from hello.c, or from running strace /bin/ls) to get an idea just how much stuff happens under the hood for dynamic linking and C library startup.

The trace on stderr and the regular output on stdout are both going to the terminal here, so they interfere in the line with the write system call. Redirect or trace to a file if you care. Notice how this lets us easily see the syscall return values without having to add code to print them, and is actually even easier than using a regular debugger (like gdb) to single-step and look at eax for this. See the bottom of the x86 tag wiki for gdb asm tips. (The rest of the tag wiki is full of links to good resources.)

The x86-64 version of this program would be extremely similar, passing the same args to the same system calls, just in different registers and with syscall instead of int 0x80. See the bottom of What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? for a working example of writing a string and exiting in 64-bit code.
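As a hedged sketch of that register mapping, the C function below issues write through the 64-bit syscall instruction using GNU inline asm. It assumes GCC or Clang on x86-64 Linux; write64 is a hypothetical name, and on other targets the code falls back to libc's syscall() wrapper so that it still builds. It is not the working example from the linked answer.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: write(fd, buf, len) via the 64-bit syscall instruction.
 * Returns bytes written, or a negative errno value (the raw kernel convention). */
long write64(int fd, const void *buf, size_t len)
{
#if defined(__linux__) && defined(__x86_64__) && defined(__LP64__)
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)                 /* return value comes back in rax         */
        : "a"((long)SYS_write),     /* rax = call number (1 on x86-64)        */
          "D"((long)fd),            /* rdi = 1st arg (ebx in the 32-bit ABI)  */
          "S"(buf),                 /* rsi = 2nd arg (ecx in the 32-bit ABI)  */
          "d"(len)                  /* rdx = 3rd arg (edx in the 32-bit ABI)  */
        : "rcx", "r11", "memory");  /* the syscall instruction overwrites rcx and r11 */
    return ret;
#else
    /* Other targets: use libc's generic wrapper and mimic the raw return style. */
    long ret = syscall(SYS_write, fd, buf, len);
    return ret < 0 ? -errno : ret;
#endif
}
```

Exiting follows the same pattern with SYS_exit (60 on x86-64) in rax and the status in rdi. Unlike int 0x80, which leaves all registers other than eax unchanged, the syscall instruction destroys rcx and r11, which is why they appear in the clobber list.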


Related: "A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux" builds the smallest binary file you can run that just makes an exit() system call. It is about minimizing the binary size, not the source size or even just the number of instructions that actually run.
