What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?


Question

int 0x80 on Linux always invokes the 32-bit ABI, regardless of what mode it's called from: args in ebx, ecx, ... and syscall numbers from /usr/include/asm/unistd_32.h. (Or crashes on 64-bit kernels compiled without CONFIG_IA32_EMULATION).

64-bit code should use syscall, with call numbers from /usr/include/asm/unistd_64.h, and args in rdi, rsi, etc. See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64. If your question was marked a duplicate of this, see that link for details on how you should make system calls in 32 or 64-bit code. If you want to understand what exactly happened, keep reading.

(For an example of 32-bit vs. 64-bit sys_write, see Using interrupt 0x80 on 64-bit Linux)


syscall system calls are faster than int 0x80 system calls, so use native 64-bit syscall unless you're writing polyglot machine code that runs the same when executed as 32 or 64 bit. (sysenter always returns in 32-bit mode, so it's not useful from 64-bit userspace, although it is a valid x86-64 instruction.)

Related: The Definitive Guide to Linux System Calls (on x86) for how to make int 0x80 or sysenter 32-bit system calls, or syscall 64-bit system calls, or calling the vDSO for "virtual" system calls like gettimeofday. Plus background on what system calls are all about.


Using int 0x80 makes it possible to write something that will assemble in 32 or 64-bit mode, so it's handy for an exit_group() at the end of a microbenchmark or something.

Current PDFs of the official i386 and x86-64 System V psABI documents that standardize function and syscall calling conventions are linked from https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI.

See the tag wiki for beginner guides, x86 manuals, official documentation, and performance optimization guides / resources.


But since people keep posting questions with code that uses int 0x80 in 64-bit code, or accidentally building 64-bit binaries from source written for 32-bit, I wonder what exactly does happen on current Linux?

Does int 0x80 save/restore all the 64-bit registers? Does it truncate any registers to 32-bit? What happens if you pass pointer args that have non-zero upper halves?

Does it work if you pass it 32-bit pointers?

Answer

TL:DR: int 0x80 works when used correctly, as long as any pointers fit in 32 bits (stack pointers don't fit). But beware that strace decodes it wrong unless you have a very recent strace + kernel.

int 0x80 zeros r8-r11 (probably for historical reasons, discussed below), and preserves everything else. Use it exactly like you would in 32-bit code, with the 32-bit call numbers. (Or better, don't use it!)

Not all systems even support int 0x80: The Windows Subsystem for Linux version 1 (WSL1) is strictly 64-bit only: int 0x80 doesn't work at all. It's also possible to build Linux kernels without IA-32 emulation (no support for 32-bit executables, no support for 32-bit system calls). If you're on WSL, make sure it is actually WSL2 (which uses an actual Linux kernel in a VM).


The details: what's saved/restored, which parts of which regs the kernel uses

int 0x80 uses eax (not the full rax) as the system-call number, dispatching to the same table of function-pointers that 32-bit user-space int 0x80 uses. (These pointers are to sys_whatever implementations or wrappers for the native 64-bit implementation inside the kernel. System calls are really function calls across the user/kernel boundary.)

Only the low 32 bits of arg registers are passed. The upper halves of rbx-rbp are preserved, but ignored by int 0x80 system calls. Note that passing a bad pointer to a system call doesn't result in SIGSEGV; instead the system call returns -EFAULT. If you don't check error return values (with a debugger or tracing tool), it will appear to silently fail.
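To make the "check the return value" advice concrete, here is a minimal C sketch of how a raw Linux system call return value can be classified. The helper names are made up for illustration. The only facts assumed are the usual Linux convention that raw return values in the range -4095..-1 are negated errno codes, and that EFAULT is 14.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Raw Linux system calls report errors as -errno in the return register.
 * Values in [-4095, -1] are errors; anything else is a successful result. */
static bool raw_syscall_failed(long ret)
{
    return ret < 0 && ret >= -4095;
}

/* Return the errno code of a failed raw return value, or 0 on success. */
static int raw_syscall_errno(long ret)
{
    return raw_syscall_failed(ret) ? (int)-ret : 0;
}

/* Values taken from the test program later in this answer. */
static void check_errno_examples(void)
{
    assert(!raw_syscall_failed(14));           /* write() of 14 bytes succeeded */
    assert(raw_syscall_failed(-14));           /* truncated stack pointer */
    assert(raw_syscall_errno(-14) == EFAULT);  /* EFAULT is 14 on Linux */
}
```

In asm the equivalent check is a compare of rax (or eax) against -4095 after the int 0x80 instruction.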

All registers (except eax of course) are saved/restored (including RFLAGS, and the upper 32 bits of integer regs), except that r8-r11 are zeroed. r12-r15 are call-preserved in the x86-64 SysV ABI's function calling convention, so the registers that get zeroed by int 0x80 in 64-bit are the call-clobbered subset of the "new" registers that AMD64 added.

This behaviour has been preserved over some internal changes to how register-saving was implemented inside the kernel, and comments in the kernel mention that it's usable from 64-bit, so this ABI is probably stable. (I.e. you can count on r8-r11 being zeroed, and everything else being preserved.)
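As a summary of the register rules above, here is a toy C model of what a 64-bit caller observes after int 0x80 returns. The struct and function names are hypothetical, and this is only a sketch of the behaviour described here, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Integer registers as seen by 64-bit user space (rsp and rip omitted). */
struct toy_gprs {
    uint64_t rax, rbx, rcx, rdx, rsi, rdi, rbp;
    uint64_t r8, r9, r10, r11, r12, r13, r14, r15;
};

/* Effect of int 0x80 on the caller's registers:
 * rax holds the return value, r8-r11 are zeroed, everything else is kept. */
static void model_int80(struct toy_gprs *g, uint64_t ret)
{
    g->rax = ret;
    g->r8 = g->r9 = g->r10 = g->r11 = 0;
}

static void check_int80_register_effects(void)
{
    const uint64_t pat = 0x0123456789abcdefull;
    struct toy_gprs g = { pat, pat, pat, pat, pat, pat, pat,
                          pat, pat, pat, pat, pat, pat, pat, pat };

    model_int80(&g, 14);
    assert(g.rax == 14);
    assert(g.r8 == 0 && g.r9 == 0 && g.r10 == 0 && g.r11 == 0);
    /* Full 64-bit values survive, including the upper halves. */
    assert(g.rbx == pat && g.rcx == pat && g.rbp == pat);
    assert(g.r12 == pat && g.r15 == pat);
}
```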

The return value is sign-extended to fill 64-bit rax. (Linux declares 32-bit sys_ functions as returning signed long.) This means that pointer return values (like from void *mmap()) need to be zero-extended before use in 64-bit addressing modes.
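The sign-extension can be sketched in C as follows. The function names and the example address 0xf7ff0000 are assumptions for illustration; the point is only the arithmetic: a 32-bit result with its top bit set arrives in rax with the upper 32 bits all ones, so a returned pointer has to be zero-extended again (in asm, something like mov eax, eax) before it is dereferenced.

```c
#include <assert.h>
#include <stdint.h>

/* Model: a 32-bit return value is sign-extended into 64-bit rax. */
static uint64_t rax_after_int80(uint32_t ret32)
{
    uint64_t rax = ret32;
    if (ret32 & 0x80000000u)
        rax |= 0xffffffff00000000ull;  /* copy the sign bit upward */
    return rax;
}

/* What 64-bit code should do before using a returned pointer:
 * keep only the low 32 bits (zero-extension). */
static uint64_t pointer_from_rax(uint64_t rax)
{
    return rax & 0xffffffffull;
}

static void check_extension_examples(void)
{
    /* -EFAULT (-14) is still -14 when viewed as a 64-bit value. */
    assert(rax_after_int80((uint32_t)-14) == (uint64_t)-14);
    /* Hypothetical mmap() result above 2 GiB: high bits get set. */
    assert(rax_after_int80(0xf7ff0000u) == 0xfffffffff7ff0000ull);
    assert(pointer_from_rax(0xfffffffff7ff0000ull) == 0xf7ff0000ull);
    /* Low addresses are unaffected. */
    assert(rax_after_int80(0x00400000u) == 0x00400000ull);
}
```

Error checking has to happen before the zero-extension, because a zero-extended -EFAULT would look like a valid (if odd) address.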

Unlike sysenter, it preserves the original value of cs, so it returns to user-space in the same mode that it was called in. (Using sysenter results in the kernel setting cs to $__USER32_CS, which selects a descriptor for a 32-bit code segment.)


Older strace decodes int 0x80 incorrectly for 64-bit processes. It decodes as if the process had used syscall instead of int 0x80. This can be very confusing. e.g. strace prints write(0, NULL, 12 <unfinished ... exit status 1> for eax=1 / int $0x80, which is actually _exit(ebx), not write(rdi, rsi, rdx).
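The confusion comes from the two ABIs numbering their system calls differently. The C sketch below hard-codes a handful of well-known numbers (the complete lists are in unistd_32.h and unistd_64.h); the lookup functions themselves are hypothetical helpers, not anything strace provides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A few entries of the 32-bit (int 0x80) numbering. */
static const char *name_i386(unsigned nr)
{
    switch (nr) {
    case 1:  return "exit";
    case 3:  return "read";
    case 4:  return "write";
    default: return NULL;   /* not in this tiny table */
    }
}

/* A few entries of the 64-bit (syscall instruction) numbering. */
static const char *name_x86_64(unsigned nr)
{
    switch (nr) {
    case 0:  return "read";
    case 1:  return "write";
    case 60: return "exit";
    default: return NULL;   /* not in this tiny table */
    }
}

static void check_decode_examples(void)
{
    /* eax=1 via int 0x80 is exit, but a decoder that assumes the
     * 64-bit table prints it as write. */
    assert(strcmp(name_i386(1), "exit") == 0);
    assert(strcmp(name_x86_64(1), "write") == 0);
}
```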

I don't know the exact version where the PTRACE_GET_SYSCALL_INFO feature was added, but Linux kernel 5.5 / strace 5.5 handle it. It misleadingly says the process "runs in 32-bit mode" but does decode correctly.


int 0x80 works as long as all arguments (including pointers) fit in the low 32 bits of a register. This is the case for static code and data in the default code model ("small") in the x86-64 SysV ABI. (Section 3.5.1: all symbols are known to be located in the virtual addresses in the range 0x00000000 to 0x7effffff, so you can do stuff like mov edi, hello (AT&T mov $hello, %edi) to get a pointer into a register with a 5 byte instruction).

But this is not the case for position-independent executables, which many Linux distros now configure gcc to make by default (and they enable ASLR for executables). For example, I compiled a hello.c on Arch Linux, and set a breakpoint at the start of main. The string constant passed to puts was at 0x555555554724, so a 32-bit ABI write system call would not work. (GDB disables ASLR by default, so you always see the same address from run to run, if you run from within GDB.)

Linux puts the stack near the "gap" between the upper and lower ranges of canonical addresses, i.e. with the top of the stack at 2^47-1 (0x7fffffffffff, the top of the lower canonical range). (Or somewhere random, with ASLR enabled). So rsp on entry to _start in a typical statically-linked executable is something like 0x7fffffffe550, depending on size of env vars and args. Truncating this pointer to esp gives an address that does not point to any valid memory, so system calls with pointer inputs will typically return -EFAULT if you try to pass a truncated stack pointer. (And your program will crash if you truncate rsp to esp and then do anything with the stack, e.g. if you built 32-bit asm source as a 64-bit executable.)
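The addresses mentioned above can be run through a small C sketch of the truncation. The helper names are invented for illustration; 0x400112 is a code address from the non-PIE test program below, and the other two addresses are the ones quoted in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The int 0x80 path only looks at the low 32 bits of an argument register. */
static uint64_t seen_by_int80(uint64_t reg)
{
    return reg & 0xffffffffull;
}

/* A pointer works with the 32-bit ABI only if truncation leaves it unchanged. */
static bool usable_with_int80(uint64_t ptr)
{
    return seen_by_int80(ptr) == ptr;
}

static void check_address_examples(void)
{
    /* Static code/data in a non-PIE executable: fine. */
    assert(usable_with_int80(0x400112ull));

    /* String constant in a PIE executable: truncated to a different address. */
    assert(!usable_with_int80(0x555555554724ull));
    assert(seen_by_int80(0x555555554724ull) == 0x55554724ull);

    /* Typical 64-bit stack pointer: also truncated. */
    assert(!usable_with_int80(0x7fffffffe550ull));
    assert(seen_by_int80(0x7fffffffe550ull) == 0xffffe550ull);
}
```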


How it works in the kernel:

In the Linux source code, arch/x86/entry/entry_64_compat.S defines ENTRY(entry_INT80_compat). Both 32 and 64-bit processes use the same entry point when they execute int 0x80.

entry_64.S defines native entry points for a 64-bit kernel, which includes interrupt / fault handlers and syscall native system calls from long mode (aka 64-bit mode) processes.

entry_64_compat.S defines system-call entry-points from compat mode into a 64-bit kernel, plus the special case of int 0x80 in a 64-bit process. (sysenter in a 64-bit process may go to that entry point as well, but it pushes $__USER32_CS, so it will always return in 32-bit mode.) There's a 32-bit version of the syscall instruction, supported on AMD CPUs, and Linux supports it too for fast 32-bit system calls from 32-bit processes.

I guess a possible use-case for int 0x80 in 64-bit mode is if you wanted to use a custom code-segment descriptor that you installed with modify_ldt. int 0x80 pushes segment registers itself for use with iret, and Linux always returns from int 0x80 system calls via iret. The 64-bit syscall entry point sets pt_regs->cs and ->ss to constants, __USER_CS and __USER_DS. (It's normal that SS and DS use the same segment descriptors. Permission differences are done with paging, not segmentation.)

entry_32.S defines entry points into a 32-bit kernel, and is not involved at all.

The int 0x80 entry point in Linux 4.12's entry_64_compat.S:

/*
 * 32-bit legacy system call entry.
 *
 * 32-bit x86 Linux system calls traditionally used the INT $0x80
 * instruction.  INT $0x80 lands here.
 *
 * This entry point can be used by 32-bit and 64-bit programs to perform
 * 32-bit system calls.  Instances of INT $0x80 can be found inline in
 * various programs and libraries.  It is also used by the vDSO's
 * __kernel_vsyscall fallback for hardware that doesn't support a faster
 * entry method.  Restarted 32-bit system calls also fall back to INT
 * $0x80 regardless of what instruction was originally used to do the
 * system call.
 *
 * This is considered a slow path.  It is not used by most libc
 * implementations on modern hardware except during process startup.
 ...
 */
 ENTRY(entry_INT80_compat)
 ...  (see the github URL for the full source)

The code zero-extends eax into rax, then pushes all the registers onto the kernel stack to form a struct pt_regs. This is where it will restore from when the system call returns. It's in a standard layout for saved user-space registers (for any entry point), so ptrace from another process (like gdb or strace) will read and/or write that memory if it uses ptrace while this process is inside a system call. (ptrace modification of registers is one thing that makes return paths complicated for the other entry points. See comments.)

But it pushes $0 instead of r8/r9/r10/r11. (sysenter and AMD syscall32 entry points store zeros for r8-r15.)

I think this zeroing of r8-r11 is to match historical behaviour. Before the "Set up full pt_regs for all compat syscalls" commit, the entry point only saved the C call-clobbered registers. It dispatched directly from asm with call *ia32_sys_call_table(, %rax, 8), and those functions follow the calling convention, so they preserve rbx, rbp, rsp, and r12-r15. Zeroing r8-r11 instead of leaving them undefined was to avoid info leaks from a 64-bit kernel to 32-bit user-space (which could far jmp to a 64-bit code segment to read anything the kernel left there).

The current implementation (Linux 4.12) dispatches 32-bit-ABI system calls from C, reloading the saved ebx, ecx, etc. from pt_regs. (64-bit native system calls dispatch directly from asm, with only a mov %r10, %rcx needed to account for the small difference in calling convention between functions and syscall. Unfortunately it can't always use sysret, because CPU bugs make it unsafe with non-canonical addresses. It does try to, so the fast-path is pretty damn fast, although syscall itself still takes tens of cycles.)

Anyway, in current Linux, 32-bit syscalls (including int 0x80 from 64-bit) eventually end up in do_syscall_32_irqs_on(struct pt_regs *regs). It dispatches to a function pointer in ia32_sys_call_table, with 6 zero-extended args. This maybe avoids needing a wrapper around the 64-bit native syscall function in more cases to preserve that behaviour, so more of the ia32 table entries can be the native system call implementation directly.

Linux 4.12 arch/x86/entry/common.c

if (likely(nr < IA32_NR_syscalls)) {
  /*
   * It's possible that a 32-bit syscall implementation
   * takes a 64-bit parameter but nonetheless assumes that
   * the high bits are zero.  Make sure we zero-extend all
   * of the args.
   */
  regs->ax = ia32_sys_call_table[nr](
      (unsigned int)regs->bx, (unsigned int)regs->cx,
      (unsigned int)regs->dx, (unsigned int)regs->si,
      (unsigned int)regs->di, (unsigned int)regs->bp);
}

syscall_return_slowpath(regs);

In older versions of Linux that dispatch 32-bit system calls from asm (like 64-bit still did until 4.15; see footnote 1), the int80 entry point itself puts args in the right registers with mov and xchg instructions, using 32-bit registers. It even uses mov %edx,%edx to zero-extend EDX into RDX (because arg3 happens to use the same register in both conventions). This code is duplicated in the sysenter and syscall32 entry points.

Footnote 1: Linux 4.15 (I think) introduced Spectre / Meltdown mitigations, and a major revamp of the entry points that made them a trampoline for the Meltdown case. It also sanitized the incoming registers to avoid user-space values other than actual args being in registers during the call (when some Spectre gadget might run), by storing them, zeroing everything, then calling to a C wrapper that reloads just the right widths of args from the struct saved on entry.

I'm planning to leave this answer describing the much simpler mechanism because the conceptually useful part here is that the kernel side of a syscall involves using EAX or RAX as an index into a table of function pointers, with other incoming register values copied to the places where the calling convention wants args to go. i.e. syscall is just a way to make a call into the kernel, to its dispatch code.
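That mechanism can be mimicked in ordinary user-space C. The sketch below is a toy model only: toy_regs, the two handlers, and toy_call_table are invented stand-ins for struct pt_regs, the sys_whatever functions, and ia32_sys_call_table, and the out-of-range handling (returning -ENOSYS) is simplified compared with the real entry code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Saved user registers, loosely modelled on the struct pt_regs fields
 * that the 32-bit ABI uses. */
struct toy_regs {
    uint64_t ax, bx, cx, dx, si, di, bp;
};

typedef int64_t (*toy_sys_fn)(uint64_t, uint64_t, uint64_t,
                              uint64_t, uint64_t, uint64_t);

/* Hypothetical handlers standing in for sys_whatever implementations. */
static int64_t toy_first_arg(uint64_t a, uint64_t b, uint64_t c,
                             uint64_t d, uint64_t e, uint64_t f)
{
    (void)b; (void)c; (void)d; (void)e; (void)f;
    return (int64_t)a;
}

static int64_t toy_sum3(uint64_t a, uint64_t b, uint64_t c,
                        uint64_t d, uint64_t e, uint64_t f)
{
    (void)d; (void)e; (void)f;
    return (int64_t)(a + b + c);
}

static const toy_sys_fn toy_call_table[] = { toy_first_arg, toy_sum3 };
#define TOY_NR_SYSCALLS (sizeof toy_call_table / sizeof toy_call_table[0])

/* Dispatch: index the table with eax and pass six zero-extended args. */
static void toy_do_int80(struct toy_regs *regs)
{
    uint32_t nr = (uint32_t)regs->ax;   /* eax only, upper half ignored */

    if (nr < TOY_NR_SYSCALLS) {
        regs->ax = (uint64_t)toy_call_table[nr](
            (uint32_t)regs->bx, (uint32_t)regs->cx,
            (uint32_t)regs->dx, (uint32_t)regs->si,
            (uint32_t)regs->di, (uint32_t)regs->bp);
    } else {
        regs->ax = (uint64_t)-(int64_t)ENOSYS;
    }
}

static void check_dispatch_examples(void)
{
    struct toy_regs r = { .ax = 0xffffffff00000000ull,   /* nr 0 + high garbage */
                          .bx = 0xffffffff00000001ull }; /* arg 1 + high garbage */
    toy_do_int80(&r);
    assert(r.ax == 1);   /* the handler saw only the low 32 bits of bx */

    struct toy_regs bad = { .ax = 99 };
    toy_do_int80(&bad);
    assert(bad.ax == (uint64_t)-(int64_t)ENOSYS);
}
```

The casts to uint32_t play the same role as the (unsigned int) casts in the kernel snippet above: every handler sees a zero-extended 32-bit value regardless of what was in the upper half of the register.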


Simple example / test program:

I wrote a simple Hello World (in NASM syntax) which sets all registers to have non-zero upper halves, then makes two write() system calls with int 0x80, one with a pointer to a string in .rodata (succeeds), the second with a pointer to the stack (fails with -EFAULT).

Then it uses the native 64-bit syscall ABI to write() the chars from the stack (64-bit pointer), and again to exit.

So all of these examples are using the ABIs correctly, except for the 2nd int 0x80 which tries to pass a 64-bit pointer and has it truncated.

If you built it as a position-independent executable, the first one would fail too. (You'd have to use a RIP-relative lea instead of mov to get the address of hello: into a register.)

I used gdb, but use whatever debugger you prefer. Use one that highlights changed registers since the last single-step. gdbgui works well for debugging asm source, but is not great for disassembly. Still, it does have a register pane that works well for integer regs at least, and it worked great on this example.

See the inline ;;; comments describing how registers are changed by system calls

global _start
_start:
    mov  rax, 0x123456789abcdef
    mov  rbx, rax
    mov  rcx, rax
    mov  rdx, rax
    mov  rsi, rax
    mov  rdi, rax
    mov  rbp, rax
    mov  r8, rax
    mov  r9, rax
    mov  r10, rax
    mov  r11, rax
    mov  r12, rax
    mov  r13, rax
    mov  r14, rax
    mov  r15, rax

    ;; 32-bit ABI
    mov  rax, 0xffffffff00000004          ; high garbage + __NR_write (unistd_32.h)
    mov  rbx, 0xffffffff00000001          ; high garbage + fd=1
    mov  rcx, 0xffffffff00000000 + .hello
    mov  rdx, 0xffffffff00000000 + .hellolen
    ;std
after_setup:       ; set a breakpoint here
    int  0x80                   ; write(1, hello, hellolen);   32-bit ABI
    ;; succeeds, writing to stdout
;;; changes to registers:   r8-r11 = 0.  rax=14 = return value

    ; ebx still = 1 = STDOUT_FILENO
    push 'bye' + (0xa<<(3*8))
    mov  rcx, rsp               ; rcx = 64-bit pointer that won't work if truncated
    mov  edx, 4
    mov  eax, 4                 ; __NR_write (unistd_32.h)
    int  0x80                   ; write(ebx=1, ecx=truncated pointer,  edx=4);  32-bit
    ;; fails, nothing printed
;;; changes to registers: rax=-14 = -EFAULT  (from /usr/include/asm-generic/errno-base.h)

    mov  r10, rax               ; save return value as exit status
    mov  r8, r15
    mov  r9, r15
    mov  r11, r15               ; make these regs non-zero again

    ;; 64-bit ABI
    mov  eax, 1                 ; __NR_write (unistd_64.h)
    mov  edi, 1
    mov  rsi, rsp
    mov  edx, 4
    syscall                     ; write(edi=1, rsi='bye\n' on the stack,  rdx=4);  64-bit
    ;; succeeds: writes to stdout and returns 4 in rax
;;; changes to registers: rax=4 = length return value
;;; rcx = 0x400112 = RIP.   r11 = 0x302 = eflags with an extra bit set.
;;; (This is not a coincidence, it's how sysret works.  But don't depend on it, since iret could leave something else)

    mov  edi, r10d
    ;xor  edi,edi
    mov  eax, 60                ; __NR_exit (unistd_64.h)
    syscall                     ; _exit(edi = second int 0x80 result);  64-bit
    ;; succeeds, exit status = low byte of second int 0x80 result (-14) = 242

section .rodata
_start.hello:    db "Hello World!", 0xa, 0
_start.hellolen  equ   $ - _start.hello

Build it into a 64-bit static binary with

yasm -felf64 -Worphan-labels -gdwarf2 abi32-from-64.asm
ld -o abi32-from-64 abi32-from-64.o

Run gdb ./abi32-from-64. In gdb, run set disassembly-flavor intel and layout reg if you don't have that in your ~/.gdbinit already. (GAS .intel_syntax is like MASM, not NASM, but they're close enough that it's easy to read if you like NASM syntax.)

(gdb)  set disassembly-flavor intel
(gdb)  layout reg
(gdb)  b  after_setup
(gdb)  r
(gdb)  si                     # step instruction
    press return to repeat the last command, keep stepping

Press control-L when gdb's TUI mode gets messed up. This happens easily, even when programs don't print to stdout themselves.
