How does this C program without libc work?

Views: 37 Published: 2021/9/4 18:41:45 Tags: c assembly x86-64 system-calls abi


Question

I came across a minimal HTTP server that is written without libc: https://github.com/Francesco149/nolibc-httpd

I can see that basic string handling functions are defined, leading to the write syscall:

#define fprint(fd, s) write(fd, s, strlen(s))
#define fprintn(fd, s, n) write(fd, s, n)
#define fprintl(fd, s) fprintn(fd, s, sizeof(s) - 1)
#define fprintln(fd, s) fprintl(fd, s "\n")
#define print(s) fprint(1, s)
#define printn(s, n) fprintn(1, s, n)
#define printl(s) fprintl(1, s)
#define println(s) fprintln(1, s)

And the basic syscalls are declared in the C file:

size_t read(int fd, void *buf, size_t nbyte);
ssize_t write(int fd, const void *buf, size_t nbyte);
int open(const char *path, int flags);
int close(int fd);
int socket(int domain, int type, int protocol);
int accept(int socket, sockaddr_in_t *restrict address,
           socklen_t *restrict address_len);
int shutdown(int socket, int how);
int bind(int socket, const sockaddr_in_t *address, socklen_t address_len);
int listen(int socket, int backlog);
int setsockopt(int socket, int level, int option_name, const void *option_value,
               socklen_t option_len);
int fork();
void exit(int status);

So I guess the magic happens in start.S, which contains _start and a special way of encoding syscalls by creating global labels which fall through and accumulating values in r9 to save bytes:

.intel_syntax noprefix

/* functions: rdi, rsi, rdx, rcx, r8, r9 */
/*  syscalls: rdi, rsi, rdx, r10, r8, r9 */
/*                           ^^^         */
/* stack grows from a high address to a low address */

#define c(x, n) \
.global x; \
x:; \
  add r9,n

c(exit, 3)       /* 60 */
c(fork, 3)       /* 57 */
c(setsockopt, 4) /* 54 */
c(listen, 1)     /* 50 */
c(bind, 1)       /* 49 */
c(shutdown, 5)   /* 48 */
c(accept, 2)     /* 43 */
c(socket, 38)    /* 41 */
c(close, 1)      /* 03 */
c(open, 1)       /* 02 */
c(write, 1)      /* 01 */
.global read     /* 00 */
read:
  mov r10,rcx
  mov rax,r9
  xor r9,r9
  syscall
  ret

.global _start
_start:
  xor rbp,rbp
  xor r9,r9
  pop rdi     /* argc */
  mov rsi,rsp /* argv */
  call main
  call exit

Is this understanding correct? GCC uses the symbols defined in start.S for the syscalls, then the program starts in _start and calls main from the C file?

Also how does the separate httpd.asm custom binary work? Just hand-optimized assembly combining the C source and start assembly?

Answer

(I cloned the repo and tweaked the .c and .S to compile better with clang -Oz: 992 bytes, down from the original 1208 with gcc. See the WIP-clang-tuning branch in my fork, until I get around to cleaning that up and sending a pull request. With clang, inline asm for the syscalls does save size overall, especially once main has no calls and no rets. IDK if I want to hand-golf the whole .asm after regenerating from compiler output; there are certainly chunks of it where significant savings are possible, e.g. using lodsb in loops.)

It looks like they need r9 to be 0 before a call to any of these labels, either with a register global var or maybe gcc -ffixed-r9 to tell GCC to keep its hands off that register permanently. Otherwise GCC would have left whatever garbage in r9, just like other registers.

Their functions are declared with normal prototypes, not 6 args with dummy 0 args to get every call site to actually zero r9, so that's not how they're doing it.

> special way of encoding syscalls

I wouldn't describe that as "encoding syscalls". Maybe "defining syscall wrapper functions". They're defining their own wrapper function for each syscall, in an optimized way that falls through into one common handler at the bottom. In the C compiler's asm output, you'll still see call write.

(It might have been more compact for the final binary to use inline asm to let the compiler inline a syscall instruction with the args in the right registers, instead of making it look like a normal function that clobbers all the call-clobbered registers. Especially if compiled with clang -Oz which would use 3-byte push 2 / pop rax instead of 5-byte mov eax, 2 to set up the call number. push imm8/pop/syscall is the same size as call rel32.)

Yes, you can define functions in hand-written asm with .global foo / foo:. You could look at this as one large function with multiple entry points for different syscalls. In asm, execution always passes to the next instruction, regardless of labels, unless you use a jump/call/ret instruction. The CPU doesn't know about labels.

So it's just like a C switch(){} statement without break; between case: labels, or like C labels you can jump to with goto. Except of course in asm you can do this at global scope, while in C you can only goto within a function. And in asm you can call instead of just goto (jmp).

    static long callnum = 0;     // r9 = 0  before a call to any of these

    ...
    socket:
       callnum += 38;
    close:
       callnum++;         // can use inc instead of add 1
    open:                 // missed optimization in their asm
       callnum++;
    write:
       callnum++;
    read:
       tmp=callnum;
       callnum=0;
       retval = syscall(tmp, args);

Or if you recast this as a chain of tailcalls, where we can omit even the jmp foo and instead just fall through: C like this truly could compile to the hand-written asm, if you had a smart enough compiler. (And you could solve the arg-type

register long callnum asm("r9");     // GCC extension

long open(args...) {
   callnum++;
   return write(args...);
}
long write(args...) {
   callnum++;
   return read(args...); // tailcall
}
long read(args...) {
   tmp = callnum;
   callnum = 0;          // reset callnum for next call
   return syscall(tmp, args...);
}

args... are the arg-passing registers (RDI, RSI, RDX, RCX, R8) which they simply leave unmodified. R9 is the last arg-passing register for x86-64 System V, but they didn't use any syscalls that take 6 args. setsockopt takes 5 args so they couldn't skip the mov r10, rcx. But they were able to use r9 for something else, instead of needing it to pass the 6th arg.

It's amusing that they're trying so hard to save bytes at the expense of performance, but still use xor rbp,rbp instead of xor ebp,ebp. Unless they build with gcc -Wa,-Os start.S, GAS won't optimize away the REX prefix for you. (See "Does GCC optimize assembly source file?")

They could save another byte with xchg rax, r9 (2 bytes including REX) instead of mov rax, r9 (3 bytes: REX + opcode + modrm). (See the Code Golf.SE tips for x86 machine code.)

I'd also have used xchg eax, r9d because I know Linux system call numbers fit in 32 bits, although it wouldn't save code size because a REX prefix is still needed to encode the r9d register number. Also, in the cases where they only need to add 1, inc r9d is only 3 bytes, vs. add r9d, 1 being 4 bytes (REX + opcode + modrm + imm8). (The no-modrm short-form encoding of inc is only available in 32-bit mode; in 64-bit mode it's repurposed as a REX prefix.)

mov rsi,rsp could also save a byte as push rsp / pop rsi (1 byte each) instead of 3-byte REX + mov. That would make room for returning main's return value with xchg edi, eax before call exit.

But since they're not using libc, they could inline that exit, or put the syscalls below _start so they can just fall into it, because exit happens to be the highest-numbered syscall! Or at least jmp exit since they don't need stack alignment, and jmp rel8 is more compact than call rel32.

> Also how does the separate httpd.asm custom binary work? Just hand-optimized assembly combining the C source and start assembly?

No, that's fully stand-alone incorporating the start.S code (at the ?_017: label), and maybe hand-tweaked compiler output. Perhaps from hand-tweaking disassembly of a linked executable, hence not having nice label names even for the part from the hand-written asm. (Specifically, from Agner Fog's objconv, which uses that format for labels in its NASM-syntax disassembly.)

(Ruslan also pointed out stuff like jnz after cmp, instead of jne which has the more appropriate semantic meaning for humans, so another sign of it being compiler output, not hand-written.)

I don't know how they arranged to get the compiler not to touch r9. It seems to be just luck. The readme indicates that just compiling the .c and .S works for them, with their GCC version.

As for the ELF headers, see the comment at the top of the file, which links to "A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux". You'd assemble this with nasm -fbin, and the output is a complete ELF binary, ready to run. It is not a .o that you need to link + strip, so you get to account for every single byte in the file.
