Race condition when piping through x86-64 assembly program

Question

I wrote the following simplified implementation of cat in assembly. It uses linux syscalls because I am running linux. Here's the code:

.section .data
.set MAX_READ_BYTES, 0xffff

.section .text
.globl _start

_start:
    movq (%rsp), %r10 # save the value of argc somewhere else
    movq 16(%rsp), %r9 # save the value of argv[1] somewhere else

    movl $12, %eax # syscall 12 is brk. see brk(2)
    xorq %rdi, %rdi # call with 0 as first arg to get current end of memory
    syscall
    movq %rax, %r8 # this is the address of the current end of memory

    leaq MAX_READ_BYTES(%rax), %rdi # let this be the new end of memory
    movl $12, %eax # syscall 12, brk
    syscall
    cmp %r8, %rax # compare the two; if the allocation failed, these will be equal
    je exit

    leaq -MAX_READ_BYTES(%rax), %r13 # store the start of the free area in %r13

    movq %r10, %rdi # retrieve the value of argc
    cmpq $0x01, %rdi # if there are no cli args, process stdin instead
    je stdin

    # open the file
    movl $0x02, %eax # syscall #2 = open.
    movq %r9, %rdi
    movl $0, %esi # second argument: flags. 0 means read-only.
    xorq %rdx, %rdx # this argument isn't used here, but zero it out for peace of mind.
    syscall # returns the file descriptor number in %rax
    movl %eax, %edi
    movl %edi, %r12d # first argument: file descriptor.
    call read_and_write
    jmp cleanup

stdin:
    movl $0x0000, %edi # first argument: file descriptor.
    movl %edi, %r12d # first argument: file descriptor.
    call read_and_write
    jmp cleanup

read_and_write:
    # read the file.
    movl $0, %eax # syscall #0 = read.
    movl %r12d, %edi
    movq %r13 /* pointer to allocated memory */, %rsi # second argument: address of a writeable buffer.
    movl $MAX_READ_BYTES, %edx # third argument: number of bytes to write.
    syscall # num bytes read in %rax
    movl %eax, %r15d

    # print the file
    movl $1, %eax # syscall #1 = write.
    movl $1, %edi # first argument: file descriptor. 1 is stdout.
    movq %r13, %rsi # second argument: address of data to write.
    movl %r15d, %edx # third argument: number of bytes to write.
    syscall # result ignored.
    cmpq $MAX_READ_BYTES, %r15
    je read_and_write
    ret

cleanup:
    # close the file
    movl $0x03, %eax # syscall #3 = close.
    movl %r14d, %edi # first arg: file descriptor number.
    syscall # result ignored.

exit:
    # set the exit code
    movl $60, %eax # syscall #60 = exit.
    movq $0, %rdi # exit 0 = success.
    syscall

I have assembled this into an ELF binary called asmcat. To test this program, I've got the file /tmp/random:

$ wc -c /tmp/random
94870 /tmp/random

When I run the following, the results are consistent:

$ ./asmcat /tmp/random | wc -c
94870

Here are two separate runs of the same command:

$ cat /tmp/random | ./asmcat | wc -c
65536

$ cat /tmp/random | ./asmcat | wc -c
94870

Redirecting the output to a file consistently generates files of the same size:

for i in {0..25}; do
    cat /tmp/random | ./asmcat > /tmp/asmcat-output-$i
done
for i in {0..25}; do
    wc -c /tmp/asmcat-output-$i
done

All of the resulting files have the same size, 94870. This leads me to believe that the pipe to wc is what is causing the inconsistent behavior. All my program should be doing is reading stdin, 65535 bytes at a time, and writing to stdout. It's possible that there's a bug in the program, but then, why would it consistently redirect to files of consistent sizes? So my strong feeling is that something about the piping is causing an inconsistent measure of the size of my assembly program's output.

Any feedback is welcome, including the approach taken in the assembly program (which I just wrote for fun/practice).

Recommended answer

TL;DR: If your program does two reads before cat can refill the pipe buffer, the 2nd read gets only 1 byte. That makes your program decide to exit prematurely.

That's the real bug. The other design choices that make this possible are performance problems, not correctness.

Your program stops after any short-read (one where the return value is less than the requested size), instead of waiting for EOF (read() == 0). This is a simplification that's sometimes safe for regular files, but not safe for anything else, especially not a TTY (terminal input), but also not for pipes or sockets. e.g. try running ./asmcat; it exits after you press return on one line, instead of waiting for control-D EOF.
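
To make the stop condition concrete, here is a minimal sketch in Python rather than assembly (os.read and os.write are thin wrappers over the same system calls). The name copy_until_eof and its default buffer size are assumptions for illustration, not part of the original program. The loop ends only when read returns zero bytes, so a short read simply means another trip around the loop.

```python
import os

def copy_until_eof(in_fd: int, out_fd: int, bufsize: int = 1 << 16) -> int:
    """Copy in_fd to out_fd until read() reports EOF; return the bytes read."""
    total = 0
    while True:
        chunk = os.read(in_fd, bufsize)  # may legitimately return fewer than bufsize bytes
        if not chunk:                    # zero bytes is the only EOF signal
            return total
        os.write(out_fd, chunk)          # short writes are not handled in this sketch
        total += len(chunk)
```

Called as copy_until_eof(0, 1), this keeps copying from a terminal or a pipe until the writer closes its end, no matter how small each individual read is.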

Linux pipe buffers are by default only 64KiB (pipe(7) man page), 1 byte larger than the weird odd-sized buffer you're using. After cat's write fills the pipe buffer, your 65535-byte read leaves 1 byte remaining. If your program wins the race to read the pipe before cat can write again, it reads only 1 byte.
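
The leftover byte can be reproduced without any race by playing both sides of the pipe in one process. This is a rough Python sketch; the helper name drain_after_one_fill is hypothetical, and both ends are made non-blocking only so the demonstration cannot hang on a system whose pipe capacity differs from the usual default.

```python
import os

def drain_after_one_fill(read_size: int = 0xFFFF) -> tuple[int, int, int]:
    """Write read_size + 1 bytes into a fresh pipe, then read twice with read_size.

    Returns (bytes_accepted_by_pipe, first_read, second_read).
    """
    r, w = os.pipe()
    try:
        # Non-blocking on both ends so the sketch cannot hang if this
        # system's pipe capacity is smaller than the usual 64 KiB.
        os.set_blocking(r, False)
        os.set_blocking(w, False)
        accepted = os.write(w, bytes(read_size + 1))
        first = len(os.read(r, read_size))
        try:
            second = len(os.read(r, read_size))  # whatever the first read left behind
        except BlockingIOError:
            second = 0                           # pipe was already empty
        return accepted, first, second
    finally:
        os.close(r)
        os.close(w)
```

On a Linux system with the default 64 KiB pipe capacity this would be expected to return (65536, 65535, 1). A loop that stops on any short read treats that 1-byte result as the end of input.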

Unfortunately, running under strace ./asmcat slows down the reads too much to observe a short-read, unless you also slow down cat or whatever other program to rate-limit the write side of your input pipe.

pv(1), the pipe-viewer, is handy for this, with its rate-limit -L option and a buffer-size limit so you can make sure its writes are smaller than 64k. (Doing a larger 64k write very infrequently might not always lead to short reads.) But if we just want short reads always, running interactively and reading from a terminal is even easier: strace ./asmcat

$ pv -L8K -B16K /tmp/random | strace ./orig_asmcat | wc -c
execve("./orig_asmcat", ["./orig_asmcat"], 0x7ffcd441f750 /* 55 vars */) = 0
brk(NULL)                               = 0x61c000
brk(0x62bfff)                           = 0x62bfff
read(0, "=head1 NAME\n\n=for comment  Gener"..., 65535) = 819
write(1, "=head1 NAME\n\n=for comment  Gener"..., 819) = 819
close(0)                                = 0
exit(0)                                 = ?
+++ exited with 0 +++   # end of strace output
819                     # wc output
 819 B 0:00:00 [4.43KiB/s] [>              ]  0%        # pv's progress bar

vs. with a bugfixed asmcat, we get the expected sequence of short-reads and equal-sized writes. (See below for my version)

execve("./asmcat", ["./asmcat"], 0x7ffd8c58f600 /* 55 vars */) = 0
read(0, "=head1 NAME\n\n=for comment  Gener"..., 65536) = 819
write(1, "=head1 NAME\n\n=for comment  Gener"..., 819) = 819
read(0, "check if a\nnamed variable exists"..., 65536) = 819
write(1, "check if a\nnamed variable exists"..., 819) = 819


Code review

There are multiple wasted instructions, e.g. a mov that writes a register you never read again, like setting EDI before a call, but then the function call takes R12D as the arg, instead of the standard calling convention.

Reading argc, argv early instead of just leaving them on the stack until they're needed is similarly redundant.

.data is pointless: .set is an assemble-time constant. It doesn't matter what the current section is when you define it. You could also write it as MAX_READ_BYTES = 0xffff, more natural syntax for assemble-time constants.

You could allocate your buffer on the stack instead of with brk (it's only 64K - 1, and x86-64 Linux allows 8MiB stacks by default), in which case loading early could make sense. Or just use the BSS, e.g. .lcomm buf, 1<<16

It would be a good idea to make your buffer a power of 2, or at least a multiple of the page size (4k), for efficiency. If you use it to copy files, every read after the first one will start near the end of a page, instead of copying a whole number of 4k pages, so the kernel's copy_to_user (read) and copy_from_user (write) will be touching 17 pages of kernel memory per read/write instead of 16. The pagecache for the file data may not be in contiguous kernel addresses, so each separate 4k page takes some overhead to find, and start a separate memcpy for (rep movsb on modern CPUs with the ERMSB feature). Also for disk I/O, the kernel will have to buffer your writes back into aligned chunks of some multiple of the HW sector size and/or filesystem block size.
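
The 17-versus-16 figure follows from simple arithmetic on file offsets. Here is a small Python sketch of that count, assuming 4 KiB pages and back-to-back reads of a regular file starting at offset 0; the function names are made up for illustration.

```python
PAGE = 4096  # assumed page size

def pages_touched(offset: int, length: int, page: int = PAGE) -> int:
    """Number of distinct pages covered by the byte range [offset, offset + length)."""
    first = offset // page
    last = (offset + length - 1) // page
    return last - first + 1

def pages_per_read(bufsize: int, n_reads: int) -> list[int]:
    """Pages touched by each of n_reads consecutive reads of bufsize bytes."""
    return [pages_touched(i * bufsize, bufsize) for i in range(n_reads)]
```

For example, pages_per_read(0xFFFF, 4) gives [16, 17, 17, 17], while pages_per_read(1 << 16, 4) gives [16, 16, 16, 16].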

64KiB is clearly a good choice when reading from pipes, for the same reason this race was possible. Leaving 1 byte is obviously inefficient. Also, 64k is smaller than L2 cache sizes, so the copy to/from user-space (inside the kernel in your system calls) can re-read from L2 cache when you write again. But smaller sizes mean more system calls, and each system-call has significant overhead (especially with Meltdown and Spectre mitigation in modern kernels.)

64KiB to 128KiB is about a sweet spot for buffer size, given 256KiB L2 caches being typical. (Related: the code golf question "Fastest yes in the West" tunes a program that just makes write system calls on x86-64 Linux, with profiling / benchmark results on my Skylake desktop.)

Nothing in the machine code benefits from the size fitting in a uint16_t like 0xFFFF does; either int8_t or int32_t are relevant for immediate operand sizes in 64-bit code. (Or uint32_t if you're zero-extending like mov $imm32, %edx to zero-extend into RDX.)

Don't close stdin; you run close unconditionally. closing stdin doesn't affect the parent process's stdin so it shouldn't be a problem in this program, but the whole point of close seems to be to make this more like a function you could use from a large program. So you should separate your copying fd to stdout from the file-handling.

Use #include <asm/unistd.h> to get call numbers instead of hardcoding them. They're guaranteed stable, but it's more human readable / self-documenting to just use the named constants, and avoids any risk of copying errors. (Build with gcc -nostdlib -static asmcat.S -o asmcat; GCC runs .S files through the C preprocessor before assembling, unlike .s)

Style: I like to indent operands to a consistent column so they're not crowding mnemonics. Similarly, comments should be comfortably to the right of operands so you can scan down the column for instructions accessing any given register without getting distracted by comments on shorter instructions.

Comment content: The instruction itself already says what it does, the comment should describe the semantic meaning. (I don't need comments to remind me of calling conventions, like that system calls leave a result in RAX, but even if you do, summarizing the system call with a C version of it can be a good reminder of which arg is which. Like open(argv[1], O_RDONLY).)

I also like to remove redundant operand-size suffixes; the register sizes imply operand-size (just like Intel-syntax). Note that zeroing a 64-bit register only requires xorl; writing a 32-bit register implicitly zero-extends to 64-bit. Your code is sometimes inconsistent about whether things should be 32 or 64-bit. In my rewrite, I used 32-bit everywhere I could. (Except for cmp %rax, %rdx on the return value from write, which seemed like a good idea to make 64-bit, although I don't think there's any real reason.)

I removed the call/ret stuff, and just let it fall through into cleanup/exit instead of trying to separate it into "functions".

I also changed the buffer size to 64KiB exactly, allocated on the stack with 4k page alignment, and rearranged things to simplify and save instructions everywhere.

Also added a # TODO comment about short writes. That doesn't seem to happen for pipe writes up to 64k; Linux just blocks the write until the buffer has room, but could be a problem writing to a socket maybe? Or maybe only with a larger size, or if a signal like SIGTSTP or SIGSTOP interrupts write()

#include <asm/unistd.h>
BUFSIZE = 1<<16

.section .text
.globl _start
_start:
    pop  %rax      # argc
    pop  %rdi
    pop  %rdi      # argv[1]
     # you'd only ever want to read args this way in _start, which isn't a function

    and  $-4096, %rsp           # round RSP down to a page boundary.
    sub  $BUFSIZE, %rsp         # reserve 64K buffer aligned by 4k

    dec  %eax      # if argc == 1,  then run with input fd = 0   (stdin)
    jz  .Luse_stdin

    # open argv[1]
    mov     $__NR_open, %eax 
    xor     %esi, %esi     # flags: 0 means read-only.
    xor     %edx, %edx     # mode unused without O_CREAT, but zero it out for peace of mind.
    syscall       # fd = open(argv[1], O_RDONLY)

.Luse_stdin:           # don't use stdin as a symbol name; stdio.h / libc also has one of type FILE*
    mov  %eax, %ebx     # save FD
    mov  %rsp, %rsi     # always read and write the same buffer
    jmp  .Lentry        # start with a read then EOF-check as loop condition
              # since we're now error-checking the write,
              # rotating the loop maybe wasn't helpful after all
              # and perhaps just read at the top so we can fall into it would work equally well

read_and_write:              # do {
    # print the file
    mov     %eax, %edx             # size = read_size
    mov     $__NR_write, %eax      # syscall #1 = write.
    mov     $1, %edi               # output fd always stdout
    #mov     %rsp, %rsi             # buf, done once outside loop
    syscall                        # write(1, buf, read_size)

    cmp     %rax, %rdx             # written size should match request
    jne     cleanup                 # TODO: handle short writes by calling again for the unwritten part of the buffer, e.g. add %rax, %rsi
                                    # but also check for write errors.
.Lentry:
     # read the file.
    mov    $__NR_read, %eax     # xor  %eax, %eax
    mov    %ebx, %edi           # input FD
   # mov    %rsp, %rsi           # done once outside loop
    mov    $BUFSIZE, %edx
    syscall                     # size = read(fd, buf, BUFSIZE)

    test   %eax, %eax
    jg     read_and_write    # }while(read_size > 0);   // until EOF or error
# any negative can be assumed to be an error, since we pass a size smaller than INT_MAX

cleanup:
# fd might be stdin which we don't want to close.
# just exit and let kernel take care of it, or check for fd==0
#    movl $__NR_close, %eax
#    movl %ebx, %edi 
#    syscall          # close (fd)  // return value ignored

exit:
    mov  %eax, %edi             # exit status = last syscall return value. read() = 0 means EOF, success.
    mov  $__NR_exit_group, %eax
    syscall                     # exit_group(status);
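
The # TODO in the listing above mentions handling short writes by calling write again for the unwritten part of the buffer. As a rough illustration of that retry loop, here is a Python sketch; write_all is a hypothetical helper name, and advancing the slice plays the role of add %rax, %rsi.

```python
import os

def write_all(fd: int, data: bytes) -> int:
    """Keep calling write() until every byte of data has been accepted by fd."""
    view = memoryview(data)   # slicing a memoryview avoids copying the remainder
    total = 0
    while total < len(view):
        # os.write returns how many bytes were accepted, possibly fewer than
        # offered; the next call starts from the first unwritten byte.
        total += os.write(fd, view[total:])
    return total
```

Errors still surface as exceptions here (OSError), which corresponds to the separate error check the TODO also asks for.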

For instruction counts, perf stat --all-user ./asmcat /tmp/random > /dev/null shows it runs about 47 instructions in user-space, vs. 57 for yours. (IIRC, perf over-counts by 1, so I've subtracted that from the measured result.) And that's with more error-checking, e.g. for short writes.

This is only 84 bytes of machine code in the .text section (vs. 174 bytes for your original), and I didn't optimize for size over speed with stuff like lea 1(%rsi), %eax (after zeroing RSI) instead of mov $1, %eax. (Or with mov %eax, %edi to take advantage of __NR_write == STDOUT_FILENO.)

I mostly avoided R8..R15 because they need REX prefixes to access in the machine code.

Testing the error handling:

$ gcc -nostdlib -static asmcat.S -o asmcat            # build
$ cat /tmp/random | strace ./asmcat > /dev/full

execve("./asmcat", ["./asmcat"], 0x7ffde5e369d0 /* 55 vars */) = 0
read(0, "=head1 NAME\n\n=for comment  Gener"..., 65536) = 65536
write(1, "=head1 NAME\n\n=for comment  Gener"..., 65536) = -1 ENOSPC (No space left on device)
exit_group(-28)                         = ?
+++ exited with 228 +++

$ strace ./asmcat <&-      # close stdin
execve("./asmcat", ["./asmcat"], 0x7ffd0f5048c0 /* 55 vars */) = 0
read(0, 0x7ffc1b3ca000, 65536)          = -1 EBADF (Bad file descriptor)
exit_group(-9)                          = ?
+++ exited with 247 +++

$ strace ./asmcat /noexist
execve("./asmcat", ["./asmcat", "/noexist"], 0x7ffd429f1158 /* 55 vars */) = 0
open("/noexist", O_RDONLY)              = -1 ENOENT (No such file or directory)
read(-2, 0x7ffd4f296000, 65536)         = -1 EBADF (Bad file descriptor)
exit_group(-9)                          = ?
+++ exited with 247 +++

Hmm, should probably test/jl on the fd after open, if you wanted to do error handling.
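
A sketch of that check, written in Python for illustration only: open_or_errno mimics the raw-syscall convention of returning a negative errno, and cat_entry bails out before any read is attempted on a bogus descriptor. Both names are hypothetical. The helper exit_status_byte shows why the traces above report 228 and 247: the parent only sees the low 8 bits of the value passed to exit_group.

```python
import errno
import os

def exit_status_byte(raw_return: int) -> int:
    """Exit status seen by the parent: the low 8 bits of the exit_group argument."""
    return raw_return & 0xFF

def open_or_errno(path: str) -> int:
    """Return a file descriptor, or a negative errno like a raw Linux syscall."""
    try:
        return os.open(path, os.O_RDONLY)
    except OSError as err:
        return -err.errno

def cat_entry(path: str) -> int:
    """Open path and return the exit status a checked version would produce."""
    fd = open_or_errno(path)
    if fd < 0:                       # the test/jl step: never hand a bad fd to read()
        return exit_status_byte(fd)
    os.close(fd)                     # a real cat would copy the data here
    return 0
```

With this check in place, the /noexist case would exit with the status for ENOENT from open itself, rather than EBADF from the read that follows.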
