Hard to debug SEGV due to skipped cmov from out-of-bounds memory

Question

I'm trying to code a few high-performance assembly functions as an exercise, and have encountered a weird segfault that happens when running the program, but not in valgrind or nemiver.

Basically a cmov that shouldn't be run, with an out-of-bound address, makes me segfault even if the condition is always false

I have a fast and a slow version. The slow one works all the time. The fast one works, unless it receives a non-ascii char, at which point it crashes horribly, unless I'm running it under gdb or nemiver.

ascii_flags is simply a 128 bytes array (with a bit of room at the end) containing flags on all ASCII characters (alpha, numeric, printable, etc.)

这行得通:

ft_isprint:
    xor EAX, EAX                ; empty EAX
    test EDI, ~127              ; check for non-ascii (>127) input
    jnz .error
    mov EAX, [rel ascii_flags + EDI]    ; load ascii table if input fits
    and EAX, 0b00001000         ; get specific bit
.error:
    ret

但这不是:

ft_isprint:
    xor EAX, EAX                ; empty EAX
    test EDI, ~127              ; check for non-ascii (>127) input
    cmovz EAX, [rel ascii_flags + EDI]  ; load ascii table if input fits
    and EAX, flag_print         ; get specific bit
    ret

Valgrind does actually crash, but with no other information than memory addresses, since I've not managed to get more debugging information.

I've written three versions of the functions to take in account the wonderful answers:

ft_isprint:
    mov RAX, 128                            ; load default index
    test RDI, ~127                          ; check for non-ascii (>127) input
    cmovz RAX, RDI                          ; if none are found, load correct index
    mov AL, byte [ascii_flags + RAX]        ; dereference index into least sig. byte
    and RAX, flag_print                     ; get specific bit (and zeros rest of RAX)
    ret

ft_isprint_branch:
    test RDI, ~127                          ; check for non-ascii (>127) input
    jnz .out_of_bounds                      ; if non-ascii, jump to error handling
    mov AL, byte [ascii_flags + RDI]        ; dereference index into least sig. byte
    and RAX, flag_print                     ; get specific bit (and zeros rest of RAX)
    ret
.out_of_bounds:
    xor RAX, RAX                            ; zeros return value
    ret

ft_isprint_compact:
    xor RAX, RAX                            ; zeros return value preemptively
    test RDI, ~127                          ; check for non-ascii (>127) input
    jnz .out_of_bounds                      ; if non-ascii was found, skip the dereference
    mov AL, byte [ascii_flags + RDI]        ; dereference index into least sig. byte
    and RAX, flag_print                     ; get specific bit
.out_of_bounds:
    ret

After extensive testing, the branching functions are definitely faster than the cmov function, by about 5-15% on all types of data. The difference between the compact and non-compact versions is, as expected, minimal. Compact is ever so slightly faster on a predictable data set, while non-compact is just as slightly faster on unpredictable data.

I tried various different ways to skip the 'xor EAX, EAX' instruction, but couldn't find any that works.

after more testing, I've updated the code to three new versions:

ft_isprint_compact:
    sub EDI, 32                             ; subtract 32 from input, to overflow any value < ' '
    xor EAX, EAX                            ; set return value to 0
    cmp EDI, 94                             ; check if input <= '~' - 32
    setbe AL                                ; if so, set return value to 1
    ret

ft_isprint_branch:
    xor EAX, EAX                            ; set return value to 0
    cmp EDI, 127                            ; check for non-ascii (>127) input
    ja .out_of_bounds                       ; if non-ascii was found, skip the dereference
    mov AL, byte [rel ascii_flags + EDI]    ; dereference index into least sig. byte
.out_of_bounds:
    ret

ft_isprint:
    mov EAX, 128                            ; load default index
    cmp EDI, EAX                            ; check if ascii
    cmovae EDI, EAX                         ; replace with 128 if outside 0..127
                                            ; cmov also zero-extends EDI into RDI
;   movzx EAX, byte [ascii_flags + RDI]     ; alternative to the following two instructions if masking is removed
    mov AL, byte [ascii_flags + RDI]        ; load table entry
    and EAX, flag_print                     ; apply mask to get correct bit and zero rest of EAX
    ret
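The range-check trick in ft_isprint_compact can be sketched in C (a sketch under the assumption that "printable" means ' ' through '~', as the 32/94 constants suggest):

```c
#include <assert.h>

/* C sketch of the no-table range check: subtracting 32 makes any input
   below ' ' wrap around to a huge unsigned value, so a single unsigned
   compare covers both bounds at once. */
static int ft_isprint_notable(unsigned int c) {
    return (c - 32u) <= (unsigned)('~' - ' ');   /* '~' - ' ' == 94 */
}
```

Compilers turn this into the same sub/cmp/setbe sequence as the assembly above.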

The performance numbers are as follows, in microseconds. The 1-2-3 shows the order of execution, to avoid a caching advantage:

-O3 a.out
1 cond 153185, 2 branch 238341 3 no_table 145436
1 cond 148928, 3 branch 248954 2 no_table 116629
2 cond 149599, 1 branch 226222 3 no_table 117428
2 cond 117258, 3 branch 241118 1 no_table 147053
3 cond 117635, 1 branch 228209 2 no_table 147263
3 cond 146212, 2 branch 220900 1 no_table 147377
-O3 main.c
1 cond 132964, 2 branch 157963 3 no_table 131826
1 cond 133697, 3 branch 159629 2 no_table 105961
2 cond 133825, 1 branch 139360 3 no_table 108185
2 cond 113039, 3 branch 162261 1 no_table 142454
3 cond 106407, 1 branch 133979 2 no_table 137602
3 cond 134306, 2 branch 148205 1 no_table 141934
-O0 a.out
1 cond 255904, 2 branch 320505 3 no_table 257241
1 cond 262288, 3 branch 325310 2 no_table 249576
2 cond 247948, 1 branch 340220 3 no_table 250163
2 cond 256020, 3 branch 415632 1 no_table 256492
3 cond 250690, 1 branch 316983 2 no_table 257726
3 cond 249331, 2 branch 325226 1 no_table 250227
-O0 main.c
1 cond 225019, 2 branch 224297 3 no_table 229554
1 cond 235607, 3 branch 199806 2 no_table 226286
2 cond 226739, 1 branch 210179 3 no_table 238690
2 cond 237532, 3 branch 223877 1 no_table 234103
3 cond 225485, 1 branch 201246 2 no_table 230591
3 cond 228824, 2 branch 202015 1 no_table 226788

The no-table version is about as fast as the cmov one, but doesn't allow for easily implementable locals. The branching algorithm is worse except on predictable data at zero optimization? I've got no explanation for that.

I'll keep the cmov version, which is both the most elegant and easily updatable. Thanks for all the help.

Answer

cmov is an ALU select operation that always reads both sources before checking the condition. Using a memory source doesn't change this. It's not like an ARM predicated instruction that acts like a NOP if the condition was false. cmovz eax, [mem] also unconditionally writes EAX, zero-extending into RAX regardless of the condition.

As far as the most of the CPU is concerned (the out-of-order scheduler and so on), cmovcc reg, [mem] is handled exactly like adc reg, [mem]: a 3-input 1-output ALU instruction. (adc writes flags, unlike cmov, but nevermind that.) The micro-fused memory source operand is a separate uop that just happens to be part of the same x86 instruction. This is how the ISA rules for it work, too.
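As a rough C analogy (hypothetical model, not real hardware semantics), the memory operand is loaded unconditionally and only the final select depends on the flags:

```c
#include <assert.h>

/* Hypothetical C model of `cmovz eax, [src]`: the load executes before
   the condition is consulted, so a bad `src` faults even when the
   condition is false and the old value would have been kept. */
static int cmovz_model(int zero_flag_set, const int *src, int old_eax) {
    int loaded = *src;                        /* unconditional load uop */
    return zero_flag_set ? loaded : old_eax;  /* ALU select */
}
```

This is why guarding a cmov with an always-false condition does not make an out-of-bounds address safe.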

Really, cmovz here acts as a select, not a conditional load.

x86's only conditional loads (that don't fault on bad addresses, just potentially run slowly) are:

  • Normal loads protected by conditional branches. Branch mis-prediction or other mis-speculations leading to running a faulting load are handled fairly efficiently (maybe starting a page walk, but once the mis-speculation is identified, execution of the correct flow of instructions doesn't have to wait for any memory operation started by speculative execution).

If there was a TLB hit on a page you can't read, then not much more happens until a faulting load reaches retirement (known to be non-speculative and thus actually taking a #PF page-fault exception which is unavoidably going to be slow). On some CPUs, this fast handling leads to the Meltdown attack. >.< See http://blog.stuffedcow.net/2018/05/meltdown-microarchitecture/.

  • rep lodsd with RCX=0 or 1. (Not fast or efficient, but microcode branches are special and can't benefit from branch prediction, on Intel CPUs. See What setup does REP do?. Andy Glew mentions microcode branch mispredictions, but I think those are different from normal branch misses because there seems to be a fixed cost.)

  • AVX2 vpmaskmovd/q / AVX1 vmaskmovps/pd. Faults are suppressed for elements where the mask is 0. (A mask-load with an all-0 mask, even from a legal address, requires a ~200 cycle microcode assist with a base+index addressing mode.) See section 12.9 CONDITIONAL SIMD PACKED LOADS AND STORES and Table C-8 in Intel's optimization manual. (On Skylake, stores to an illegal address with an all-zero mask also need an assist.)

The earlier MMX/SSE2 maskmovdqu is store-only (and has an NT hint). Only the similar AVX instruction with dword/qword (instead of byte) elements has a load form.

  • AVX512 masked loads

  • AVX2 gathers with some / all mask elements cleared.

  • Normal loads inside TSX / RTM transactions: a fault aborts the transaction instead of raising a #PF. But you can't count on a bad index faulting instead of just reading bogus data from somewhere nearby, so it's not really a conditional load. It's also not super fast.

... and maybe others I'm forgetting.

An alternative might be to cmov an address that you use unconditionally, selecting which address to load from. e.g. if you had a 0 to load from somewhere else, that would work. But then you'd have to calculate the table indexing in a register, not using an addressing mode, so you could cmov the final address.

Or just CMOV the index and pad the table with some zero bytes at the end so you can load from table + 128.
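A C sketch of that padded-table approach (the 128-byte table layout and the flag bit are assumptions taken from the question):

```c
#include <assert.h>

#define FLAG_PRINT 0x08u   /* assumed bit position, as in the question */

/* 128 real entries plus a zeroed dummy at index 128 for non-ASCII input */
static unsigned char ascii_flags[129];

/* Clamp the index instead of branching around the load; compilers emit
   cmp/cmov for the ternary, and index 128 is always in bounds. */
static unsigned ft_isprint_cmov(unsigned int c) {
    unsigned int idx = (c <= 127) ? c : 128;
    return ascii_flags[idx] & FLAG_PRINT;
}
```

The load is now unconditional but always legal, which is exactly what cmov needs.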

Or use a branch, it will probably predict well for a lot of cases. But maybe not for languages like French where you'll find a mix of low-128 and higher Unicode code-points in common text.

Note that [rel] only works when there's no register (other than RIP) involved in the addressing mode. RIP-relative addressing replaces one of the 2 redundant ways (in 32-bit code) to encode a [disp32]. It uses the shorter non-SIB encoding, while a ModRM+SIB can still encode an absolute [disp32] with no registers. (Useful for addresses like [fs: 16] for small offsets relative to thread-local storage with segment bases.)

If you just want to use RIP-relative addressing when possible, use default rel at the top of your file. [symbol] will be RIP-relative, but [symbol + rax] won't. Unfortunately, NASM and YASM default to default abs.

[reg + disp32] is a very efficient way to index static data in position-dependent code, just don't fool yourself into thinking that it can be RIP-relative. See 32-bit absolute addresses no longer allowed in x86-64 Linux?.

[rel ascii_flags + EDI] is also weird because you're using a 32-bit register in an addressing mode in x86-64 code. There's usually no reason to spend an address-size prefix to truncate addresses to 32-bit.

However, in this case if your table is in the low 32-bits of virtual address space, and your function arg is only specified as 32 bits (so the caller is allowed to leave garbage in the upper 32 of RDI), it is actually a win to use [disp32 + edi] instead of a mov esi,edi or something to zero-extend. If you're doing that on purpose, definitely comment why you're using a 32-bit addressing mode.

But in this case, using a cmov on the index will zero-extend to 64-bit for you.

It's also weird to use a DWORD load from a table of bytes. You'll occasionally cross a cache-line boundary and suffer extra latency.

@fuz showed a version using a RIP-relative LEA and a CMOV on the index.

In position-dependent code where 32-bit absolute addresses are ok, by all means use them to save instructions. [disp32] addressing modes are worse than RIP-relative (1 byte longer), but [reg + disp32] addressing modes are perfectly fine when position-dependent code and 32-bit absolute addresses are ok (e.g. x86-64 Linux, but not OS X, where executables are always mapped outside the low 32 bits). Just be aware that it's not rel.

; position-dependent version taking advantage of 32-bit absolute [reg + disp32] addressing
; not usable in shared libraries, only non-PIE executables.
ft_isprint:
    mov     eax, 128               ; offset of dummy entry for "not ASCII"
    cmp     edi, eax               ; check if ascii
    cmovae  edi, eax               ; replace with 128 if outside 0..127
              ; cmov also zero-extends EDI into RDI
    movzx   eax, byte [ascii_flags + rdi] ; load table entry
    and     al, flag_print         ; mask the desired flag
      ; if the caller is only going to read / test AL anyway, might as well save bytes here
    ret

If any existing entry in your table has the same flags you want for high inputs, e.g. maybe entry 0 which you'll never see in implicit-length strings, you could still xor-zero EAX and keep your tables at 128 bytes, not 129.
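That variant in C (a sketch assuming entry 0 is NUL with the print flag clear, and the same assumed flag bit as before):

```c
#include <assert.h>

#define FLAG_PRINT 0x08u   /* assumed bit position, as in the question */

static unsigned char ascii_flags[128];  /* no dummy entry needed */

/* Out-of-range input is redirected to entry 0 (NUL); its print flag is
   clear anyway, so non-ASCII input still returns 0. */
static unsigned ft_isprint_reuse0(unsigned int c) {
    unsigned int idx = (c <= 127) ? c : 0;
    return ascii_flags[idx] & FLAG_PRINT;
}
```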

test r32, imm32 takes more code bytes than you need. ~127 = 0xFFFFFF80 would fit in a sign-extended byte, but there is no TEST r/m32, sign-extended-imm8 encoding. There is such an encoding for cmp, though, like essentially all other immediate instructions.

You could instead check for unsigned above 127, with cmp edi, 127 / cmovbe eax, edi or cmova edi, eax. This saves 3 bytes of code-size. Or we can save 4 bytes by using cmp reg,reg using the 128 we used for a table index.

A range-check before the array index is also more intuitive to most people than checking the high bits.

and al, imm8 is only 2 bytes, vs. 3 bytes for and r/m32, sign-extended-imm8. It's not slower on any CPUs, as long as the caller only reads AL. On Intel CPUs before Sandybridge, reading EAX after ANDing into AL could cause a partial-register stall / slowdown. Sandybridge doesn't rename partial registers for read-modify-write operations, if I recall correctly, and IvB and later don't rename low8 partial regs at all.

You might also use mov al, [table] instead of movzx to save another code byte. An earlier mov eax, 128 already broke any false dependency on the old value of EAX so it shouldn't have a performance downside. But movzx is not a bad idea.

When all else is equal, smaller code-size is almost always better (for instruction-cache footprint, and sometimes for packing into the uop cache). If it cost any extra uops or introduced any false dependencies, it wouldn't be worth it when optimizing for speed, though.
