Hard to debug SEGV due to skipped cmov from out-of-bounds memory

Question

I'm trying to code a few high-performance assembly functions as an exercise, and have encountered a weird segfault that happens when running the program, but not in valgrind or nemiver.

Basically, a cmov that shouldn't be run, with an out-of-bounds address, makes me segfault even though the condition is always false.

I have a fast and a slow version. The slow one works all the time. The fast one works, unless it receives a non-ascii char, at which point it crashes horribly, unless I'm running under gdb or nemiver.

ascii_flags is simply a 128-byte array (with a bit of room at the end) containing flags for all ASCII characters (alpha, numeric, printable, etc.)
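
For reference, a minimal C sketch of what such a table could look like. Only the printable bit's position (bit 3, the 0b00001000 mask in the asm below) comes from the post; the other flag names and values are assumptions for illustration:

```c
#include <stdint.h>

/* Hypothetical flag bits; only FLAG_PRINT's position (bit 3) matches the
 * asm's 0b00001000 mask -- the others are made up for this sketch. */
enum { FLAG_ALPHA = 1 << 0, FLAG_DIGIT = 1 << 1, FLAG_PRINT = 1 << 3 };

/* 128 entries plus a bit of room at the end, as described. */
static uint8_t ascii_flags[128 + 8];

static void init_ascii_flags(void) {
    for (int c = 0; c < 128; c++) {
        uint8_t f = 0;
        if (c >= 0x20 && c <= 0x7E) f |= FLAG_PRINT;  /* ' ' .. '~' */
        if (c >= '0' && c <= '9')   f |= FLAG_DIGIT;
        if ((c | 0x20) >= 'a' && (c | 0x20) <= 'z') f |= FLAG_ALPHA;
        ascii_flags[c] = f;
    }
}
```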

This works:

ft_isprint:
    xor EAX, EAX                ; empty EAX
    test EDI, ~127              ; check for non-ascii (>127) input
    jnz .error
    mov EAX, [rel ascii_flags + EDI]    ; load ascii table if input fits
    and EAX, 0b00001000         ; get specific bit
.error:
    ret

But this doesn't:

ft_isprint:
    xor EAX, EAX                ; empty EAX
    test EDI, ~127              ; check for non-ascii (>127) input
    cmovz EAX, [rel ascii_flags + EDI]  ; load ascii table if input fits
    and EAX, flag_print         ; get specific bit
    ret

Valgrind does actually crash, but with no other information than memory addresses, since I've not managed to get more debugging information.

I've written three versions of the function to take into account the wonderful answers:

ft_isprint:
    mov RAX, 128                            ; load default index
    test RDI, ~127                          ; check for non-ascii (>127) input
    cmovz RAX, RDI                          ; if none are found, load correct index
    mov AL, byte [ascii_flags + RAX]        ; dereference index into least sig. byte
    and RAX, flag_print                     ; get specific bit (and zeros rest of RAX)
    ret

ft_isprint_branch:
    test RDI, ~127                          ; check for non-ascii (>127) input
    jnz .out_of_bounds                      ; if non-ascii, jump to error handling
    mov AL, byte [ascii_flags + RDI]        ; dereference index into least sig. byte
    and RAX, flag_print                     ; get specific bit (and zeros rest of RAX)
    ret
.out_of_bounds:
    xor RAX, RAX                            ; zeros return value
    ret

ft_isprint_compact:
    xor RAX, RAX                            ; zeros return value preemptively
    test RDI, ~127                          ; check for non-ascii (>127) input
    jnz .out_of_bounds                      ; if non-ascii was found, skip dereferencing
    mov AL, byte [ascii_flags + RDI]        ; dereference index into least sig. byte
    and RAX, flag_print                     ; get specific bit
.out_of_bounds:
    ret

After extensive testing, the branching functions are definitely faster than the cmov function, by about 5-15% on all types of data. The difference between the compact and non-compact versions is, as expected, minimal. The compact version is ever so slightly faster on a predictable data set, while the non-compact one is just as slightly faster on unpredictable data.

I tried various different ways to skip the 'xor EAX, EAX' instruction, but couldn't find any that worked.

After more testing, I've updated the code to three new versions:

ft_isprint_compact:
    sub EDI, 32                             ; subtract 32 from input, to overflow any value < ' '
    xor EAX, EAX                            ; set return value to 0
    cmp EDI, 94                             ; check if input <= '~' - 32
    setbe AL                                ; if so, set return value to 1
    ret
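
In C, this no-table version is a single unsigned compare; a sketch (the function name is mine):

```c
/* Printable ASCII is ' ' (0x20) through '~' (0x7E): 95 characters, so
 * c - 32 lands in 0..94 exactly when c is printable. The unsigned
 * subtraction wraps anything below ' ' around to a huge value, so one
 * compare rejects both ends of the range. */
int isprint_no_table(unsigned int c) {
    return (c - 32u) <= 94u;
}
```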

ft_isprint_branch:
    xor EAX, EAX                            ; set return value to 0
    cmp EDI, 127                            ; check for non-ascii (>127) input
    ja .out_of_bounds                       ; if non-ascii was found, skip dereferencing
    mov AL, byte [rel ascii_flags + EDI]    ; dereference index into least sig. byte
.out_of_bounds:
    ret

ft_isprint:
    mov EAX, 128                            ; load default index
    cmp EDI, EAX                            ; check if ascii
    cmovae EDI, EAX                         ; replace with 128 if outside 0..127
                                            ; cmov also zero-extends EDI into RDI
;   movzx EAX, byte [ascii_flags + RDI]     ; alternative to the two following instructions if masking is removed
    mov AL, byte [ascii_flags + RDI]        ; load table entry
    and EAX, flag_print                     ; apply mask to get correct bit and zero rest of EAX
    ret

The performance numbers are as follows, in microseconds. The 1-2-3 shows the order of execution, to avoid a caching advantage:

-O3 a.out
1 cond 153185, 2 branch 238341 3 no_table 145436
1 cond 148928, 3 branch 248954 2 no_table 116629
2 cond 149599, 1 branch 226222 3 no_table 117428
2 cond 117258, 3 branch 241118 1 no_table 147053
3 cond 117635, 1 branch 228209 2 no_table 147263
3 cond 146212, 2 branch 220900 1 no_table 147377
-O3 main.c
1 cond 132964, 2 branch 157963 3 no_table 131826
1 cond 133697, 3 branch 159629 2 no_table 105961
2 cond 133825, 1 branch 139360 3 no_table 108185
2 cond 113039, 3 branch 162261 1 no_table 142454
3 cond 106407, 1 branch 133979 2 no_table 137602
3 cond 134306, 2 branch 148205 1 no_table 141934
-O0 a.out
1 cond 255904, 2 branch 320505 3 no_table 257241
1 cond 262288, 3 branch 325310 2 no_table 249576
2 cond 247948, 1 branch 340220 3 no_table 250163
2 cond 256020, 3 branch 415632 1 no_table 256492
3 cond 250690, 1 branch 316983 2 no_table 257726
3 cond 249331, 2 branch 325226 1 no_table 250227
-O0 main.c
1 cond 225019, 2 branch 224297 3 no_table 229554
1 cond 235607, 3 branch 199806 2 no_table 226286
2 cond 226739, 1 branch 210179 3 no_table 238690
2 cond 237532, 3 branch 223877 1 no_table 234103
3 cond 225485, 1 branch 201246 2 no_table 230591
3 cond 228824, 2 branch 202015 1 no_table 226788

The no-table version is about as fast as the cmov one, but doesn't allow for easily implementable locales. The branching algorithm is worse except on predictable data with optimization disabled; I have no explanation for that.

I'll keep the cmov version, which is both the most elegant and easily updatable. Thanks for all the help.

Answer

cmov is an ALU select operation that always reads both sources before checking the condition. Using a memory source doesn't change this. It's not like an ARM predicated instruction that acts like a NOP if the condition was false. cmovz eax, [mem] also unconditionally writes EAX, zero-extending into RAX regardless of the condition.

As far as most of the CPU is concerned (the out-of-order scheduler and so on), cmovcc reg, [mem] is handled exactly like adc reg, [mem]: a 3-input, 1-output ALU instruction. (adc writes flags, unlike cmov, but never mind that.) The micro-fused memory source operand is a separate uop that just happens to be part of the same x86 instruction. This is how the ISA rules for it work, too.
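
A C model of those semantics may help (my own sketch, not how any compiler emits cmov): the memory operand is read before the condition is consulted, so the pointer gets dereferenced either way:

```c
/* Models cmovz eax, [src]: the load is unconditional; the zero flag only
 * selects whether the loaded value or the old register value is written
 * back. If src points out of bounds, the dereference (and thus the
 * fault) happens even when zf is false -- exactly the bug in the
 * question's fast version. */
unsigned cmovz_model(int zf, unsigned old_eax, const unsigned *src) {
    unsigned loaded = *src;       /* always executed, like the load uop */
    return zf ? loaded : old_eax; /* ALU select on the flag */
}
```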

In effect, think of cmovz as a selectz.

x86's only conditional loads (that don't fault on bad addresses, just potentially run slowly) are:

  • Normal loads protected by conditional branches. Branch mis-prediction or other mis-speculations leading to running a faulting load are handled fairly efficiently (maybe starting a page walk, but once the mis-speculation is identified, execution of the correct flow of instructions doesn't have to wait for any memory operation started by speculative execution).

If there was a TLB hit on a page you can't read, then not much more happens until a faulting load reaches retirement (known to be non-speculative and thus actually taking a #PF page-fault exception which is unavoidably going to be slow). On some CPUs, this fast handling leads to the Meltdown attack. >.< See http://blog.stuffedcow.net/2018/05/meltdown-microarchitecture/.

  • rep lodsd with RCX=0 or 1. (Not fast or efficient, but microcode branches are special and can't benefit from branch prediction, on Intel CPUs. See What setup does REP do?. Andy Glew mentions microcode branch mispredictions, but I think those are different from normal branch misses because there seems to be a fixed cost.)

  • AVX2 vpmaskmovd/q / AVX1 vmaskmovps/pd. Faults are suppressed for elements where the mask is 0. A mask-load with an all-0 mask, even from a legal address, requires a ~200 cycle microcode assist with a base+index addressing mode. See section 12.9 CONDITIONAL SIMD PACKED LOADS AND STORES and Table C-8 in Intel's optimization manual. (On Skylake, stores to an illegal address with an all-zero mask also need an assist.)

The earlier MMX/SSE2 maskmovdqu is store-only (and has an NT hint). Only the similar AVX instruction with dword/qword (instead of byte) elements has a load form.

  • AVX512 masked loads

  • AVX2 gathers with some / all mask elements cleared.

... and maybe others I'm forgetting. Normal loads inside TSX / RTM transactions: a fault aborts the transaction instead of raising a #PF. But you can't count on a bad index faulting instead of just reading bogus data from somewhere nearby, so it's not really a conditional load. It's also not super fast.

An alternative might be to cmov an address that you use unconditionally, selecting which address to load from. e.g. if you had a 0 to load from somewhere else, that would work. But then you'd have to calculate the table indexing in a register, not using an addressing mode, so you could cmov the final address.

Or just CMOV the index and pad the table with some zero bytes at the end so you can load from table + 128.
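
In C terms, that fix looks like this (a sketch; the table contents here are placeholders, with index 128 serving as the zeroed dummy entry, and FLAG_PRINT mirroring the asm's flag_print mask):

```c
#include <stdint.h>

#define FLAG_PRINT 0x08  /* bit 3, matching the asm's flag_print mask */

/* 129 entries: 128 ASCII slots plus a zeroed dummy at index 128.
 * Only a few entries are filled in for illustration. */
static const uint8_t flags_padded[129] = {
    [' '] = FLAG_PRINT, ['A'] = FLAG_PRINT, ['~'] = FLAG_PRINT,
};

int isprint_clamped(unsigned int c) {
    unsigned idx = c > 127 ? 128 : c;  /* the cmov: clamp, don't skip */
    return flags_padded[idx] & FLAG_PRINT;  /* load is always in bounds */
}
```

The load happens unconditionally, as cmov requires, but out-of-range inputs are steered to the padding byte instead of past the end of the table.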

Or use a branch, it will probably predict well for a lot of cases. But maybe not for languages like French where you'll find a mix of low-128 and higher Unicode code-points in common text.

Note that [rel] only works when there's no register (other than RIP) involved in the addressing mode. RIP-relative addressing replaces one of the 2 redundant ways (in 32-bit code) to encode a [disp32]. It uses the shorter non-SIB encoding, while a ModRM+SIB can still encode an absolute [disp32] with no registers. (Useful for addresses like [fs: 16] for small offsets relative to thread-local storage with segment bases.)

If you just want to use RIP-relative addressing when possible, use default rel at the top of your file. [symbol] will be RIP-relative, but [symbol + rax] won't. Unfortunately, NASM and YASM default to default abs.

[reg + disp32] is a very efficient way to index static data in position-dependent code, just don't fool yourself into thinking that it can be RIP-relative. See 32-bit absolute addresses no longer allowed in x86-64 Linux?.

[rel ascii_flags + EDI] is also weird because you're using a 32-bit register in an addressing mode in x86-64 code. There's usually no reason to spend an address-size prefix to truncate addresses to 32-bit.

However, in this case if your table is in the low 32-bits of virtual address space, and your function arg is only specified as 32 bits (so the caller is allowed to leave garbage in the upper 32 of RDI), it is actually a win to use [disp32 + edi] instead of a mov esi,edi or something to zero-extend. If you're doing that on purpose, definitely comment why you're using a 32-bit addressing mode.

But in this case, using a cmov on the index will zero-extend to 64-bit for you.

It's also weird to use a DWORD load from a table of bytes. You'll occasionally cross a cache-line boundary and suffer extra latency.

@fuz showed a version using a RIP-relative LEA and a CMOV on the index.

In position-dependent code where 32-bit absolute addresses are ok, by all means use them to save instructions. [disp32] addressing modes are worse than RIP-relative (1 byte longer), but [reg + disp32] addressing modes are perfectly fine when position-dependent code and 32-bit absolute addresses are ok. (e.g. x86-64 Linux, but not OS X, where executables are always mapped outside the low 32 bits.) Just be aware that it's not rel.

; position-dependent version taking advantage of 32-bit absolute [reg + disp32] addressing
; not usable in shared libraries, only non-PIE executables.
ft_isprint:
    mov     eax, 128               ; offset of dummy entry for "not ASCII"
    cmp     edi, eax               ; check if ascii
    cmovae  edi, eax               ; replace with 128 if outside 0..127
              ; cmov also zero-extends EDI into RDI
    movzx   eax, byte [ascii_flags + rdi] ; load table entry
    and     al, flag_print         ; mask the desired flag
      ; if the caller is only going to read / test AL anyway, might as well save bytes here
    ret

If any existing entry in your table has the same flags you want for high inputs, e.g. maybe entry 0 which you'll never see in implicit-length strings, you could still xor-zero EAX and keep your tables at 128 bytes, not 129.

test r32, imm32 takes more code bytes than you need. ~127 = 0xFFFFFF80 would fit in a sign-extended byte, but there is no TEST r/m32, sign-extended-imm8 encoding. There is such an encoding for cmp, though, like for essentially all other immediate instructions.

You could instead check for unsigned above 127, with cmp edi, 127 / cmovbe eax, edi or cmova edi, eax. This saves 3 bytes of code size. Or we can save 4 bytes by using cmp reg,reg, reusing the 128 we already loaded for the table index.

To most readers, checking for unsigned-above also reads more obviously as a range check before array indexing than testing the high bits does.

and al, imm8 is only 2 bytes, vs. 3 bytes for and r/m32, sign-extended-imm8. It's not slower on any CPUs, as long as the caller only reads AL. On Intel CPUs before Sandybridge, reading EAX after ANDing into AL could cause a partial-register stall / slowdown. Sandybridge doesn't rename partial registers for read-modify-write operations, if I recall correctly, and IvB and later don't rename low8 partial regs at all.

You might also use mov al, [table] instead of movzx to save another code byte. An earlier mov eax, 128 already broke any false dependency on the old value of EAX so it shouldn't have a performance downside. But movzx is not a bad idea.

When all else is equal, smaller code-size is almost always better (for instruction-cache footprint, and sometimes for packing into the uop cache). If it cost any extra uops or introduced any false dependencies, it wouldn't be worth it when optimizing for speed, though.
