Only pass if-statement once

Problem description

I am currently building a kernel, and have an if-statement that could (at worst) run a few million times. Yet, the result is clear after the first run. Knowing that the result of cmp is stored in a register, is there a way of remembering the result of the above statement so it doesn't have to be re-run every time? acpi_version is GUARANTEED to never change.

SDT::generic_sdt* sdt_wrapper::get_table (size_t index) { //function is run many times with varying index
    if (index >= number_tables) { //no such index
        return NULL;
    }

    if (acpi_version == 0) [[unlikely]] {
        return (SDT::generic_sdt*) (rsdt_ptr -> PointerToOtherSDT [index]);
    } else [[likely]] {
        return (SDT::generic_sdt*) (xsdt_ptr -> PointerToOtherSDT [index]);
    }
}

As you can see, there is (at least to me) no apparent way of getting around that statement.

An attempt I tried was using the following ASM-"HACK":

static inline uint32_t store_cmp_result (uint32_t value1, uint32_t value2) {
    uint32_t zf;
    asm volatile ( "cmp %0, %1" :: "a" (value1), "a" (value2) );
    asm volatile ( "mov %ZF, [WORD]%0" :: "a" (zf) );   
    return zf;
} 

static inline void prestored_condition (uint32_t pres, void (*true_func)(), void (*false_func) ()) {
    asm volatile ( "mov %0, %1" :: "a" (pres)  "Nd=" () );
    asm volatile ( "je %0" :: "a" (&true_func) );
    asm volatile ( "jne %0" :: "a" (&false_func) );
}

Yet, it's just a hackish solution (that didn't really work, so I discarded it).

So what I'm asking is:

How can I skip the if-statement after it has run once, and just use the output from the last time?

Answer

A compiler-generated cmp/jcc on acpi_version == 0 is about as cheap as you're going to get in the general case. It should predict very well because the branch always goes the same way every time, and the cost of a branch itself is quite low, even if it's taken every time. (Taken branches have slightly more cost than not-taken branches, because of their effect on the front-end fetch/decode stages and because it breaks up the used parts of the code across more I-cache lines.)

CPUs can't store compare results in any special way that's faster than testing an integer for non-zero. (i.e. a zero / non-zero integer already is how you would store a compare result! See footnote 1.)
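
A minimal sketch (types and helper names assumed) of why "remembering" the result buys nothing for a plain integer: the cached bool is itself just a zero / non-zero byte, and branching on it costs the same compare-against-zero plus conditional jump as the original check:

extern uint32_t acpi_version;   // the invariant from the question
static bool use_xsdt;           // "remembered" result: still just a 0 / non-0 byte

void cache_acpi_check() { use_xsdt = (acpi_version != 0); }

void dispatch(void (*rsdt_path)(), void (*xsdt_path)()) {
    if (use_xsdt)        // compiles to: test/cmp the byte, then a jcc;
        xsdt_path();     // exactly the work `if (acpi_version == 0)` already did
    else
        rsdt_path();
}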

if的两侧非常相似的特定情况下,可能有节省空间,但是易于预测的compare + branch 非常便宜.程序中充满了Compare + Branch,因此现代CPU必须非常擅长运行它们.正常程序中指令计数的10%到25%之类的内容是compare&分支,包括无条件跳转(不要用确切的数字引用我,我已经听说过,但是无法通过快速搜索找到可靠的来源).更多的分支会占用更多的分支预测资源,或者平均而言会恶化其他分支的预测,但这也很小.

In the specific case where the two sides of your if are very similar, there might be room for savings, but an easily-predicted compare+branch is very cheap. Programs are full of compare+branch, so modern CPUs have to be very good at running them. Something like 10 to 25% of the instruction count in a normal program is compare & branch, including unconditional jumps (don't quote me on the exact number; I've heard this but couldn't find a reliable source with a quick search). More branches take up slightly more branch-prediction resources, or on average worsen the prediction of other branches, but this is a small effect, too.

The compiler will already hoist the check out of loops after inlining. (By far the most important thing you can do here is make sure small accessor functions like sdt_wrapper::get_table can inline, either by putting them in the .h or by using link-time optimization.) Inline asm can only make it worse (http://gcc.gnu.org/wiki/DontUseInlineAsm) unless you do some super-hack like putting a label in the asm code so you can modify it later.
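
For instance, with get_table visible to the optimizer, a hypothetical caller like this lets the compiler test acpi_version once and hoist it, so the loop body itself contains no check:

// process_table is a placeholder for whatever you do with each table
void process_all_tables(void (*process_table)(SDT::generic_sdt*)) {
    for (size_t i = 0; i < number_tables; ++i)
        process_table(sdt_wrapper::get_table(i));   // version check hoisted out of the loop
}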

If you compare so much that you think it would be worth keeping acpi_version in a fixed global register dedicated only to that (a global register variable, which GNU C++ supports, but which would probably not actually be good (footnote 2) even if you think it might be), then you could instead make the condition a template parameter for all your code (or a static constexpr, or a CPP macro) and build 2 versions of your code: one for true and one for false. When you find out the value of the condition at bootup, unmap and reclaim the pages holding the version that will never run, and jump to the version that will run. (Or for a non-kernel, i.e. a normal program running in user-space under an OS, it's usually not a perf problem to just leave the clean pages mapped, especially if they're untouched, including by runtime relocation.) Something like if(acpi_version == 0) { rest_of_kernel<0>(...); } else { rest_of_kernel<1>(...); }, omitting the unmap / free part; see the sketch below.
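
A minimal sketch of the two-instantiations idea, reusing the names from the question (assumes C++17 if constexpr; with older C++ you'd get the same effect from two specializations and ordinary dead-code elimination):

template <bool UseXsdt>
SDT::generic_sdt* get_table_impl(size_t index) {
    if (index >= number_tables)
        return NULL;
    if constexpr (UseXsdt)
        return (SDT::generic_sdt*) (xsdt_ptr->PointerToOtherSDT[index]);
    else
        return (SDT::generic_sdt*) (rsdt_ptr->PointerToOtherSDT[index]);
}

template <bool UseXsdt>
void rest_of_kernel();   // everything that calls get_table_impl<UseXsdt>()

void after_acpi_detection() {
    if (acpi_version == 0)
        rest_of_kernel<false>();   // this instantiation contains no version checks at all
    else
        rest_of_kernel<true>();
}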

If rsdt_ptr and xsdt_ptr are invariant, you could at least eliminate that extra level of indirection, if both PointerToOtherSDT arrays (PTOS for short) are in static storage.
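
A sketch under that assumption (helper name invented here): cache the array base once during ACPI bringup, so each lookup does one load instead of two. Note this still leaves the 32- vs. 64-bit entry-size difference, which is the harder part discussed below.

static const uint32_t *ptos32;   // base of rsdt_ptr->PointerToOtherSDT (ACPI 1.0)
static const uint64_t *ptos64;   // base of xsdt_ptr->PointerToOtherSDT (ACPI 2.0+)

void cache_ptos_base() {
    if (acpi_version == 0)
        ptos32 = rsdt_ptr->PointerToOtherSDT;
    else
        ptos64 = xsdt_ptr->PointerToOtherSDT;
}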

You didn't tag an architecture, but your (broken in too many ways to mention) asm appears to be x86 so I'll talk about that. (All modern x86 CPUs have out-of-order execution and very good branch prediction, so there's probably not much to gain there, though.)

The Linux kernel does this, but it's complex: e.g. something like .pushsection list_of_addresses_to_patch; .quad .Lthis_instance%= ; .popsection to build an array of pointers (as a special linker section) to the places that need to be patched, everywhere the asm statement was inlined. One place this trick is used is in patching lock prefixes to nop on uniprocessor machines running a kernel compiled with SMP support. This patching happens once, at bootup. (And it may even be able to patch the lock prefixes back in before hot-adding a CPU, because the mutex counters are still being maintained.)

In fact, Linux even uses asm goto and patches between a jmp or nop for invariant conditions like yours that are determined once at bootup, in bool _static_cpu_has(u16 bit) in arch/x86/include/asm/cpufeature.h. To start with, there's a jmp to a block that does a normal runtime check, testing a bit. But it uses .section .altinstructions,"a" / .previous to record where each jmp is, and the patch length / location. It appears cleverly constructed to work with 2-byte short vs. 5-byte long jmp rel8 / jmp rel32 jumps. So the kernel can patch all the locations where this code ends up, replacing the jmp with either a jmp to the right spot or a nop to fall through to the t_yes: return true label. gcc compiles this fairly nicely when you write if(_static_cpu_has(constant)) { ... }. After patching at some point after doing CPU feature detection, you end up with just a NOP and then fall into the loop body. (Or possibly multiple short-NOP instructions, I didn't check, but hopefully not!)

This is pretty damn cool, so I'm just going to copy the code because it's fun to see such a creative use of inline asm. I haven't looked for the code that does the patching, but obviously that + the linker script are the other key parts of this. I'm not trying to provide a workable version of it for this case, just show that the technique is possible, and where to find a GPLv2 implementation of it you could copy.

// from Linux 4.16 arch/x86/include/asm/cpufeature.h

/*
 * Static testing of CPU features.  Used the same as boot_cpu_has().
 * These will statically patch the target code for additional
 * performance.
 */
static __always_inline __pure bool _static_cpu_has(u16 bit)
{
    asm_volatile_goto("1: jmp 6f\n"
         "2:\n"
         ".skip -(((5f-4f) - (2b-1b)) > 0) * "
             "((5f-4f) - (2b-1b)),0x90\n"
         "3:\n"
         ".section .altinstructions,\"a\"\n"
         " .long 1b - .\n"      /* src offset */
         " .long 4f - .\n"      /* repl offset */
         " .word %P[always]\n"      /* always replace */
         " .byte 3b - 1b\n"     /* src len */
         " .byte 5f - 4f\n"     /* repl len */
         " .byte 3b - 2b\n"     /* pad len */
         ".previous\n"
         ".section .altinstr_replacement,\"ax\"\n"
         "4: jmp %l[t_no]\n"
         "5:\n"
         ".previous\n"
         ".section .altinstructions,\"a\"\n"
         " .long 1b - .\n"      /* src offset */
         " .long 0\n"           /* no replacement */
         " .word %P[feature]\n"     /* feature bit */
         " .byte 3b - 1b\n"     /* src len */
         " .byte 0\n"           /* repl len */
         " .byte 0\n"           /* pad len */
         ".previous\n"
         ".section .altinstr_aux,\"ax\"\n"
         "6:\n"
         " testb %[bitnum],%[cap_byte]\n"
         " jnz %l[t_yes]\n"
         " jmp %l[t_no]\n"
         ".previous\n"
         : : [feature]  "i" (bit),
             [always]   "i" (X86_FEATURE_ALWAYS),
             [bitnum]   "i" (1 << (bit & 7)),
             [cap_byte] "m" (((const char *)boot_cpu_data.x86_capability)[bit >> 3])
         : : t_yes, t_no);
t_yes:
    return true;
t_no:
    return false;
}
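
Call sites then look like ordinary branches. For example (the kernel wraps _static_cpu_has in a static_cpu_has() macro, and X86_FEATURE_ERMS is a real feature bit, but the body here is illustrative):

if (static_cpu_has(X86_FEATURE_ERMS)) {
    /* fast path: after patching, execution falls straight through
       to here with at most a NOP left in the way */
} else {
    /* fallback path, reached via the patched-in jmp */
}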


Runtime binary patching for your specific case

In your specific case, the difference between your two versions is which global(?) pointer you're dereferencing, and the type of PTOS. Storing a pointer to the base of the right array (as a void* or char*) is easy with pure C++, but indexing differently is tricky. In your case, it's an array of uint32_t or uint64_t, as a flexible array member at the end of a struct. (Actually uint32_t PTOS[1], because ISO C++ doesn't support flexible array members, but if you're going to use GNU extensions anyway, a proper flexible array member like uint32_t PTOS[] is probably a good idea.)
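
For example, a minimal sketch of the two layouts (the header type and everything besides PointerToOtherSDT are assumptions in the style of the usual ACPI definitions):

struct RSDT {
    ACPISDTHeader header;
    uint32_t PointerToOtherSDT[];   // 32-bit physical addresses (GNU flexible array member)
};

struct XSDT {
    ACPISDTHeader header;
    uint64_t PointerToOtherSDT[];   // 64-bit physical addresses
};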

On x86-64, changing the scale factor in an indexed addressing mode from 4 to 8 would do the trick, because a 64-bit load vs. a zero-extending 32-bit load uses the same opcode, just REX.W=0 (or no REX prefix) vs. REX.W=1 for the operand size. .byte 0x40; mov eax, [rdx + rdi*4] has the same length as mov rax, [rdx + rdi*8]. (The 0x40 byte in the first is a REX prefix with all its bits clear; the second version needs REX.W=1 for 64-bit operand size, while the first zero-extends into RAX by writing EAX. If the first version had already needed a REX prefix for a register like r10, both would have one anyway.) Patching one into the other would be easy if you knew where all the relevant instructions were located.
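
Concretely (encodings hand-assembled for illustration; worth double-checking against your assembler), both forms come to 4 bytes, so either can be patched over the other in place:

    db 0x40                   ; dummy REX prefix with no bits set
    mov eax, [rdx + rdi*4]    ; 8B 04 BA: 32-bit load, zero-extends into RAX
                              ; (with the db: 40 8B 04 BA, 4 bytes total)
    mov rax, [rdx + rdi*8]    ; 48 8B 04 FA: REX.W=1, 64-bit load, scale = 8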

If you had an infrastructure for recording places to patch, you'd use it to patch a mov instruction that takes a table pointer and index in registers and returns a 64-bit value (from a 32- or 64-bit load). (And don't forget a dummy input to tell the compiler you actually read the memory pointed to by the table pointer; otherwise the compiler is allowed to do optimizations that could break your code, like moving stores across the asm statement.) But you'd have to be careful: inline asm can hurt optimization by disabling constant propagation (e.g. for index). At least if you omit volatile, the compiler is allowed to treat the asm as a pure function of its inputs and CSE it.
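
A hedged sketch of what such a patchable load might look like (the section name .patch_locs and the whole recording scheme are invented for illustration; the cast in the dummy "m" operand is the documented GNU C idiom for "this asm reads arbitrary memory through this pointer"):

static inline uint64_t patchable_load(const void *table, size_t index) {
    uint64_t entry;
    asm ("1: mov (%[tbl],%[idx],8), %[out]\n\t"  // boot code would patch scale / operand size
         ".pushsection .patch_locs,\"a\"\n\t"    // record this instruction's address
         ".quad 1b\n\t"
         ".popsection"
         : [out] "=r" (entry)
         : [tbl] "r" (table), [idx] "r" (index),
           "m" (*(const char (*)[]) table));     // dummy input: the asm really reads *table
    return entry;
}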

On x86, the scale factor in an addressing mode has to be encoded into the instruction. Even with a runtime-invariant value, you (or the compiler) would still need a variable-count shift, or a multiply, to pull this off without self-modifying code (which compilers won't emit).

A variable-count shift costs 3 uops on Intel Sandybridge-family CPUs (http://agner.org/optimize/) because of legacy CISC semantics: count=0 leaves EFLAGS unmodified, so EFLAGS is an input to variable-count shifts, unless you let the compiler use BMI2 shlx (flagless shifts). index += foo ? 0 : index would conditionally double index (a difference of one in the shift count), but doing that branchlessly is not worth it on x86 for a condition that predicts this well.

Using a variable-count shift instead of a scaled-index addressing mode would probably cost more than the well-predicted conditional branch.

uint64_t vs. uint32_t without runtime patching is another problem: one version needs to do a zero-extending 32-bit load, and the other needs to do a 64-bit load (unless the upper bytes happen to always be zero for you?). We could always do a 64-bit load and then mask to keep or discard the upper 32 bits, but that needs another constant. And it can suffer a performance penalty if the load crosses a cache-line (or worse, page) boundary: e.g. if the 32-bit value were the last one in a page, a normal 32-bit load would just load it, but a 64-bit load + mask would need data from the next page as well.

But with both of these things taken together, it's really not worth it. Just for fun, here's what you could do: source + asm output on the Godbolt compiler explorer

// I'm assuming rsdt_ptr and xsdt_ptr are invariants, for simplicity
static const char *selected_PTOS;
static uint64_t opsize_mask;   // 0x00000000FFFFFFFF or all-ones
static unsigned idx_scale;     // 2 or 3
// set the above when the value for acpi_version is found

void init_acpi_ver(int acpi_version) {
    ... set the static vars;
}

// branchless but slower than branching on a very-predictable condition!
SDT::generic_sdt* sdt_wrapper::get_table (size_t index)
{
    const char *addr = selected_PTOS + (index << idx_scale);
    uint64_t entry = *reinterpret_cast<const uint64_t*>(addr);
    entry &= opsize_mask;      // zero-extend if needed
    return reinterpret_cast<SDT::generic_sdt*>(entry);
}

Asm output from Godbolt (with simpler types so it actually compiles)

get_table(unsigned long):
    mov     ecx, DWORD PTR idx_scale[rip]
    mov     rax, QWORD PTR selected_PTOS[rip]  # the table
    sal     rdi, cl
    mov     rax, QWORD PTR [rax+rdi]        # load the actual data we want
    and     rax, QWORD PTR opsize_mask[rip]
    ret

With inlining and CSE, the compiler could keep some of those mask and shift-count values in registers, but that's still extra work (and ties up registers).

And BTW, do not make the static vars locals inside a function; that would force the compiler to check every time whether this is the first time the function has executed. The fast path for a static local (all that runs once the dust has settled from the init code) is pretty cheap, but it costs about the same as what you're trying to avoid: a branch on an integer being non-zero!

int ext();   // some initializer that has to run at run-time

int static_local_example() {
    static int x = ext();
    return x;
}

    # gcc7.3
    movzx   eax, BYTE PTR guard variable for static_local_example()::x[rip]
    test    al, al
    je      .L11
    # x86 loads are always acquire-loads, other ISAs would need a barrier after loading the guard
    mov     eax, DWORD PTR static_local_example()::x[rip]
    ret

A static function pointer (at class or file scope, not inside a function) is worth considering, but replacing a conditional branch with an unconditional indirect call is unlikely to be a win, and then you have function-call overhead (clobbered registers, arg passing). Compilers would normally try to devirtualize it back into a conditional branch as an optimization!
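
For completeness, a sketch of what that would look like with the question's names (probably a loss, for the reasons above):

static SDT::generic_sdt* get_rsdt_entry(size_t i) {
    return (SDT::generic_sdt*) (rsdt_ptr->PointerToOtherSDT[i]);
}
static SDT::generic_sdt* get_xsdt_entry(size_t i) {
    return (SDT::generic_sdt*) (xsdt_ptr->PointerToOtherSDT[i]);
}

static SDT::generic_sdt* (*get_entry)(size_t);   // set once at boot

void init_dispatch() {
    get_entry = (acpi_version == 0) ? get_rsdt_entry : get_xsdt_entry;
}
// every call site becomes get_entry(index): an unconditional indirect call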

Footnote 1: If your condition had been acpi_version == 4, MIPS could have saved one instruction by storing a 0 / 1 result. Instead of comparing into flags, it has compare-into-register instructions, branch instructions that compare a register against zero or another register, and a register that always reads as zero. Even on x86, comparing for zero / non-zero saves a byte of code size if the value is already in a register (test eax,eax vs. cmp eax,4). It would save more if the value were the result of an ALU instruction (so ZF would already be set), but it's not.

But most other architectures compare into flags, and you can't load from memory directly into flags. So you'd only want to store a static bool result if acpi_version were expensive to compare, like an integer wider than a register, such as __int128, or int64_t on a 32-bit machine.
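
A hypothetical illustration: if the version lived in an __int128, caching the comparison once would replace a two-register compare with a single byte test on every later check:

extern unsigned __int128 acpi_version_wide;   // hypothetical wide invariant
static bool acpi_nonzero;                     // cached once at boot

void cache_wide_check() { acpi_nonzero = (acpi_version_wide != 0); }

// Later, `if (acpi_nonzero)` is one byte load + test, instead of OR-ing or
// comparing the two 64-bit halves of acpi_version_wide every time.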

Footnote 2: Don't use a global register variable for acpi_version; that would be silly. If it's used everywhere, then hopefully link-time optimization can do a good job at hoisting the comparison out of things.

Branch prediction + speculative execution means that the CPU doesn't actually have to wait for a load result when you branch on it, and if you read it all the time it will stay hot in L1d cache anyway. (Speculative execution means that control dependencies aren't part of the critical path, assuming correct branch prediction.)

PS: if you made it this far and understood everything, then you should consider using binary patching like Linux does for a few frequently-checked conditions. If not, you probably shouldn't!
