Fast fibers/coroutines under x64 Windows
Question
So I have this coroutine API, extended by me, based on code I found here: https://the8bitpimp.wordpress.com/2014/10/21/coroutines-x64-and-visual-studio/
struct mcontext {
    U64 regs[8];
    U64 stack_pointer;
    U64 return_address;
    U64 coroutine_return_address;
};

struct costate {
    struct mcontext callee;
    struct mcontext caller;
    U32 state;
};

void coprepare(struct costate **token,
               void *stack, U64 stack_size, cofunc_t func); /* C code */
void coenter(struct costate *token, void *arg);             /* ASM code */
void coyield(struct costate *token);                        /* ASM code */
int  coresume(struct costate *token);                       /* ASM code, new */
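For orientation, the intended call pattern of this API resembles a generator. Since the asm halves aren't shown yet, here is a portable analogue of the same control flow built on POSIX ucontext(3), not the asker's API; the names co_resume_int, co_yield_int and counter_func are invented for this sketch:

```c
/* Hypothetical analogue of the coprepare/coresume/coyield pattern using
 * POSIX ucontext(3). This is NOT the asker's implementation - just a
 * portable sketch of the same control flow. */
#include <assert.h>
#include <ucontext.h>

static ucontext_t caller_ctx, coro_ctx;
static int yielded_value;

static void co_yield_int(int v) {         /* plays the role of coyield() */
    yielded_value = v;
    swapcontext(&coro_ctx, &caller_ctx);  /* save coroutine, resume caller */
}

static void counter_func(void) {          /* the coroutine body */
    for (int i = 1; i <= 3; i++)
        co_yield_int(i * 10);
}

int co_resume_int(void) {                 /* plays the role of coresume() */
    swapcontext(&caller_ctx, &coro_ctx);
    return yielded_value;
}

int co_demo_sum(void) {
    static char stack[64 * 1024];
    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp   = stack;
    coro_ctx.uc_stack.ss_size = sizeof stack;
    coro_ctx.uc_link = &caller_ctx;       /* where to go if the body returns */
    makecontext(&coro_ctx, counter_func, 0);

    int sum = 0;
    for (int i = 0; i < 3; i++)
        sum += co_resume_int();           /* collects 10, 20, 30 */
    return sum;
}
```

Note that swapcontext saves and restores the signal mask with a system call on most platforms, which is precisely the overhead a hand-rolled asm switcher like _yield is meant to avoid.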
I'm stuck on implementing coyield(). coyield() can be written in C, but it's the assembly that I'm having problems with. Here's what I've got so far (MASM/VC++ syntax).
;;; function: void _yield(struct mcontext *callee, struct mcontext *caller)
;;; arg0(RCX): callee token
;;; arg1(RDX): caller token
_yield proc
        lea RBP, [RCX + 64 * 8]
        mov [RCX +  0], R15
        mov [RCX +  8], R14
        mov [RCX + 16], R13
        mov [RCX + 24], R12
        mov [RCX + 32], RSI
        mov [RCX + 40], RDI
        mov [RCX + 48], RBP
        mov [RCX + 56], RBX

        mov R11, RSP
        mov RSP, [RDX + 64]
        mov [RDX + 64], R11

        mov R15, [RDX +  0]
        mov R14, [RDX +  8]
        mov R13, [RDX + 16]
        mov R12, [RDX + 24]
        mov RSI, [RDX + 32]
        mov RDI, [RDX + 40]
        mov RBP, [RDX + 48]
        mov RBX, [RDX + 56]

        ret
_yield endp
This is a straightforward adaptation of 8bitpimp's code. What it doesn't do, if I understand this code correctly, is put mcontext->return_address and mcontext->coroutine_return_address on the stack to be popped by the ret. Also, is that fast? IIRC, it causes a mismatch on the return branch predictor found in modern x64 parts.
Answer

This answer only addresses the "is it fast" part of the question.

Return Address Prediction

First, a brief description of the behavior of a typical return-address predictor.

- Every time a call is made, the return address that is pushed on the actual stack is also stored inside a CPU structure called the return address buffer or something like that.
- When a ret (return) is made, the CPU assumes the destination will be the address currently at the top of the return address buffer, and that entry is "popped" from the buffer.

The effect is to perfectly1 predict call/ret pairs, as long as they occur in their usual properly nested pattern and the ret is actually removing the unmodified return address pushed by the call in each case. For more details you can start here.

Normal function calls in C or C++ (or pretty much any other language) will generally always follow this properly nested pattern2, so you don't need to do anything special to take advantage of return prediction.
Failure Modes

In cases where call/ret aren't paired up normally, the predictions can fail in (at least) a couple of different ways:

- If the stack pointer or the return value on the stack is manipulated so that a ret doesn't return to the place that the corresponding call pushed, you'll get a branch target prediction failure for that ret, but subsequent normally nested ret instructions will continue to predict correctly as long as they are correctly nested. For example, if in a function you add a few bytes to the value at [rsp] in order to skip over the instruction following the call in the calling function, the next ret will mispredict, but the ret that follows inside the calling function should be fine.
- On the other hand, if the call and ret aren't properly nested, the whole return prediction buffer can become misaligned, causing future ret instructions, if any, that use the existing values to mispredict2.5. For example, if you call into a function but then use a jmp to return to the caller, there is a mismatched call without a ret. The ret inside the caller will mispredict, and so will the ret inside the caller of the caller, and so on, until all misaligned values are used up or overwritten3. A similar case would occur if you had a ret not matched with a corresponding call (and this case is important for the subsequent analysis).

Rather than applying the two rules above, you can also simply determine the behavior of the return predictor by tracing through the code and tracking what the return stack looks like at each point. Every time you have a ret instruction, see if it returns to the current top of the return stack - if not, you'll get a misprediction.
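This tracing procedure can be mechanized. Here is a toy model of such a predictor (an invented sketch, not any real CPU's implementation): push an entry on each call, compare-and-pop on each ret, and count mismatches:

```c
/* Toy model of a return-stack predictor: push on call, compare-and-pop
 * on ret. Real hardware differs in depth and replacement policy; this
 * only illustrates the pairing rules described above. */
#include <assert.h>

#define RSB_DEPTH 16

typedef struct {
    long buf[RSB_DEPTH];
    int  top;           /* number of (conceptually) live entries */
    int  mispredicts;
} rsb_t;

void rsb_call(rsb_t *r, long ret_addr) {
    if (r->top < RSB_DEPTH)
        r->buf[r->top] = ret_addr;  /* entries beyond the depth are lost */
    r->top++;
}

void rsb_ret(rsb_t *r, long actual_target) {
    long predicted = (r->top > 0 && r->top <= RSB_DEPTH)
                         ? r->buf[r->top - 1] : -1;
    if (r->top > 0)
        r->top--;                   /* a ret always consumes the top entry */
    if (predicted != actual_target)
        r->mispredicts++;
}
```

Feeding it a properly nested call/ret trace yields no mispredictions, while a ret whose target differs from what the matching call pushed is counted as a miss.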
Misprediction Cost

The actual cost of a misprediction depends on the surrounding code. A figure of ~20 cycles is commonly given and is often seen in practice, but the actual cost can be lower: e.g., as low as zero if the CPU is able to resolve the misprediction early and start fetching along the new path without interrupting the critical path; or higher: e.g., if the branch prediction failures take a long time to resolve and reduce the effective parallelism of long-latency operations.
Regardless, we can say that the penalty is usually significant when it occurs in an operation that otherwise takes only a handful of instructions.
Fast Coroutines

Existing Behavior for Coresume and Coyield

The existing _yield (context switch) function swaps the stack pointer rsp and then uses ret to return to a different location than what the actual caller5 pushed (in particular, it returns to the location that was pushed onto the caller stack when the caller called yield earlier). This will generally cause a misprediction at the ret inside _yield.

For example, consider the case where some function A0 makes a normal function call to A1, which in turn calls coresume4 to resume a coroutine B1, which later calls coyield to yield back to A1. Inside the call to coresume, the return stack looks like A0, A1, but then coresume swaps rsp to point to the stack for B1, and the top value of that stack is an address inside B1 immediately following coyield in the code for B1. The ret inside coresume hence jumps to a point in B1, and not to a point in A1 as the return stack expects. Hence you get a misprediction on that ret, and the return stack looks like A0.

Now consider what happens when B1 calls coyield, which is implemented in basically the same way as coresume: the call to coyield pushes B1 on the return stack, which now looks like A0, B1, and then it swaps the stack to point to the A1 stack and does the ret, which will return to A1. So the ret misprediction will happen in the same way, and the stack is left as A0.

So the bad news is that a tight series of calls to coresume and coyield (as is typical with a yield-based iterator, for example) will mispredict each time. The good news is that now inside A1 at least the return stack is correct (not misaligned) - if A1 returns to its caller A0, the return is correctly predicted (and so on when A0 returns to its caller, etc). So you suffer a mispredict penalty each time, but at least you don't misalign the return stack in this scenario. The relative importance of this depends on how often you are calling coresume/coyield versus calling functions normally below the function that is calling coresume.
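Applying the bookkeeping from the "Failure Modes" section to this A0/A1/B1 example, a toy trace (symbolic addresses, invented helper) confirms one misprediction per switch while the net return-stack depth stays balanced:

```c
/* Toy trace of alternating coresume/coyield switches. Each switch is a
 * call (push) followed by a ret that targets the *other* routine, so
 * the ret never matches the top of the modeled return stack. */
#include <assert.h>

enum { IN_A1 = 1, IN_B1 = 2 };  /* symbolic return addresses */

int switch_mispredicts(int switches) {
    int rsb[64], top = 0, miss = 0;
    rsb[top++] = IN_A1;                 /* the call to coresume pushes A1 */
    for (int i = 0; i < switches; i++) {
        int predicted = rsb[top - 1];
        int actual = (i % 2 == 0) ? IN_B1 : IN_A1;  /* ret goes to the other side */
        if (predicted != actual)
            miss++;
        top--;                          /* the ret pops even when mispredicted */
        if (i + 1 < switches)           /* the next coyield/coresume call pushes */
            rsb[top++] = (i % 2 == 0) ? IN_B1 : IN_A1;
    }
    return miss;
}
```

Every switch mispredicts, but since each switch is one call plus one ret, the stack ends back at A0's frame, matching the "at least you don't misalign the return stack" observation above.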
Making It Fast

So can we fix the misprediction? Unfortunately, it's tricky in the combination of C and external ASM calls, because making the call to coresume or coyield implies a call inserted by the compiler, and it's hard to unwind this in the asm.

Still, let's try.

Use Indirect Calls

One approach is to get rid of using ret at all and just use indirect jumps.

That is, just replace the ret at the end of your coresume and coyield calls with:

pop r11
jmp r11

This is functionally equivalent to ret, but affects the return stack buffer differently (in particular, it doesn't affect it).

If we analyze the repeated sequence of coresume and coyield calls as above, we get the result that the return stack buffer just starts growing indefinitely like A0, A1, B1, A1, B1, .... This occurs because in fact we aren't using the ret at all in this implementation. So we don't suffer return mispredictions, because we aren't using ret! Instead, we rely on the accuracy of the indirect branch predictor to predict jmp r11.

How that predictor works depends on how coresume and coyield are implemented. If they both call a shared _yield function that isn't inlined, there is only a single jmp r11 location and this jmp will alternately go to a location in A1 and B1. Most modern indirect predictors will re-predict this simple repeating pattern fine, although older ones which only tracked a single location will not. If _yield got inlined into coresume and coyield, or you just copy-pasted the code into each function, there are two distinct jmp r11 sites, each of which only sees a single target, and each should be well-predicted by any CPU with an indirect branch predictor6.

So this should generally predict a series of tight coyield and coresume calls well7, but at the cost of obliterating the return buffer, so when A1 decides to return to A0 this will be mispredicted, as will subsequent returns by A0 and so on. The size of this penalty is bounded above by the size of the return stack buffer, so if you are making many tight coresume/coyield calls this may be a good tradeoff.

That's the best I can think of within the constraint of external calls to functions written in ASM, because you already have an implied call for your coroutines, and you have to make the jump to the other coroutine from inside there, and I can't see how to keep the stacks balanced and return to the correct location with those constraints.
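The effect of the pop r11 / jmp r11 variant on the return-stack model can be sketched the same way: every switch still pushes, nothing ever pops, so A1's eventual genuine ret sees a stale top entry (toy model with symbolic addresses):

```c
/* With pop r11/jmp r11, coresume/coyield calls push return addresses
 * but no ret ever pops them, so stale entries pile up in the model.
 * Returns 1 if A1's final genuine ret back to A0 is mispredicted. */
#include <assert.h>

enum { INTO_A0 = 10, INTO_A1 = 11, INTO_B1 = 12 };  /* symbolic addresses */

int final_ret_mispredicts(int switches) {
    int rsb[128], top = 0;
    rsb[top++] = INTO_A0;               /* A0's original call to A1 */
    for (int i = 0; i < switches; i++)  /* each switch pushes, never pops */
        rsb[top++] = (i % 2 == 0) ? INTO_A1 : INTO_B1;
    return rsb[top - 1] != INTO_A0;     /* predicted vs. actual target */
}
```

With zero switches the genuine return predicts fine; after any switching at all, the top of the modeled buffer is stale and the return mispredicts, illustrating the "obliterating the return buffer" cost (bounded in real hardware by the buffer depth).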
Inlined Code at the Call Site

If you can inline code at the call site of your coroutine methods (e.g., with compiler support or inline asm), then you can perhaps do better.

The call to coresume could be inlined as something like this (I've omitted the register saving and restoring code because that's straightforward):

; rcx - current context
; rdx - context for the coroutine we are about to resume

; save current non-volatile regs (not shown)
; load non-volatile regs for dest (not shown)
lea r11, [rsp - 8]
mov [rcx + 64], r11   ; save current stack pointer
mov r11, [rdx + 64]   ; load dest stack pointer
call [r11]

Note that coresume doesn't actually do the stack swap - it just loads the destination stack pointer into r11 and then does a call against [r11] to jump to the coroutine. This is necessary so that the call correctly pushes the location we should return to on the stack of the caller.

Then, coyield would look something like (inlined into the calling function):

; save current non-volatile regs (not shown)
; load non-volatile regs for dest (not shown)
lea r11, [after_ret]
push r11              ; save the return point on the stack
mov rsp, [rdx + 64]   ; load the destination stack
ret
after_ret:
mov rsp, r11

When a coresume call jumps to the coroutine it ends up at after_ret, and before executing the user code the mov rsp, r11 instruction swaps to the proper stack for the coroutine, which has been stashed in r11 by coresume.

So essentially coyield has two parts: the top half, executed before the yield (which occurs at the ret), and the bottom half, which completes the work started by coresume. This allows you to use call as the mechanism to do the coresume jump and ret to do the coyield jump. The call/ret pair is balanced in this case.

I've glossed over some details of this approach: for example, since there is no function call involved, the ABI-specified non-volatile registers aren't really special: in the case of inline assembly you'll need to indicate to the compiler which variables you will clobber and save the rest, but you can choose whatever set is convenient for you. Choosing a larger set of clobbered variables makes the coresume/coyield code sequences themselves shorter, but potentially puts more register pressure on the surrounding code and may force the compiler to spill more of your surrounding code. Perhaps the ideal is just to declare everything clobbered and then the compiler will just spill what it needs.
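The balanced call/ret claim can also be checked against the toy return-stack model: coresume's call pushes exactly the entry that the matching coyield's ret consumes, so neither the switches nor A1's final return mispredict. A sketch with symbolic addresses:

```c
/* Toy model of the balanced scheme: coresume switches via call (push),
 * coyield switches via ret (pop), so the modeled predictor stack stays
 * matched throughout. `switches` alternates coresume, coyield, ... */
#include <assert.h>

enum { IN_A0 = 1, IN_A1 = 2 };  /* symbolic return addresses */

int balanced_mispredicts(int switches) {
    int rsb[128], top = 0, miss = 0;
    rsb[top++] = IN_A0;                 /* A0's original call to A1 */
    for (int i = 0; i < switches; i++) {
        if (i % 2 == 0) {
            rsb[top++] = IN_A1;         /* coresume: call [r11] pushes */
        } else {
            /* coyield: ret targets exactly the address in A1 that the
             * matching coresume call pushed */
            int predicted = rsb[--top];
            if (predicted != IN_A1)
                miss++;
        }
    }
    if (switches % 2 == 0) {            /* A1's genuine return to A0 */
        int predicted = rsb[--top];
        if (predicted != IN_A0)
            miss++;
    }
    return miss;
}
```

Unlike the indirect-jump variant, both the switches and the eventual normal return predict correctly in this model.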
1 Of course, there are limitations in practice: the size of the return stack buffer is likely limited to some small number (e.g., 16 or 24), so once the depth of the call stack exceeds that, some return addresses are lost and won't be correctly predicted. Also, various events like a context switch or interrupt are likely to mess up the return-stack predictor.

2 An interesting exception was a common pattern for reading the current instruction pointer in x86 (32-bit) code: there is no instruction to do this directly, so instead a call next; next: pop rax sequence can be used: a call to the next instruction, which serves only to push the address on the stack so it can be popped off. There is no corresponding ret. Current CPUs actually recognize this pattern, however, and don't unbalance the return-address predictor in this special case.

2.5 How many mispredictions this implies depends on how many net returns the calling function does: if it immediately starts calling down another deep chain of calls, the misaligned return stack entries may never be used at all, for example.

3 Or, perhaps, until the return address stack is re-aligned by a ret without a corresponding call, a case of "two wrongs make a right".

4 You haven't actually shown how coyield and coresume actually call _yield, so for the rest of the question I'll assume that they are implemented essentially as _yield is, directly within coyield or coresume without calling _yield: i.e., copy and paste the _yield code into each function, possibly with some small edits to account for the difference. You can also make this work by calling _yield, but then you have an additional layer of calls and rets that complicates the analysis.

5 To the extent these terms even make sense in a symmetric coroutine implementation, since there is in fact no absolute notion of caller and callee in that case.

6 Of course, this analysis applies only to the simple case of a single coresume call calling into a coroutine with a single coyield call. More complex scenarios are possible, such as multiple coyield calls inside the callee, or multiple coresume calls inside the caller (possibly to different coroutines). However, the same pattern applies: the case with split jmp r11 sites will present a simpler stream than the combined case (possibly at the cost of more iBTB resources).

7 One exception would be the first call or two: the ret predictor needs no "warmup", but the indirect branch predictor may, especially when another coroutine has been called in the interim.