Fast fibers/coroutines under x64 Windows


Problem Description

So I have this coroutine API, extended by me, based on code I found here: https://the8bitpimp.wordpress.com/2014/10/21/coroutines-x64-and-visual-studio/

struct mcontext {
  U64 regs[8];
  U64 stack_pointer;
  U64 return_address;
  U64 coroutine_return_address;
};

struct costate {
   struct mcontext callee;
   struct mcontext caller;
   U32 state;
};

void coprepare(struct costate **token,
       void *stack, U64 stack_size, cofunc_t func); /* C code */
void coenter(struct costate *token, void *arg);     /* ASM code */
void coyield(struct costate *token);                /* ASM code */
int  coresume(struct costate *token);               /* ASM code, new */

I'm stuck on implementing coyield(). coyield() can be written in C, but it's the assembly that I'm having problems with. Here's what I got so far (MASM/VC++ syntax).

;;; function: void _yield(struct mcontext *callee, struct mcontext *caller)
;;; arg0(RCX): callee token
;;; arg2(RDX): caller token
_yield proc
    lea RBP, [RCX + 64 * 8]
    mov [RCX +  0], R15
    mov [RCX +  8], R14
    mov [RCX + 16], R13
    mov [RCX + 24], R12
    mov [RCX + 32], RSI
    mov [RCX + 40], RDI
    mov [RCX + 48], RBP
    mov [RCX + 56], RBX

    mov R11, RSP
    mov RSP, [RDX + 64]
    mov [RDX + 64], R11

    mov R15, [RDX + 0]
    mov R14, [RDX + 8]
    mov R13, [RDX + 16]
    mov R12, [RDX + 24]
    mov RSI, [RDX + 32]
    mov RDI, [RDX + 40]
    mov RBP, [RDX + 48]
    mov RBX, [RDX + 56]

    ret
_yield endp

This is a straightforward adaptation of 8bitpimp's code. What it doesn't do, if I understand this code correctly, is put mcontext->return_address and mcontext->coroutine_return_address on the stack to be popped by the ret. Also, is that fast? IIRC, it causes a mismatch on the return branch predictor found in modern x64 parts.

Solution

This answer only addresses the "is it fast" part of the question.

Return Address Prediction

First, a brief description of the behavior of a typical return-address predictor.

  • Every time a call is made, the return address that is pushed on the actual stack is also stored inside a CPU structure called the return address buffer or something like that.
  • When a ret (return) is made, the CPU assumes the destination will be the address currently at the top of the return address buffer, and that entry is "popped" off the return address buffer.

The effect is to perfectly1 predict call/ret pairs, as long as they occur in their usual properly nested pattern and that ret is actually removing the unmodified return address pushed by call in each case. For more details you can start here.
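
To make this concrete, here is a minimal sketch (function names invented purely for illustration) of properly nested calls, with the state of the return stack buffer (RSB) noted in comments:

;;; illustration only: properly nested call/ret, every ret predicted
outer proc
    call middle        ; RSB: [ra0]  (ra0 = address of the next line)
    ; ra0: ...
    ret                ; predicted from whatever the RSB held before outer ran
outer endp

middle proc
    call leaf          ; RSB: [ra0, ra1]
    ; ra1: ...
    ret                ; pops ra0 from the RSB -> correctly predicted
middle endp

leaf proc
    ret                ; pops ra1 from the RSB -> correctly predicted
leaf endp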

Normal function calls in C or C++ (or pretty much any other language) will generally always follow this properly nested pattern2. So you don't need to do anything special to take advantage of the return prediction.

Failure Modes

In cases where call/ret aren't paired up normally, the predictions can fail in (at least) a couple of different ways:

  • If the stack pointer or the return value on the stack is manipulated so that a ret doesn't return to the place that the corresponding call pushed, you'll get a branch target prediction failure for that ret, but subsequent normally nested ret instructions will continue to predict correctly as long as they are correctly nested. For example, if in a function you add a few bytes to the value at [rsp] in order to skip over the instruction following the call in the calling function, the next ret will mispredict, but the ret that follows later inside the calling function should be fine (a minimal sketch of this case follows the list).
  • If, on the other hand, call and ret aren't properly nested, the whole return prediction buffer can become misaligned, causing future ret instructions, if any, that use the existing values to mispredict2.5. For example, if you call into a function, but then use jmp to return to the caller, there is a mismatched call without a ret. The ret inside the caller will mispredict, and so will the ret inside the caller of the caller, and so on, until all misaligned values are used up or overwritten3. A similar case occurs if you have a ret not matched with a corresponding call (and this case is important for the subsequent analysis).
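
A minimal sketch of the first failure mode (hypothetical functions; it assumes the instruction being skipped is exactly 5 bytes long):

caller proc
    call skipper
    mov eax, 1               ; 5-byte instruction that skipper will skip over
    ; execution resumes here after skipper returns
    ret                      ; still predicted correctly: the rest of the RSB is intact
caller endp

skipper proc
    add qword ptr [rsp], 5   ; retarget the return address past the mov in caller
    ret                      ; the RSB predicted the original address -> one misprediction
skipper endp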

Rather than applying the two rules above, you can also simply determine the behavior of the return predictor by tracing through the code and tracking what the return stack looks like at each point. Every time you have a ret instruction, see if it returns to the current top of the return stack - if not, you'll get a misprediction.

Misprediction Cost

The actual cost of a misprediction depends on the surrounding code. A figure of ~20 cycles is commonly given and is often seen in practice, but the actual cost can be lower: e.g., as low as zero if the CPU is able to resolve the misprediction early and start fetching along the new path without interrupting the critical path, or higher: e.g., if the branch prediction failures take a long time to resolve and reduce the effective parallelism of long-latency operations. Regardless, we can say that the penalty is usually significant when it occurs in an operation that otherwise takes only a handful of instructions.

Fast Coroutines

Existing Behavior for Coresume and Coyield

The existing _yield (context switch) function swaps the stack pointer rsp and then uses ret to return to a different location than the one the actual caller pushed (in particular, it returns to the location that was pushed onto the caller's stack when the caller earlier called yield). This will generally cause a misprediction at the ret inside _yield.

For example, consider the case where some function A0 makes a normal function call to A1, which in turn calls coresume4 to resume a coroutine B1, which later calls coyield to yield back to A1. Inside the call to coresume, the return stack looks like A0, A1, but then coresume swaps rsp to point to the stack for B1, and the top value of that stack is an address inside B1 immediately following coyield in the code for B1. The ret inside coresume hence jumps to a point in B1, and not to a point in A1 as the return stack expects. Hence you get a mis-prediction on that ret and the return stack is left looking like A0.

Now consider what happens when B1 calls coyield, which is implemented in basically the same way as coresume: the call to coyield pushes B1 on the return stack, which now looks like A0, B1, and then swaps the stack to point to the A1 stack and does the ret, which returns to A1. So the ret misprediction happens in the same way, and the stack is left as A0.

So the bad news is that a tight series of calls to coresume and coyield (as is typical with a yield-based iterator, for example) will mispredict each time. The good news is that now inside A1 at least the return stack is correct (not misaligned) - if A1 returns to its caller A0, the return is correctly predicted (and so on when A0 returns to its caller, etc). So you suffer a mispredict penalty each time, but at least you don't misalign the return stack in this scenario. The relative importance of this depends on how often you are calling coresume/coyield versus calling functions normally below the function that is calling coresume.
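
A rough trace of the sequence above, using the same A0/A1/B1 names and assuming (as in footnote 4) that the _yield body is effectively inlined into coresume and coyield:

; return stack buffer (RSB) trace for the A0 -> A1 -> B1 example
;
;   A0:       call A1          ; RSB: [A0]
;   A1:       call coresume    ; RSB: [A0, A1]
;   coresume: swap rsp, ret    ; lands in B1 after its last coyield
;                              ;   -> mispredict, RSB: [A0]
;   B1:       call coyield     ; RSB: [A0, B1]
;   coyield:  swap rsp, ret    ; lands in A1 after its coresume
;                              ;   -> mispredict, RSB: [A0]
;   A1:       ret              ; returns to A0 -> correctly predicted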

Making It Fast

So can we fix the misprediction? Unfortunately, it's tricky with the combination of C and external ASM calls, because making a call to coresume or coyield implies a call inserted by the compiler, and it's hard to unwind this in the asm.

Still, let's try.

Use Indirect Calls

One approach is to get rid of ret entirely and just use indirect jumps.

That is, just replace the ret at the end of your coresume and coyield calls with:

pop r11
jmp r11

This is functionally equivalent to ret, but affects the return stack buffer differently (in particular, it doesn't affect it).
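
Applied to the _yield from the question, only the tail changes; a sketch (the register saving, stack swap, and restore stay exactly as before):

    ; ... register save / RSP swap / register restore exactly as in _yield above ...

    pop r11             ; pop the return address off the (new) stack ourselves
    jmp r11             ; jump to it; unlike ret, this does not pop the return stack buffer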

If we analyze the repeated sequence of coresume and coyield calls as above, we get the result that the return stack buffer just starts growing indefinitely like A0, A1, B1, A1, B1, .... This occurs because in fact we aren't using ret at all in this implementation. So we don't suffer return mis-predictions, because we aren't using ret! Instead, we rely on the accuracy of the indirect branch predictor to predict the jmp r11.

How that predictor works depends on how coresume and coyield are implemented. If they both call a shared _yield function that isn't inlined, there is only a single jmp r11 location and this jmp will alternately go to a location in A1 and B1. Most modern indirect predictors will predict this simple repeating pattern fine, although older ones which only tracked a single location will not. If _yield got inlined into coresume and coyield, or you just copy-pasted the code into each function, there are two distinct jmp r11 sites, each of which only ever sees a single target, and each should be well-predicted by any CPU with an indirect branch predictor6.

So this should generally predict a series of tight coyield and coresume calls well7, but at the cost of obliterating the return buffer, so when A1 decides to return to A0 this will be mispredicted, as will subsequent returns by A0 and so on. The size of this penalty is bounded above by the size of the return stack buffer, so if you are making many tight coresume/coyield calls this may be a good tradeoff.

That's the best I can think of within the constraint of external calls to functions written in ASM, because you already have an implied call for your coroutines, and you have to make the jump to the other coroutine from inside there, and I can't see how to keep the stacks balanced and return to the correct location with those constraints.

Inlined Code at the Call Site

If you can inline code at the call-site of your coroutine methods (e.g., with compiler support or inline asm), then you can perhaps do better.

The call to coresume could be inlined as something like this (I've omitted the register saving and restoring code because that's straightforward):

; rcx - current context
; rdx - context for the coroutine we are about to resume

; save current non-volatile regs (not shown)
; load non-volatile regs for dest (not shown)
lea r11, [rsp - 8]
mov [rcx + 64], r11 ; save current stack pointer
mov r11, [rdx + 64] ; load dest stack pointer
call [r11]

Note that coresume doesn't actually do the stack swap - it just loads the destination stack into r11 and then does a call against [r11] to jump to the coroutine. This is necessary so that the call correctly pushes the location we should return to onto the caller's stack.

Then, coyield would look something like (inlined into the calling function):

; save current non-volatile regs (not shown)
; load non-volatile regs for dest (not shown)
lea r11, [after_ret]
push r11             ; save the return point on the stack
mov  rsp, [rdx + 64] ; load the destination stack
ret
after_ret:
mov rsp, r11

When a coresume call jumps to the coroutine it ends up at after_ret, and before executing the user code the mov rsp, r11 instruction swaps to the proper stack for the coroutine which has been stashed in r11 by coresume.

So essentially coyield has two parts: the top half, executed before the yield (which happens at the ret), and the bottom half, which completes the work started by coresume. This allows you to use call as the mechanism to do the coresume jump and ret to do the coyield jump. The call/ret pairs are balanced in this case.
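
A rough sketch of how the pairs line up in the inlined scheme (same A1/B1 names as before):

;   in A1, coresume:  call [r11]    ; pushes A1's resume point (stack + RSB), jumps to after_ret in B1
;   in B1, coyield:   push r11      ; top half: parks after_ret on B1's stack for the next resume
;                     mov rsp, ...  ; switch to A1's saved stack
;                     ret           ; pops exactly what call [r11] pushed -> correctly predicted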

I've glossed over some details of this approach: for example, since there is no function call involved, the ABI-specified non-volatile registers aren't really special: in the case of inline assembly you'll need to indicate to the compiler which registers you will clobber and save the rest, but you can choose whatever set is convenient for you. Choosing a larger set of clobbered registers makes the coresume/coyield code sequences themselves shorter, but potentially puts more register pressure on the surrounding code and may force the compiler to spill more in the code surrounding your calls. Perhaps the ideal is just to declare everything clobbered and then the compiler will just spill what it needs.


1 Of course, there are limitations in practice: the size of the return stack buffer is likely limited to some small number (e.g., 16 or 24) so once the depth of the call stack exceeds that, some return addresses are lost and won't be correctly predicted. Also, various events like a context switch or interrupt are likely to mess up the return-stack predictor.

2 An interesting exception was a common pattern for reading the current instruction pointer in x86 (32-bit) code: there is no instruction to do this directly, so instead a call next; next: pop eax sequence can be used: a call to the next instruction which serves only to push the address on the stack, which is then popped off. There is no corresponding ret. Current CPUs actually recognize this pattern, however, and don't unbalance the return-address predictor in this special case.
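
For reference, a 32-bit sketch of the trick described above:

    call next           ; pushes the address of next onto the stack (and the RSB)
next:
    pop eax             ; eax now holds the address of next, i.e., the current instruction pointer
    ; no matching ret; current CPUs recognize this pattern and leave the RSB balanced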

2.5 How many mispredictions this implies depends on how many net returns the calling function does: if it immediately starts calling down another deep chain of calls, the misaligned return stack entries may never be used at all, for example.

3 Or, perhaps, until the return address stack is re-aligned by a ret without a corresponding call, a case of "two wrongs make a right".

4 You haven't actually shown how coyield and coresume actually call _yield, so for the rest of the question I'll assume that they are implemented essentially as _yield is, directly within coyield or coresume without calling _yield: i.e., copy and paste the _yield code into each function, possibly with some small edits to account for the difference. You can also make this work by calling _yield, but then you have an additional layer of calls and rets that complicates the analysis.

5 To the extent these terms even make sense in a symmetric coroutine implementation, since there is in fact no absolute notion of caller and callee in that case.

6 Of course, this analysis applies only to the simple case that you have a single coresume call calling into a coroutine with a single coyield call. More complex scenarios are possible, such as multiple coyield calls inside the callee, or multiple coresume calls inside the caller (possibly to different coroutines). However, the same pattern applies: the case with split jmp r11 sites will present a simpler stream than the combined case (possibly at the cost of more iBTB resources).

7 One exception would be the first call or two: the ret predictor needs no "warmup" but the indirect branch predictor may, especially when another coroutine has been called in the interim.
