Performance optimisations of x86-64 assembly - Alignment and branch prediction

Question

I’m currently coding highly optimised versions of some C99 standard library string functions, like strlen(), memset(), etc, using x86-64 assembly with SSE-2 instructions.

So far I’ve managed to get excellent results in terms of performance, but I sometimes get weird behaviour when I try to optimise more.

For instance, adding or even removing some simple instructions, or simply reorganising some local labels used with jumps, completely degrades the overall performance. And there’s absolutely no reason for it in terms of the code.

So my guess is that there are some issues with code alignment, and/or with branches which get mispredicted.

I know that, even with the same architecture (x86-64), different CPUs have different algorithms for branch prediction.

But is there some general advice, about code alignment and branch prediction, when developing for high performance on x86-64?

In particular, about alignment, should I ensure all labels used by jump instructions are aligned on a DWORD?

_func:
    ; ... Some code ...
    test rax, rax
    jz   .label
    ; ... Some code ...
    ret
    .label:
        ; ... Some code ...
        ret

In the previous code, should I use an align directive before .label:, like:

align 4
.label:

If so, is it enough to align on a DWORD when using SSE-2?

And about branch prediction, is there a «preferred» way to organize the labels used by jump instructions, in order to help the CPU, or are today's CPUs smart enough to determine that at runtime by counting the number of times a branch is taken?

EDIT

Ok, here's a concrete example - here's the start of strlen() with SSE-2:

_strlen64_sse2:
    mov         rsi,    rdi         ; save the original pointer
    and         rdi,    -16         ; round the pointer down to a 16-byte boundary
    pxor        xmm0,   xmm0        ; xmm0 = 0
    pcmpeqb     xmm0,   [ rdi ]     ; compare 16 bytes against zero, byte by byte
    pmovmskb    rdx,    xmm0        ; collect one mask bit per byte into rdx
    ; ...

Running it 10'000'000 times with a 1000-character string gives about 0.48 seconds, which is fine.
But it does not check for a NULL string input. So obviously, I'll add a simple check:

_strlen64_sse2:
    test        rdi,    rdi         ; NULL input pointer?
    jz          .null               ; if so, handle it separately
    ; ...

Same test: it now runs in 0.59 seconds. But if I align the code after this check:

_strlen64_sse2:
    test        rdi,    rdi         ; NULL input pointer?
    jz          .null               ; if so, handle it separately
    align       8                   ; pad so the following code starts on an 8-byte boundary
    ; ...

The original performance is back. I used 8 for alignment, as 4 doesn't change anything.
Can anyone explain this, and give some advice about when to align, or not to align, code sections?

EDIT 2

Of course, it's not as simple as aligning every branch target. If I do that, performance will usually get worse, except in specific cases like the one above.

Answer

Alignment optimisations

1. Use .p2align <abs-expr> <abs-expr> <abs-expr> instead of align.

It grants fine-grained control using its 3 params:


  • param1 - Align to what boundary.
  • param2 - Fill padding with what (zeroes or NOPs).
  • param3 - Do NOT align if padding would exceed specified number of bytes.

  • Aligning the start of a frequently used code block to a cache-line boundary increases the chance that the entire block lies in a single cache line. Once it has been loaded into the L1 cache, it can then execute completely without needing to access RAM for instruction fetch. This is highly beneficial for loops with a large number of iterations (see the sketch below).
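As an illustration, here is a minimal sketch in GNU assembler (GAS) syntax, since .p2align is a GAS directive; the label names and the 16-byte/64-byte boundaries are assumptions chosen for a typical x86-64 part with 64-byte cache lines, not values taken from the question:

    # Prefer a 16-byte boundary (2^4) for the loop head, but only if that
    # costs at most 10 bytes of padding; otherwise settle for an 8-byte
    # boundary (2^3). In a code section the padding is emitted as NOPs.
    .p2align 4,,10
    .p2align 3
.Lloop_head:
    # ... hot loop body ...

    # For a block that should sit entirely within one cache line, align
    # its start to 64 bytes (2^6) instead:
    .p2align 6
.Lhot_block:
    # ... short, frequently executed code ...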

  /* nop */
  static const char nop_1[] = { 0x90 };

  /* xchg %ax,%ax */
  static const char nop_2[] = { 0x66, 0x90 };

  /* nopl (%[re]ax) */
  static const char nop_3[] = { 0x0f, 0x1f, 0x00 };

  /* nopl 0(%[re]ax) */
  static const char nop_4[] = { 0x0f, 0x1f, 0x40, 0x00 };

  /* nopl 0(%[re]ax,%[re]ax,1) */
  static const char nop_5[] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

  /* nopw 0(%[re]ax,%[re]ax,1) */
  static const char nop_6[] = { 0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00 };

  /* nopl 0L(%[re]ax) */
  static const char nop_7[] = { 0x0f, 0x1f, 0x80, 0x00, 0x00, 0x00, 0x00 };

  /* nopl 0L(%[re]ax,%[re]ax,1) */
  static const char nop_8[] =
    { 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00};

  /* nopw 0L(%[re]ax,%[re]ax,1) */
  static const char nop_9[] =
    { 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };

  /* nopw %cs:0L(%[re]ax,%[re]ax,1) */
  static const char nop_10[] =
    { 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };

(Up to 10-byte NOPs for x86. Source: binutils-2.2.3, https://android.googlesource.com/toolchain/binutils/+/f226517827d64cc8f9dccb0952731601ac13ef2a/binutils-2.23/bfd/cpu-i386.c)
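Since the question's snippets use NASM syntax, note that NASM's standard smartalign macro package can generate this kind of multi-byte NOP padding as well; a minimal sketch (the alignment mode and boundary are assumptions, not taken from the original answer):

%use smartalign                     ; enable NASM's smartalign standard macro package
alignmode p6                        ; pad ALIGN with long (multi-byte) NOPs rather than runs of 0x90

    ; ...
    align       16                  ; the padding before this label is now emitted as long NOPs
.loop_head:
    ; ...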

Branch prediction optimisations

There is a lot of variation between x86_64 micro-architectures/generations. However, a common set of guidelines that apply to all of them can be summarised as follows. Reference: Section 3 of Agner Fog's x86 micro-architecture manual.


  • Far jumps are not predicted, i.e. the pipeline always stalls on a conditional far jump.


  • Loop detection logic is guaranteed to work ONLY for loops with < 64 iterations. This is due to the fact that a branch instruction is recognized as having loop behavior if it goes one way n-1 times and then goes the other way 1 time, for any n up to 64 (see the sketch below).
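A sketch of how that guideline can be applied, in the same NASM-style syntax as the question (the routine and the fixed 256-byte size are hypothetical, purely for illustration): processing 16 bytes per iteration keeps the trip count at 16, so the backward branch follows the n-1-taken / 1-not-taken pattern the loop detector can capture, whereas a byte-at-a-time version of the same loop would iterate 256 times and fall outside that window.

; Hypothetical helper: zero a 256-byte, 16-byte-aligned buffer at [rdi].
_zero256_sse2:
    pxor        xmm0,   xmm0        ; xmm0 = 0
    mov         ecx,    16          ; 256 bytes / 16 bytes per store = 16 iterations
.fill:
    movdqa      [ rdi ],    xmm0    ; store 16 zero bytes
    add         rdi,    16
    dec         ecx
    jnz         .fill               ; taken 15 times, not taken once: well under 64
    ret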
