x86-64 Relative jmp performance

Question

I'm currently doing an assignment that measures the performance of various x86-64 instructions (AT&T syntax).

The instruction I'm somewhat confused about is the unconditional jmp. This is how I've implemented it:

    .global uncond
uncond:
    .rept 10000
    jmp . + 2               # jump to the next instruction (each jmp is 2 bytes)
    .endr

    mov $10000, %rax        # return value: the number of jmp instructions executed
    ret

It's fairly simple. The code creates a function called "uncond" which uses the .rept directive to repeat the jmp instruction 10000 times, then sets the return value to the number of jmp instructions executed.

".at&t 语法中的意思是当前地址,我增加了 2 个字节以说明 jmp 指令本身(因此 jmp . + 2 应该简单地移动到下一条指令).

"." in at&t syntax means the current address, which I increase by 2 bytes in order to account for the jmp instruction itself (so jmp . + 2 should simply move to the next instruction).

Code that I haven't shown calculates the number of cycles it takes to process the 10000 instructions.
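
The timing harness isn't shown in the question, but a minimal sketch of one way to do it is below, assuming the cycle count comes from rdtsc deltas (time_uncond is a made-up name, and rdtsc counts reference cycles rather than core clock cycles, so the numbers are only directly comparable at a fixed frequency):

# Hypothetical timing wrapper around uncond (not the code from the question).
    .global time_uncond
time_uncond:
    push %r12               # preserve a callee-saved register for the start timestamp
    lfence                  # order rdtsc with respect to earlier instructions
    rdtsc                   # TSC: low 32 bits in %eax, high 32 bits in %edx
    shl  $32, %rdx
    or   %rdx, %rax
    mov  %rax, %r12         # start timestamp
    call uncond             # run the 10000 jumps
    lfence
    rdtsc
    shl  $32, %rdx
    or   %rdx, %rax
    sub  %r12, %rax         # elapsed reference cycles returned in %rax
    pop  %r12
    ret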

My results say jmp is pretty slow (about 10 cycles to process a single jmp instruction), but from what I understand about pipelining, unconditional jumps should be very fast (there are no branch mispredictions).

Am I missing something? Is my code wrong?

Answer

The CPU isn't optimized for no-op jmp instructions, so it doesn't handle the special case of continuing to decode and pipeline jmp instructions that just jump to the next insn.

CPUs are optimized for loops, though. jmp . will run at one insn per clock on many CPUs, or one per 2 clocks on some CPUs.
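
For clarity, jmp . is a jump whose target is its own address, i.e. a one-instruction infinite loop; the label below is just for illustration:

spin:
    jmp spin                # same as "jmp ." -- a 2-byte jump back to itself,
                            # sustained at roughly one taken jump per clock
                            # (or one per 2 clocks) on the CPUs described above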

A jump creates a bubble in instruction fetching. A single well-predicted jump is ok, but running nothing but jumps is problematic. I reproduced your results on a core2 E6600 (Merom/Conroe microarch):

# jmp-test.S
.globl _start
_start:

    mov $100000, %ecx        # outer loop count: 100000 * 10000 = 1e9 jumps
jmp_test:
    .rept 10000
    jmp . + 2                # back-to-back jumps to the next instruction
    .endr

    dec %ecx
    jg jmp_test

    mov $231, %eax           # __NR_exit_group
    xor %edi,%edi            # status = 0 (x86-64 syscall args go in %rdi, %rsi, ...)
    syscall                  # exit_group(0)

Build and run:

gcc -static -nostartfiles jmp-test.S
perf stat -e task-clock,cycles,instructions,branches,branch-misses ./a.out

 Performance counter stats for './a.out':

       3318.616490      task-clock (msec)         #    0.997 CPUs utilized          
     7,940,389,811      cycles                    #    2.393 GHz                      (49.94%)
     1,012,387,163      instructions              #    0.13  insns per cycle          (74.95%)
     1,001,156,075      branches                  #  301.679 M/sec                    (75.06%)
           151,609      branch-misses             #    0.02% of all branches          (75.08%)

       3.329916991 seconds time elapsed

From another run:

 7,886,461,952      L1-icache-loads           # 2377.687 M/sec                    (74.95%)
     7,715,854      L1-icache-load-misses     #    2.326 M/sec                    (50.08%)
 1,012,038,376      iTLB-loads                #  305.119 M/sec                    (75.06%)
           240      iTLB-load-misses          #    0.00% of all iTLB cache hits   (75.02%)

(Numbers in (%) at the end of each line are how much of the total run time that counter was active for: perf has to multiplex for you when you ask it to count more things than the HW can count at once).

So it's not actually I-cache misses, it's just instruction fetch/decode frontend bottlenecks caused by constant jumps.
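
One way to confirm that the bottleneck is the jumps themselves rather than the code footprint (a sketch, not part of the original answer): keep the loop body the same size but replace each jump with a 2-byte NOP, and compare the perf numbers.

# Hypothetical control loop: the same 20000-byte body as jmp_test, but with
# straight-line 2-byte NOPs instead of jumps (assumes %ecx is set up as above).
nop_test:
    .rept 10000
    .byte 0x66, 0x90         # 2-byte NOP (66 90), same size as "jmp . + 2" (eb 00)
    .endr
    dec %ecx
    jg nop_test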

My SnB machine is broken, so I can't test numbers on it, but 8 cycles per jmp sustained throughput is pretty close to your results (which were probably from a different microarchitecture).

For more details, see http://agner.org/optimize/, and other links from the x86 tag wiki.
