Fastest Offset Read for a Small Array


Problem Description

For speed, I would like to read one of 8 registers referenced by the value in a 9th register. The fastest way I see to do this is to use 3 conditional jumps (checking 3 bits in the 9th register). This should have shorter latency than the standard way of doing this with an offset memory read, but this still requires at least 6 clock cycles (at least one test plus one conditional jmp per bit check).
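
For concreteness, a branch-tree version of the idea described above might look roughly like the sketch below. This is only an illustration under assumed conventions (the low three bits of %rdi select one of r8..r15, the result lands in %rax, and the label names are made up), not code from the question.

    # hypothetical sketch: pick one of r8..r15 with a tree of conditional
    # jumps that test the low 3 bits of the index in %rdi
select_branchy:
    test $4, %dil        # bit 2: upper half (r12..r15) or lower half (r8..r11)?
    jnz  upper4
    test $2, %dil        # bit 1
    jnz  pick_r10_r11
    test $1, %dil        # bit 0
    jnz  pick_r9
    mov  %r8, %rax
    ret
pick_r9:
    mov  %r9, %rax
    ret
pick_r10_r11:
    test $1, %dil
    jnz  pick_r11
    mov  %r10, %rax
    ret
pick_r11:
    mov  %r11, %rax
    ret
upper4:
    test $2, %dil
    jnz  pick_r14_r15
    test $1, %dil
    jnz  pick_r13
    mov  %r12, %rax
    ret
pick_r13:
    mov  %r13, %rax
    ret
pick_r14_r15:
    test $1, %dil
    jnz  pick_r15
    mov  %r14, %rax
    ret
pick_r15:
    mov  %r15, %rax
    ret

Every lookup runs three test/jcc pairs, which is where the "at least 6 clock cycles" estimate comes from, and each of those branches can be mispredicted.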

Is there any commercial CPU (preferably x86/x64) with an intrinsic to do this "offset register read" with a latency of just one clock cycle?

In theory, an optimized CPU could do this with one addition and one move, so one or two clock cycles seems easy... Is there some general reason that architectures don't care about speeding up an offset read for a small array?

Recommended Answer

Treating the CPU registers as an array is really not a common approach these days. The last architecture I know that allowed this was the PDP11 and it died out in the late 80s. Why don't you put your array into some memory location like any other array?
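
As a point of comparison, the conventional version is a single load with an indexed addressing mode. A minimal sketch, assuming the eight 64-bit values have already been stored into a small array whose base address is in %rsi, with the 0..7 index in %rdi (names chosen for illustration):

    # keep the eight values in memory and fetch one with a single indexed load
select_mem:
    mov  (%rsi,%rdi,8), %rax   # one uop; a few cycles of latency on an L1d hit
    ret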

That said, you could use a computed jump. This also replaces a data dependency (indexed addressing mode) with a control dependency so out-of-order exec doesn't have to wait for the index input to even be ready before it can start running code that uses the final RAX. Of course this assumes correct branch prediction, which is unlikely if the index changes often. A branch mispredict costs many cycles of little work being done, but the small latency of a load that hits in L1d cache can overlap with independent work very easily.

The throughput cost is higher than an array in memory: some address computations, one jump, one move and a ret, instead of just a mov or even a memory operand with an indexed addressing mode.

To inline this code, simply replace the jmp *%rax with a call *%rax, at the cost of another uop. Or replace the ret instructions with jumps to a label at the bottom and increase the stride of the jump table to 8 to account for the longer encoding; a rough sketch of that second variant follows the listing below.

    # select a register from r8...r15 according to the value in rdi
select:
    lea labels-4*8(%rip),%rax # rdi = 8 is the first jump table entry
    lea (%rax,%rdi,4),%rax    # pointer to the appropriate entry
    jmp *%rax                 # computed jump

    .align 4
labels:
    mov %r8, %rax
    ret

    .align 4
    mov %r9, %rax
    ret

    .align 4
    mov %r10, %rax
    ret

    .align 4
    mov %r11, %rax
    ret

    .align 4
    mov %r12, %rax
    ret

    .align 4
    mov %r13, %rax
    ret

    .align 4
    mov %r14, %rax
    ret

    .align 4
    mov %r15, %rax
    ret
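
The second inlining variant described above, with each ret replaced by a jump to a common label and the table stride widened to 8 bytes, might look roughly like this (select8, labels8 and done are illustrative names, not part of the original answer):

    # variant sketch: entries end with a jump to a shared label instead of ret,
    # so the whole sequence can sit inline in a larger function
select8:
    lea labels8-8*8(%rip),%rax  # rdi = 8 still selects the first entry
    lea (%rax,%rdi,8),%rax      # stride is now 8 bytes per entry
    jmp *%rax                   # computed jump

    .align 8
labels8:
    mov %r8, %rax
    jmp done

    .align 8
    mov %r9, %rax
    jmp done

    .align 8
    mov %r10, %rax
    jmp done

    .align 8
    mov %r11, %rax
    jmp done

    .align 8
    mov %r12, %rax
    jmp done

    .align 8
    mov %r13, %rax
    jmp done

    .align 8
    mov %r14, %rax
    jmp done

    .align 8
    mov %r15, %rax
    jmp done

done:                           # execution continues here with the result in %rax

Each mov-plus-jmp pair encodes in at most 8 bytes, so the .align 8 directives keep every entry at the fixed 8-byte stride that the scaled lea expects.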

While this is probably faster than three conditional jumps (depending on access pattern), it surely won't beat just using an array.
