How many cycles do math functions take on modern processors?


Question



We know that modern processors execute instructions such as cosine and sine directly on the processor, since they have opcodes for them. My question is how many cycles these instructions normally take. Do they take constant time, or does it depend on the input parameters?

Answer

Talking about "cycles for an instruction" on modern processors became difficult quite a while ago. Processors these days contain multiple execution cores whose operation can overlap, and they can execute instructions out of order.

A good example of the essential consideration is given in the Intel processor manual, volume 4, appendix C, which breaks down instruction timing by latency and throughput. Latency is the number of cycles an execution core requires to complete a micro-op. Throughput is the number of cycles required before the execution unit can accept the same instruction again. Throughput is generally lower than latency and can even have fractional values in the table, a side effect of having more than one execution unit of the same type. The type of execution unit matters, because it tells you whether instructions can overlap.

Maybe you got the essential message here: it depends greatly on what other instructions surround the code you are interested in timing. Those other instructions may well execute concurrently with the expensive one, at which point they effectively take 0 cycles. Or they may not, stalling the pipeline because the execution unit is busy with a previous instruction. These are the kinds of details that programmers who write code optimizers care a lot about.
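To make the latency/throughput distinction concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch under simplifying assumptions: a single execution unit, no other instructions competing for it, and no front-end or memory stalls. The function names are made up for illustration, and the numbers plugged in are the FMUL figures quoted in the sample data of this answer.

```python
def dependent_chain_cycles(n, latency):
    """Estimate cycles for n ops where each op needs the previous result.

    Nothing can overlap, so every op pays the full latency.
    """
    return n * latency


def independent_ops_cycles(n, latency, throughput):
    """Estimate cycles for n independent ops on one execution unit.

    A new op can be issued every `throughput` cycles, and the last
    op issued still needs `latency` cycles to complete.
    """
    if n == 0:
        return 0
    return (n - 1) * throughput + latency


# FMUL figures as quoted in this answer (latency = 7, throughput = 2).
FMUL_LATENCY = 7
FMUL_THROUGHPUT = 2

print(dependent_chain_cycles(100, FMUL_LATENCY))                   # 700
print(independent_ops_cycles(100, FMUL_LATENCY, FMUL_THROUGHPUT))  # 205
```

Under this toy model, the same 100 multiplications cost roughly 7 cycles each when chained and roughly 2 cycles each when independent, which is why a single "cycles per instruction" number is not very informative.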

Some sample data from the manual, picking the most modern core in the tables:

• FMUL, latency = 7, throughput = 2, FP_MUL execution unit
• FDIV, latency = 6, throughput = 5, unspecified unit
• FSQRT, latency = 38, throughput = 43, FP_DIV execution unit
• FSIN, latency = 160-180, throughput = 130, unspecified unit

You get much better bang for the buck from SIMD instructions.

The only meaningful thing to do is measure, not assume.
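As one way to act on that advice, the following Python sketch times a function for several input magnitudes, which also speaks to the "does it depend on the input?" part of the question. Treat it as an illustration only: it reports wall-clock nanoseconds rather than cycles, the figures include interpreter call overhead, and `math.sin` calls the C library's sine routine, which may or may not use the FSIN instruction. The helper name `time_per_call_ns` and the chosen inputs are assumptions for this example.

```python
import math
import time


def time_per_call_ns(func, arg, repeats=100_000, trials=5):
    """Return the best average time per call of func(arg), in nanoseconds.

    Taking the minimum over several trials reduces noise from
    other activity on the machine.
    """
    best = float("inf")
    for _trial in range(trials):
        start = time.perf_counter_ns()
        for _ in range(repeats):
            func(arg)
        elapsed = time.perf_counter_ns() - start
        best = min(best, elapsed / repeats)
    return best


if __name__ == "__main__":
    # Small, medium, and very large arguments; argument reduction for
    # large inputs is a common reason for input-dependent timing.
    for x in (0.5, 1e3, 1e10, 1e300):
        ns = time_per_call_ns(math.sin, x)
        print(f"sin({x:g}): {ns:.1f} ns per call")
```

For cycle-accurate numbers on a specific instruction, the same idea would need to be done in C or assembly with a hardware cycle counter, but the methodology is the same: run the real code on the real machine and compare.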
