How to determine CPE: Cycles Per Element


Question

How do I determine the CPE of a program? For example, I have this assembly code for a loop:

# inner4: data_t = float
# udata in %rbx, vdata in %rax, limit in %rcx,
# i in %rdx, sum in %xmm1
.L87:                                   # loop:
  movss  (%rbx,%rdx,4), %xmm0           #  Get udata[i]
  mulss  (%rax,%rdx,4), %xmm0           #  Multiply by vdata[i]
  addss  %xmm0, %xmm1                   #  Add to sum
  addq   $1, %rdx                       #  Increment i
  cmpq   %rcx, %rdx                     #  Compare i:limit
  jl     .L87                           #  If <, goto loop

I have to find the lower bound of the CPE determined by the critical path, using the data type float. I believe the critical path refers to the slowest possible path, and would thus be the one where the program has to execute the mulss instruction, because that instruction takes the largest number of clock cycles.

However, there doesn't seem to be any clear way to determine the CPE. If one instruction takes two clock cycles, and another takes one, can the latter start after the first clock cycle of the former? Any help would be appreciated. Thanks

Recommended answer

If you want to know how long it takes, you should measure it. Execute the loop about 10^10 times, measure the elapsed time, and multiply it by the clock frequency. That gives the total number of cycles; divide it by 10^10 to get the number of clock cycles per loop iteration.
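The measurement described above could be sketched in C roughly as follows. This is only an illustration: the function names `cycles_per_element` and `time_inner4` are made up for this example, and the clock frequency is an assumption the caller has to supply, since the C standard library cannot report it.

```c
#include <stddef.h>
#include <time.h>

/* Same computation as the loop in the question: sum += udata[i] * vdata[i]. */
float inner4(const float *udata, const float *vdata, size_t limit)
{
    float sum = 0.0f;
    for (size_t i = 0; i < limit; i++)
        sum += udata[i] * vdata[i];
    return sum;
}

/* Elapsed time times clock frequency gives total cycles;
   dividing by the element count gives cycles per element.
   clock_hz is an assumption supplied by the caller. */
double cycles_per_element(double seconds, double clock_hz, double elements)
{
    return seconds * clock_hz / elements;
}

/* Run inner4 `repeats` times and return the CPU time used, in seconds.
   The volatile sink keeps the result observable; an optimising compiler
   may still transform the loop, so check the generated assembly. */
double time_inner4(const float *u, const float *v, size_t limit, size_t repeats)
{
    volatile float sink = 0.0f;
    clock_t start = clock();
    for (size_t r = 0; r < repeats; r++)
        sink = inner4(u, v, limit);
    clock_t stop = clock();
    (void)sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

To reach about 10^10 element operations, choose `limit` and `repeats` so that their product is around 10^10, keep the arrays small enough to stay in the L1 cache, and pass `limit * repeats` as `elements`. Frequency scaling makes the supplied clock frequency an approximation on most machines.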

A theoretical prediction of the execution time will almost never be correct (and most of the time it will be too low), because there are numerous effects that determine the speed:

  • Pipelining (there can easily be about 20 stages in the pipeline)
  • Superscalar execution (up to 5 instructions in parallel; cmp and jl may be fused)
  • Decoding to µOps and reordering
  • The latencies of caches or memory
  • The throughput of the instructions (are there enough execution ports free?)
  • The latencies of the instructions
  • Bank conflicts, aliasing issues and more esoteric stuff

Depending on the CPU, and provided the memory accesses all hit the L1 cache, I believe the loop should need at least 3 clock cycles per iteration, because the longest dependency chain is 3 elements long. On an older CPU with slower mulss or addss instructions, the time needed increases.
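One way to make a latency-based lower bound concrete is to look only at the registers that carry a value from one iteration to the next. In the loop from the question those are %xmm1 (updated by addss) and %rdx (updated by addq); the load and the multiply depend on %rdx but not on the previous sum. The sketch below is a simplified model, not a statement about any particular CPU: the struct, the function name `latency_bound`, and the latency values used in the usage note are assumptions for illustration.

```c
#include <stddef.h>

/* One loop-carried dependency: a register that is read and rewritten
   in every iteration, and the assumed latency of the instruction(s)
   that update it. The latencies are illustrative assumptions. */
struct carried_chain {
    const char *reg;
    int latency_cycles;
};

/* Under this simplified model, an iteration cannot complete faster than
   the slowest loop-carried chain, so the bound is the maximum latency. */
int latency_bound(const struct carried_chain *chains, size_t count)
{
    int bound = 0;
    for (size_t i = 0; i < count; i++)
        if (chains[i].latency_cycles > bound)
            bound = chains[i].latency_cycles;
    return bound;
}
```

With an assumed addss latency of 3 cycles and an assumed addq latency of 1 cycle, `latency_bound` returns 3. The real latencies have to be looked up for the specific microarchitecture, and the other effects listed above can push the measured value higher.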

If you are actually interested in speeding up the code, and not only in theoretical observations, you should vectorize it. You can increase the performance by a factor of 4-8 with something like

.L87:                                 # loop:
  vmovaps (%rbx,%rdx,4), %ymm0        #  Get udata[i]..udata[i+7]
  vmulps  (%rax,%rdx,4), %ymm0, %ymm0 #  Multiply by vdata[i]..vdata[i+7]
  vaddps  %ymm0, %ymm1, %ymm1         #  Add to the 8 partial sums
  addq    $8, %rdx                    #  Increment i by 8
  cmpq    %rcx, %rdx                  #  Compare i:limit
  jl      .L87                        #  If <, goto loop

After the loop you need to horizontally add all 8 elements of %ymm1, and of course make sure the data is aligned to 32 bytes and the loop count is divisible by 8.
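The structure of the vectorized loop, including the final horizontal add, can be sketched in plain C with eight independent partial sums standing in for the eight lanes of %ymm1. This is an illustration of the idea rather than the answer's code: the name `inner4_lanes` is hypothetical, and whether a compiler turns it into AVX instructions depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8  /* number of floats in one 256-bit %ymm register */

float inner4_lanes(const float *udata, const float *vdata, size_t limit)
{
    float partial[LANES] = {0.0f};  /* plays the role of %ymm1 */

    /* As in the assembly version, the count must be divisible by 8. */
    assert(limit % LANES == 0);

    for (size_t i = 0; i < limit; i += LANES)
        for (size_t lane = 0; lane < LANES; lane++)
            partial[lane] += udata[i + lane] * vdata[i + lane];

    /* Horizontal add: combine the 8 partial sums into one result. */
    float sum = 0.0f;
    for (size_t lane = 0; lane < LANES; lane++)
        sum += partial[lane];
    return sum;
}
```

Because the additions happen in a different order than in the scalar loop, the floating-point result can differ slightly from the original in the last bits.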
