How to determine CPE: Cycles Per Element

Question

How do I determine the CPE of a program? For example, I have this assembly code for a loop:

# inner4: data_t = float
# udata in %rbx, vdata in %rax, limit in %rcx,
# i in %rdx, sum in %xmm1
.L87:                                   # loop:
  movss  (%rbx,%rdx,4), %xmm0           #  Get udata[i]
  mulss  (%rax,%rdx,4), %xmm0           #  Multiply by vdata[i]
  addss  %xmm0, %xmm1                   #  Add to sum
  addq   $1, %rdx                       #  Increment i
  cmpq   %rcx, %rdx                     #  Compare i:limit
  jl     .L87                           #  If <, goto loop

I have to find the lower bound on the CPE that is determined by the critical path, using the data type float. I believe the critical path is the slowest possible path, and would thus be the one where the program has to execute the mulss instruction, because that instruction takes the most clock cycles.

However, there doesn't seem to be any clear way to determine the CPE. If one instruction takes two clock cycles and another takes one, can the latter start after the first clock cycle of the former? Any help would be appreciated. Thanks.

Recommended answer

If you want to know how long it takes, you should measure it. Execute the loop about 10^10 times, measure the elapsed time, and multiply it by the clock frequency. That gives the total number of cycles; divide by 10^10 to get the number of clock cycles per loop iteration. Since each iteration of this loop processes one element, that figure is the CPE.
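The measurement described above can be sketched in C roughly as follows. This is only an illustration, not code from the original answer: the function names are hypothetical, the clock frequency is an assumption the caller has to supply (and real CPUs change frequency under load), and clock() reports CPU time at a fairly coarse resolution, so the loop has to run long enough for the result to be meaningful.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Scalar inner product: a C-level equivalent of the loop in the question. */
float inner_product(const float *u, const float *v, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += u[i] * v[i];
    return sum;
}

/* Convert a measured run time into cycles per element.
   clock_hz is an assumed, caller-supplied core frequency. */
double cycles_per_element(double seconds, double clock_hz, double elements)
{
    assert(elements > 0.0);
    return seconds * clock_hz / elements;
}

/* Run `reps` passes over n elements and return the CPU seconds used.
   The accumulated result is written to *sink so the work is not dead code. */
double time_inner_product(const float *u, const float *v, size_t n,
                          int reps, float *sink)
{
    /* Reading n through a volatile on every pass keeps the compiler
       from hoisting the loop-invariant call out of the timing loop. */
    volatile size_t vn = n;
    float acc = 0.0f;

    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        acc += inner_product(u, v, vn);
    clock_t stop = clock();

    *sink = acc;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

To use it, time a large number of passes over an array that fits in the L1 cache, then call cycles_per_element(seconds, clock_hz, (double)n * reps). It is worth checking the generated assembly, since an optimizing compiler may vectorize or unroll the loop and then you are no longer measuring the code shown in the question.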

A theoretical prediction of the execution time will almost never be correct (and most of the time it will be too low), because numerous effects determine the speed:

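  • Pipelining (there can easily be about 20 stages in the pipeline)
  • Superscalar execution (up to 5 instructions in parallel; cmp and jl may be fused)
  • Decoding into µOps and reordering
  • The latencies of the caches or memory
  • The throughput of the instructions (are enough execution ports free?)
  • The latencies of the instructions
  • Bank conflicts, aliasing issues and more esoteric stuff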

Depending on the CPU, and provided the memory accesses all hit the L1 cache, I believe the loop should need at least 3 clock cycles per iteration, because the longest dependency chain is 3 elements long. Note that the only part of that chain carried from one iteration to the next is the addss into %xmm1, so the loads and multiplies of later iterations can overlap with it, and the bound is in effect set by the latency of addss. On an older CPU with a slower mulss or addss instruction the time needed increases.

If you are actually interested in speeding up the code, and not only in theoretical observations, you should vectorize it. You can increase the performance by a factor of 4-8 with something like

.L87:                               # loop:
vmovaps (%rbx,%rdx,4), %ymm0        #  Get udata[i]..udata[i+7] (32-byte aligned load)
vmulps  (%rax,%rdx,4), %ymm0, %ymm0 #  Multiply by vdata[i]..vdata[i+7]
vaddps  %ymm0, %ymm1, %ymm1         #  Add to the 8 partial sums
addq    $8, %rdx                    #  Increment i by 8
cmpq    %rcx, %rdx                  #  Compare i:limit
jl      .L87                        #  If <, goto loop

After the loop you need to horizontally add all 8 elements of %ymm1, and of course make sure that the data is 32-byte aligned and the loop count is divisible by 8.
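As a rough illustration of what the vector loop computes, here is a plain C model of it, written without intrinsics so it compiles anywhere. It is a sketch and not part of the original answer: the function name and the LANES constant are hypothetical, and the remainder loop is an addition for lengths that are not a multiple of 8, a case the answer simply rules out. Because the additions happen in a different order than in the scalar loop, the rounding of the result can differ slightly in general.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8  /* number of floats in one 256-bit ymm register */

/* Scalar model of the vectorized loop: 8 independent partial sums,
   a horizontal add, then a scalar remainder loop. */
float inner_product_8lanes(const float *u, const float *v, size_t n)
{
    assert(u != NULL && v != NULL);

    float lane[LANES] = {0};
    size_t i = 0;

    /* Main loop: each pass corresponds to one vmulps/vaddps pair. */
    for (; i + LANES <= n; i += LANES)
        for (size_t k = 0; k < LANES; k++)
            lane[k] += u[i + k] * v[i + k];

    /* Horizontal add of the 8 partial sums. */
    float sum = 0.0f;
    for (size_t k = 0; k < LANES; k++)
        sum += lane[k];

    /* Remainder: handles the last n % 8 elements one at a time. */
    for (; i < n; i++)
        sum += u[i] * v[i];

    return sum;
}
```

The 8 partial sums are independent of each other, which is what lets each vaddps do eight additions at once; the single loop-carried chain of the scalar version is still there, but it now advances 8 elements per step.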
