How to compare two implementations of the same algorithm? (by examining their assembly code)


Question

Assume I have two implementations of the same algorithm in assembly. I would like to know, by examining the two code snippets, which one is faster.

The parameters I thought one might take into account are: number of opcodes, number of branches, number of function frames.

My questions are:

  1. Can I assume each opcode execution is one cycle?
  2. What is the overhead of a branch that breaks the pipeline?
  3. What are the effects and overhead of calling a function?
  4. Is there a difference in the analysis between ARM and x86?

The question is theoretical, since I have two implementations: one is 130 instructions long and the other is 184 instructions long.

I would like to know whether it is definitely true to say that the 130-instruction snippet is faster than the 184-instruction implementation.

Here "better" means "faster".

Answer

It is definitely not true to say that the 130-instruction code is faster than the 184-instruction code. It is very easy to have 1000 instructions run faster than 100, and vice versa, on either of these platforms.

1. Can I assume each opcode execution is one cycle?

Start by looking at the advertised MIPS/MHz figure. Although it is a marketing number, it gives a rough idea of what is possible. If the number is greater than one, then more than one instruction per clock is possible. In other words, no: one cycle per opcode is not a safe assumption.
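To make the point concrete, here is a minimal sketch of a per-class cycle estimate. The instruction classes, the `assumed_cycles` table, and `estimate_cycles` are all hypothetical names, and the costs are made-up placeholders rather than figures for any real ARM or x86 core. It only illustrates why the instruction count alone does not decide the outcome.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction classes. The cycle costs below are made-up
   placeholders, not figures for any real ARM or x86 core. */
enum op_class { OP_ALU, OP_MUL, OP_DIV, OP_LOAD_MISS, OP_CLASS_COUNT };

static const unsigned long assumed_cycles[OP_CLASS_COUNT] = {
    1,   /* OP_ALU: simple add/sub/logic */
    3,   /* OP_MUL: multiply */
    20,  /* OP_DIV: divide */
    100  /* OP_LOAD_MISS: load that misses the cache */
};

/* Adds up count * assumed cost for each class. This ignores any overlap
   between instructions, so it is only a crude estimate. */
unsigned long estimate_cycles(const unsigned long counts[OP_CLASS_COUNT])
{
    unsigned long total = 0;
    assert(counts != NULL);
    for (int c = 0; c < OP_CLASS_COUNT; c++)
        total += counts[c] * assumed_cycles[c];
    return total;
}
```

With these assumed costs, a 130-instruction snippet made of 120 ALU operations and 10 divides comes to 320 cycles, while a 184-instruction snippet made only of ALU operations comes to 184.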

2. What is the overhead of a branch that breaks the pipeline?

Anywhere from absolutely no effect to a very dramatic effect, on either system. The potential penalty ranges from one clock to hundreds.
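As an illustration of how the same work can be written with or without a data-dependent branch, here is a small sketch in C. The function names are hypothetical, and a compiler may well emit the same machine code for both, so the generated assembly still has to be inspected on the target in question.

```c
#include <assert.h>
#include <stddef.h>

/* Branching version: as written, one data-dependent conditional per element. */
size_t count_at_least_branchy(const int *v, size_t n, int threshold)
{
    size_t count = 0;
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (v[i] >= threshold)
            count++;
    }
    return count;
}

/* Branchless version: the comparison yields 0 or 1, which is added directly. */
size_t count_at_least_branchless(const int *v, size_t n, int threshold)
{
    size_t count = 0;
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        count += (size_t)(v[i] >= threshold);
    return count;
}
```

Both functions return the same count. Which one runs faster depends on how predictable the data is and on what the compiler and core do with each form.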

3. What are the effects and overhead of calling a function?

It depends heavily on the function, and on the function calling it. Depending on the calling convention, you might have to save registers to the stack, or rearrange the contents of registers to set up the parameters for the function being called. If a struct is passed by value, a copy of the struct may need to be made on the stack; the bigger the struct, the bigger the copy. Once in the function, a stack frame may need to be prepared, and so on. There are many factors involved. This question and answer are also independent of platform.
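The struct-by-value case can be sketched as follows. `struct samples` and both function names are hypothetical, and whether the copy actually lands on the stack depends on the calling convention and on whether the compiler inlines the call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical payload: 64 ints (256 bytes where int is 4 bytes). */
struct samples { int values[64]; };

/* By value: the caller has to hand over a copy of all 64 ints. */
long sum_by_value(struct samples s)
{
    long total = 0;
    for (size_t i = 0; i < 64; i++)
        total += s.values[i];
    return total;
}

/* By pointer: only an address is passed, and nothing is copied. */
long sum_by_pointer(const struct samples *s)
{
    long total = 0;
    assert(s != NULL);
    for (size_t i = 0; i < 64; i++)
        total += s->values[i];
    return total;
}
```

The two functions compute the same sum. The difference is in the work done at the call site, which is exactly the part an instruction count inside the function body does not show.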

4. Is there a difference in the analysis between ARM and x86?

Yes and no. Both systems use all the modern tricks of pipelining, branch prediction, and so on to keep the MIPS/MHz up. ARM is going to give a better MIPS per MHz than x86; x86, having variable instruction length, might give more instructions per unit of cache. How you analyze the cache, the memory, and the peripheral systems on the system side of the analysis is roughly the same.

The comparison of the instructions and the core is similar or different depending on which aspects you are analyzing. The ARM is not microcoded; the x86 likely is, so you do not really see how many registers there really are, things like that. At the same time, with the x86 you can get a better look at the memory system than with the ARM, since x86 parts are generally not a system on a chip. Depending on which ARM chip you buy, you may lose a lot of visibility at the boundaries of the chip; you might not see all the memory and peripheral buses, for example. (x86 is changing that by putting PCIe on chip now, for example.) For something in the Cortex-A class you mentioned, you would have similar edge-of-chip visibility, since those parts use larger, cheaper DRAM-based memory off chip rather than microcontroller-like on-chip resources.

Bottom line, your final question:

"And I would like to know if it is definitely true to say the 130 instructions long snippet is faster than the 184 instructions long implementation?"

It is definitely NOT TRUE to say the 130-instruction snippet is faster than the 184-instruction snippet. It might be faster, it might be slower, and it might be about the same. With a lot more information we might be able to make a pretty good statement, or it may still be non-deterministic. It is easy to choose 100 instructions that execute faster than 1000 instructions, and likewise easy to choose 1000 instructions that execute faster than 100 instructions (even with no branching and no loops, just linear execution).
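Since counting instructions does not settle it, one option is to measure on the actual target. The sketch below is a minimal timing harness built on the standard `clock()` function. `time_impl`, `sum_loop`, and `sum_formula` are hypothetical names, and the two stand-in implementations are only there so the harness has something to call. A single `clock()` reading is coarse, so in practice the repetition count would need to be large and the runs repeated.

```c
#include <assert.h>
#include <time.h>

typedef unsigned long (*impl_fn)(unsigned long);

/* Returns the CPU time in seconds spent on `reps` calls of fn(arg).
   Results are accumulated into *sink so the calls are not optimized away. */
double time_impl(impl_fn fn, unsigned long arg, unsigned reps,
                 volatile unsigned long *sink)
{
    assert(fn != NULL && sink != NULL);
    clock_t start = clock();
    for (unsigned r = 0; r < reps; r++)
        *sink += fn(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Two stand-in implementations of the same computation: the sum 1 + 2 + ... + n. */
unsigned long sum_loop(unsigned long n)
{
    unsigned long total = 0;
    for (unsigned long i = 1; i <= n; i++)
        total += i;
    return total;
}

unsigned long sum_formula(unsigned long n)
{
    return n * (n + 1) / 2;
}
```

Calling `time_impl(sum_loop, n, reps, &sink)` and `time_impl(sum_formula, n, reps, &sink)` with the same `n` and `reps` gives two numbers that can be compared on that particular machine, which is a statement the instruction counts alone cannot support.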
