Post process `objdump --disassemble` with ARM cycle counts
Problem description
Is there a script available for post-processing `objdump --disassemble` output to annotate it with cycle counts, especially for the ARM family? Most of the time this would only be a pattern match with a table lookup for the count. I guess annotations like `+5M` for five memory cycles might be needed. Perl, Python, bash, C, etc. are all fine. I think this can be done generically, but I am interested in the ARM, which has an orthogonal instruction set. Here is a thread on the 68HC11 doing the same thing. The script would need a CPU model option to select the appropriate cycle counts; I think these counts already exist in the `gcc` machine description.
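The "pattern match with a table lookup" idea could be sketched roughly as below. This is only an illustration under stated assumptions: the `CYCLE_TABLES` dictionary, the `demo` model name, and every number in it are hypothetical placeholders, not timings from any ARM manual or from the `gcc` machine description. It also assumes the usual tab-separated layout of `objdump` instruction lines (address, encoding, mnemonic, operands).

```python
# Hypothetical per-model tables: mnemonic -> (core cycles, memory accesses).
# The numbers are placeholders, not figures from any ARM documentation.
CYCLE_TABLES = {
    "demo": {
        "mov": (1, 0), "add": (1, 0), "sub": (1, 0), "cmp": (1, 0),
        "mul": (3, 0), "ldr": (1, 1), "str": (1, 1),
        "b": (3, 0), "bl": (3, 0), "bx": (3, 0),
    },
}

# ARM condition-code suffixes, e.g. "addeq" or "bne".
CONDITIONS = ("eq", "ne", "cs", "hs", "cc", "lo", "mi", "pl",
              "vs", "vc", "hi", "ls", "ge", "lt", "gt", "le", "al")


def parse_mnemonic(line):
    """Return the mnemonic of an objdump instruction line, or None."""
    fields = line.rstrip("\n").split("\t")
    # Instruction lines look like "    8000:\te3a00000 \tmov\tr0, #0".
    if len(fields) < 3 or not fields[0].strip().endswith(":"):
        return None
    return fields[2].strip().lower() or None


def lookup(mnemonic, table):
    """Find a cost, retrying with a condition code or S flag stripped."""
    if mnemonic in table:
        return table[mnemonic]
    for suffix in CONDITIONS + ("s",):
        base = mnemonic[: -len(suffix)]
        if mnemonic.endswith(suffix) and base in table:
            return table[base]
    return None


def annotate(lines, model="demo"):
    """Append '; +N' (and '+NM' for memory accesses) to known instructions."""
    table = CYCLE_TABLES[model]
    out = []
    for line in lines:
        line = line.rstrip("\n")
        mnemonic = parse_mnemonic(line)
        cost = lookup(mnemonic, table) if mnemonic else None
        if cost is None:
            out.append(line)  # labels, data words, unknown opcodes pass through
        else:
            cycles, mem = cost
            note = f"+{cycles}" + (f" +{mem}M" if mem else "")
            out.append(f"{line}\t; {note}")
    return out
```

Feeding it the lines of a disassembly, for example `annotate(open("prog.dis"))` where `prog.dis` is a hypothetical file holding saved `objdump --disassemble` output, returns the same lines with a trailing annotation on every instruction found in the table. Anything the table does not know is left untouched rather than guessed at.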
I don't think there is an `objdump` switch for this, but a pointer to the right part of the manual would be great.
To clarify, assumptions such as a best-case memory subsystem, as will be the case when the code executes from cache, are fine. The goal is not a 100% accurate cycle count as measured on some running machine. It must be possible to get a reasonable estimate; otherwise compiler design would be impossible.
As DWelch points out, a simple running total is not possible on deeply pipelined architectures such as the more recent Cortex chips. The `objdump` post-processing would have to look at the surrounding opcodes. A `gcc` plug-in is more likely to be able to accomplish this, and as plug-in support is new (4.5+), I don't think such a thing exists. A script for the ARM926 is certainly possible and fairly simple.
The memory latency doesn't matter. The memory controller is like another CPU: it is doing its business while the CPU is doing arithmetic, etc. A good, well-tuned algorithm will overlap the memory accesses with the computations. By counting loads/stores and cycles, you can determine how much parallelism is achieved when you actively profile with a timer. The pipeline is significant because of interlocks between registers. A cycle count for basic blocks can still be calculated reliably and used, even on modern ARM processors, but doing so is too complex for a simple script.
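As a rough illustration of counting loads/stores and cycles per basic block, the sketch below sums both figures for each block. It deliberately ignores register interlocks, so it is only a naive estimate and not the pipeline-aware calculation mentioned above. The `COSTS` table, its numbers, and the `(label, mnemonic)` input format are all assumptions made for this example.

```python
# Placeholder costs as (cycles, memory accesses); not taken from any ARM manual.
COSTS = {
    "mov": (1, 0), "add": (1, 0), "cmp": (1, 0),
    "ldr": (1, 1), "str": (1, 1),
    "b": (3, 0), "bne": (3, 0), "bx": (3, 0),
}
BRANCHES = {"b", "bne", "bx"}


def block_totals(instructions):
    """Sum cycles and memory accesses per basic block.

    `instructions` is a list of (label_or_None, mnemonic) pairs in program
    order. A block ends after a branch; a new one starts at any label.
    Mnemonics missing from COSTS contribute nothing to the totals.
    """
    blocks = []
    current = None
    for label, mnemonic in instructions:
        if label is not None or current is None:
            current = {"label": label, "cycles": 0, "mem": 0}
            blocks.append(current)
        cycles, mem = COSTS.get(mnemonic, (0, 0))
        current["cycles"] += cycles
        current["mem"] += mem
        if mnemonic in BRANCHES:
            current = None  # the next instruction opens a new block
    return blocks
```

Comparing the `mem` and `cycles` totals of a block with a timer measurement of the same code gives a feel for how much of the memory traffic was hidden behind computation.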
Recommended answer
There is an online tool which estimates cycle counts on Cortex-A8. However, this CPU is quite old, and programs optimized for it might be suboptimal on newer CPUs.
AFAIK ARM also provides cycle-accurate emulators for Cortex-A9 and Cortex-A5 in their RVDS software, but it is quite expensive.