What is IACA and how do I use it?

Question

I've found an interesting and powerful tool called IACA (the Intel Architecture Code Analyzer), but I have trouble understanding it. What can I do with it, what are its limitations, and how can I:

  • use it to analyze code in C or C++?
  • use it to analyze code in x86 assembler?

Recommended answer

2019-04: Reached EOL. Suggested alternative: LLVM-MCA

2017-11: Version 3.0 released (latest as of 2019-05-18)

2017-03: Version 2.3 released

What it is:

IACA (the Intel Architecture Code Analyzer) is a (2019: end-of-life) freeware, closed-source static analysis tool made by Intel to statically analyze the scheduling of instructions when executed by modern Intel processors. This allows it to compute, for a given snippet,

  • In Throughput mode, the maximum throughput (the snippet is assumed to be the body of an innermost loop)
  • In Latency mode, the minimum latency from the first instruction to the last.
  • In Trace mode, prints the progress of instructions through their pipeline stages.

when assuming optimal execution conditions (all memory accesses hit the L1 cache and there are no page faults).

IACA supports computing schedulings for Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell and Skylake processors as of version 2.3 and Haswell, Broadwell and Skylake as of version 3.0.

IACA is a command-line tool that produces ASCII text reports and Graphviz diagrams. Versions 2.1 and below supported 32- and 64-bit Linux, Mac OS X and Windows and analysis of 32-bit and 64-bit code; Version 2.2 and up only support 64-bit OSes and analysis of 64-bit code.

IACA's input is a compiled binary of your code, into which have been injected two markers: a start marker and an end marker. The markers make the code unrunnable, but allow the tool to find quickly the relevant pieces of code and analyze them.

You do not need to be able to run the binary on your system; in fact, the binary supplied to IACA cannot run anyway because of the injected markers. IACA only needs to be able to read the binary to be analyzed. Thus it is possible, using IACA on a Pentium III machine, to analyze a Haswell binary that uses FMA instructions.

In C and C++, one gains access to marker-injecting macros with #include "iacaMarks.h", where iacaMarks.h is a header that ships with the tool in the include/ subdirectory.

One then inserts the markers around the innermost loop of interest, or the straight-line chunk of interest, as follows:

/* C or C++ usage of IACA */

while(cond){
    IACA_START
    /* Loop body */
    /* ... */
}
IACA_END
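
As a slightly fuller illustration of the placement above, here is a minimal sketch of a marked kernel. The function `sum_squares` and the `USE_IACA` build flag are hypothetical names chosen for this example; only `iacaMarks.h`, `IACA_START` and `IACA_END` come from the tool. The fallback definitions are an assumption that keeps the ordinary build runnable when the flag is not set.

```c
#include <stddef.h>

/* Assumption: an analysis build passes -DUSE_IACA and has IACA's include/
   directory on the include path. Any other build gets empty markers, so
   the program stays runnable. */
#ifdef USE_IACA
#include "iacaMarks.h"
#else
#define IACA_START
#define IACA_END
#endif

/* Hypothetical kernel: returns x[0]^2 + x[1]^2 + ... + x[n-1]^2. */
float sum_squares(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        IACA_START              /* first thing inside the innermost loop */
        acc += x[i] * x[i];
    }
    IACA_END                    /* immediately after the loop */
    return acc;
}
```

Calling `sum_squares` on the three values 1, 2, 3 in a normal build gives 14; the analysis build of the same file is only meant to be fed to IACA, not executed.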

The application is then rebuilt as it otherwise would be, with optimizations enabled (in Release mode for users of IDEs such as Visual Studio). The output is a binary identical in all respects to the Release build except for the presence of the marks, which make the application non-runnable.

IACA relies on the compiler not reordering the marks excessively. For such analysis builds, certain powerful optimizations may therefore need to be disabled if they move the marks so that the marked region includes extraneous code from outside the innermost loop, or excludes code within it.

IACA's markers are magic byte patterns injected at the correct location within the code. When using iacaMarks.h in C or C++, the compiler handles inserting the magic bytes specified by the header at the correct location. In assembly, however, you must manually insert these marks. Thus, one must do the following:

    ; NASM usage of IACA
    
    mov ebx, 111          ; Start marker bytes
    db 0x64, 0x67, 0x90   ; Start marker bytes
    
.innermostlooplabel:
    ; Loop body
    ; ...
    jne .innermostlooplabel ; Conditional branch backwards to top of loop

    mov ebx, 222          ; End marker bytes
    db 0x64, 0x67, 0x90   ; End marker bytes
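
To make "magic byte patterns" concrete, the sketch below searches a byte buffer for the two sequences shown above. It is an illustration only and says nothing about how IACA itself locates the marks. It assumes the assembler encodes `mov ebx, imm32` in its usual short form (opcode 0xBB followed by a little-endian 32-bit immediate), and the function names are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* mov ebx, 111  ->  BB 6F 00 00 00, followed by db 0x64, 0x67, 0x90 */
static const unsigned char START_MARK[] =
    {0xBB, 0x6F, 0x00, 0x00, 0x00, 0x64, 0x67, 0x90};
/* mov ebx, 222  ->  BB DE 00 00 00, followed by db 0x64, 0x67, 0x90 */
static const unsigned char END_MARK[] =
    {0xBB, 0xDE, 0x00, 0x00, 0x00, 0x64, 0x67, 0x90};

/* Offset of the first occurrence of pat at or after `from`, or -1. */
long find_bytes(const unsigned char *buf, size_t len, size_t from,
                const unsigned char *pat, size_t patlen)
{
    for (size_t i = from; patlen > 0 && i + patlen <= len; i++)
        if (memcmp(buf + i, pat, patlen) == 0)
            return (long)i;
    return -1;
}

/* On success returns 1 and sets *begin to the first byte after the start
   marker and *end to the first byte of the end marker. Returns 0 if either
   marker is missing. */
int find_marked_region(const unsigned char *buf, size_t len,
                       size_t *begin, size_t *end)
{
    long s = find_bytes(buf, len, 0, START_MARK, sizeof START_MARK);
    if (s < 0)
        return 0;
    size_t body = (size_t)s + sizeof START_MARK;
    long e = find_bytes(buf, len, body, END_MARK, sizeof END_MARK);
    if (e < 0)
        return 0;
    *begin = body;
    *end = (size_t)e;
    return 1;
}
```

The bytes between `*begin` and `*end` are then the instructions of the marked region, which is why the markers have to sit exactly around the loop of interest.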

It is critical for C/C++ programmers that the compiler achieve this same pattern.

As an example, let us analyze the following assembler example on the Haswell architecture:

.L2:
    vmovaps         ymm1, [rdi+rax] ;L2
    vfmadd231ps     ymm1, ymm2, [rsi+rax] ;L2
    vmovaps         [rdx+rax], ymm1 ; S1
    add             rax, 32         ; ADD
    jne             .L2             ; JMP
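
For readers more comfortable with C, the following is a rough scalar model of what the loop above computes. It is a sketch under assumptions, not the source the assembly was compiled from: the function and parameter names are hypothetical, `k` stands in for the loop-invariant contents of ymm2, and `n` is assumed to be a multiple of 8 floats (one 32-byte vector). The index runs from `-n` up to zero, which mirrors the `add rax, 32` / `jne .L2` pair.

```c
#include <stddef.h>

/* Scalar model: out[j] = k[j % 8] * b[j] + a[j] for j in [0, n).
   a, b and out play the roles of rdi, rsi and rdx respectively. */
void fma_loop_model(float *out, const float *a, const float *b,
                    const float k[8], ptrdiff_t n)
{
    /* Point at the end of each array so that a negative index counts up
       towards zero, as rax does in the assembly. */
    const float *a_end = a + n;
    const float *b_end = b + n;
    float *out_end = out + n;

    for (ptrdiff_t i = -n; i != 0; i += 8) {        /* add rax, 32 ; jne .L2 */
        for (int lane = 0; lane < 8; lane++) {      /* one ymm register */
            /* vmovaps (load), vfmadd231ps, vmovaps (store) */
            out_end[i + lane] =
                k[lane] * b_end[i + lane] + a_end[i + lane];
        }
    }
}
```

Each outer iteration touches memory three times (two loads and one store), all with base plus index addressing.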

We add the start marker immediately before the .L2 label and the end marker immediately after jne. We then rebuild the software and invoke IACA as follows (on Linux, assuming the bin/ directory is in the path and foo is an ELF64 object containing the IACA marks):

iaca.sh -64 -arch HSW -graph insndeps.dot foo

This produces an analysis report of the 64-bit binary foo as it would run on a Haswell processor, and a graph of the instruction dependencies viewable with Graphviz.

The report is printed to standard output (though it may be directed to a file with a -o switch). The report given for the above snippet is:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.55 Cycles       Throughput Bottleneck: FrontEnd, PORT2_AGU, PORT3_AGU

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.5    0.0  | 0.5  | 1.5    1.0  | 1.5    1.0  | 1.0  | 0.0  | 1.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [rdi+rax*1]
|   2    | 0.5       | 0.5 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [rsi+rax*1]
|   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     | CP | vmovaps ymmword ptr [rdx+rax*1], ymm1
|   1    |           |     |           |           |     |     | 1.0 |     |    | add rax, 0x20
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xffffffffffffffec
Total Num Of Uops: 6

The tool helpfully points out that currently, the bottleneck is the Haswell frontend and Port 2 and 3's AGU. This example allows us to diagnose the problem as the store not being processed by Port 7, and take remedial action.
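
The arithmetic behind that diagnosis can be sketched as follows. This is a simplified model, not IACA's algorithm: it takes the per-port cycle counts from the report and treats the busiest port as a lower bound on cycles per iteration. The "what if" case assumes the store's address generation could move to port 7. On Haswell that port is generally understood to handle only non-indexed store addresses, whereas the store here uses [rdx+rax]. The function and array names are hypothetical.

```c
#include <stddef.h>

/* Cycles per iteration on ports 0..7, copied from the report above. */
const double reported_cycles[8] = {0.5, 0.5, 1.5, 1.5, 1.0, 0.0, 1.0, 0.0};

/* The busiest port gives a lower bound on cycles per iteration. */
double max_port_pressure(const double *cycles, size_t nports)
{
    double worst = 0.0;
    for (size_t i = 0; i < nports; i++)
        if (cycles[i] > worst)
            worst = cycles[i];
    return worst;
}

/* Cycles on each of ports 2 and 3 when `agu_uops` address-generation
   uops are split evenly between those two ports. */
double agu_cycles_per_port(int agu_uops)
{
    return agu_uops / 2.0;
}
```

With two loads and one store address all on ports 2 and 3, `agu_cycles_per_port(3)` is 1.5, matching the report and `max_port_pressure(reported_cycles, 8)`. If only the two loads remained there, `agu_cycles_per_port(2)` would be 1.0, level with ports 4 and 6. The reported block throughput of 1.55 is slightly above the 1.5 bound, which is consistent with the front end also being listed as a bottleneck.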

IACA does not support a certain few instructions, which are ignored in the analysis. It does not support processors older than Nehalem and does not support non-innermost loops in throughput mode (having no ability to guess which branch is taken how often and in what pattern).
