(How) can I predict the runtime of a code snippet using LLVM Machine Code Analyzer?


Question

I used llvm-mca to compute the total cycles of a piece of code, thinking they would predict its runtime. However, measuring the runtime dynamically showed almost no correlation. So: why does the total cycle count computed by llvm-mca not accurately predict the runtime? Can I predict the runtime in some better way with llvm-mca?

Details:

I wanted to know the runtime of the following code for different types of begin (and end) iterators, for startValue being 0.0 or 0ULL:

std::accumulate(begin, end, startValue)

To predict the runtimes, I used the Compiler Explorer (https://godbolt.org/z/5HDzSF) with its LLVM Machine Code Analyzer (llvm-mca) plugin, since llvm-mca is "a performance analysis tool that uses information available in LLVM (e.g. scheduling models) to statically measure the performance". I used the following code:

#include <algorithm>   // std::generate
#include <numeric>     // std::accumulate
#include <random>      // std::random_device, std::mt19937, std::uniform_real_distribution
#include <vector>

using vec_t = std::vector<double>;

vec_t generateRandomVector(vec_t::size_type size)
{
    std::random_device rnd_device;
    std::mt19937 mersenne_engine {rnd_device()};
    std::uniform_real_distribution dist{0.0,1.1};
    auto gen = [&dist, &mersenne_engine](){
        return dist(mersenne_engine);
    };
    vec_t result(size);
    std::generate(result.begin(), result.end(), gen);
    return result;
}

double start()
{
    vec_t vec = generateRandomVector(30000000);
    vec_t::iterator vectorBegin = vec.begin();
    vec_t::iterator vectorEnd = vec.end();
    __asm volatile("# LLVM-MCA-BEGIN stopwatchedAccumulate");
    double result = std::accumulate(vectorBegin, vectorEnd, 0.0);
    __asm volatile("# LLVM-MCA-END");    
    return result;
}

However, I see no correlation between the total cycles computed by llvm-mca and the wall clock time from running the corresponding std::accumulate. For example, in the code above, the Total Cycles are 2806 and the runtime is 14 ms. When I switch to the startValue 0ULL, the Total Cycles are 2357, but the runtime is 117 ms.
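
A minimal sketch of such a wall-clock measurement with std::chrono (the exact timing harness is not shown in the question; this is one plausible version, reusing vec_t and generateRandomVector from above):

#include <chrono>
#include <iostream>
#include <numeric>

int main()
{
    vec_t vec = generateRandomVector(30000000);   // fill the vector before timing
    auto t0 = std::chrono::steady_clock::now();
    double result = std::accumulate(vec.begin(), vec.end(), 0.0);
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::cout << "result = " << result << ", runtime = " << ms << " ms\n";
    return 0;
}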

Answer

TL:DR: LLVM-MCA analyzed the entire chunk of code between those comments as if it were the body of a loop, and showed you the cycle count for 100 iterations of all those instructions.

But apart from the actual (tiny) loop, most of the instructions are loop setup, plus the SIMD horizontal sum after the loop, and those in reality run only once. (That's why the cycle count is in the thousands, not 400 = 100 times the 4-cycle latency of vaddpd on Skylake, for the 0.0 version with a double accumulator.)

If you uncheck the "//" (comment filter) box on the Godbolt compiler explorer, or modify the asm statements to add a nop like "nop # LLVM-MCA-END", you'll be able to find those marker lines in the asm window and see what LLVM-MCA was treating as its "loop".
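
For instance, the end-marker statement in the C++ source could become (a tiny sketch; the nop exists only so the marker survives as a real instruction next to the comment in the generated asm):

__asm volatile("nop # LLVM-MCA-END");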

LLVM MCA simulates a specified sequence of assembly instructions and estimates the number of cycles it takes to execute per iteration on the specified target architecture. LLVM MCA makes a number of simplifications, such as (off the top of my head): (1) it assumes that all conditional branches fall through, (2) it assumes that all memory accesses are of the Write Back memory type and all hit in the L1 cache, (3) it assumes that the frontend works optimally, and (4) call instructions are not followed into the called procedure; they just fall through. There are other assumptions as well that I can't recall at the moment.

Essentially, LLVM MCA (like Intel IACA) is only accurate for backend-compute-bound, simple loops. In IACA, while most instructions are supported, a few instructions are not modeled in detail. As an example, prefetch instructions are assumed to consume only microarchitectural resources but take basically zero latency and have no impact on the state of the memory hierarchy. It appears to me that MCA completely ignores such instructions, however. Anyway, this is not particularly relevant to your question.

Now back to your code. In the Compiler Explorer link you provided, you didn't pass any options to LLVM MCA, so the default target architecture takes effect, which is whatever architecture the tool is running on. This happens to be SKX. The total number of cycles you mentioned is for SKX, but it's not clear whether you ran the code on SKX or not. You should use the -mcpu option to specify the architecture; this is independent of the -march you passed to gcc. Also note that comparing core cycles to milliseconds is not meaningful. You can use the RDTSC instruction to measure the execution time in terms of core cycles.
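
A minimal sketch of such a cycle measurement, reusing vec_t and generateRandomVector from the question (assumptions: GCC or Clang on x86-64, where __rdtsc from <x86intrin.h> wraps the RDTSC instruction; note RDTSC counts TSC reference ticks, and serialization and frequency-scaling caveats are ignored here):

#include <numeric>       // std::accumulate
#include <x86intrin.h>   // __rdtsc (GCC/Clang intrinsic for RDTSC)

unsigned long long accumulateCycles()
{
    vec_t vec = generateRandomVector(30000000);
    unsigned long long start = __rdtsc();         // read the time-stamp counter before...
    volatile double result = std::accumulate(vec.begin(), vec.end(), 0.0);
    unsigned long long stop = __rdtsc();          // ...and after the code under test
    (void)result;                                 // volatile keeps the computation alive
    return stop - start;                          // elapsed TSC ticks
}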

Notice how the compiler has inlined the call to std::accumulate. Apparently, this code starts at assembly line 405, and the last instruction of std::accumulate is at line 444, a total of 38 instructions. The reason why the LLVM MCA estimate does not match the actual performance is now clear: the tool assumes that all of these instructions are executed in a loop for a large number of iterations. Obviously that is not the case; there is only one loop, at lines 420-424:

.L75:
        vaddpd  ymm0, ymm0, YMMWORD PTR [rax]
        add     rax, 32
        cmp     rax, rcx
        jne     .L75

Only this code should be the input to MCA. At the source code level, there is really no way to tell MCA to only analyze this code. You'd have to manually inline std::accumulate and place the LLVM-MCA-BEGIN and LLVM-MCA-END marks somewhere inside it.
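
A rough sketch of that workaround (an assumed shape, not the original poster's code; the asm markers act as optimization barriers, so the compiler may generate a somewhat different loop than for the plain std::accumulate, and the result should be checked in the asm window):

double accumulateMarked(vec_t::iterator begin, vec_t::iterator end)
{
    double acc = 0.0;                            // manually inlined std::accumulate
    __asm volatile("# LLVM-MCA-BEGIN inner");    // markers bracket only the loop
    for (auto it = begin; it != end; ++it)
        acc += *it;
    __asm volatile("# LLVM-MCA-END");
    return acc;
}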

When passing 0ULL instead of 0.0 to std::accumulate, the input to LLVM MCA would start at assembly instruction 402 and end at 441. Note that any instructions not supported by MCA (such as vcvtsi2sdq) will be completely omitted from the analysis. The part of the code that is actually in a loop is:

.L78:
        vxorpd  xmm0, xmm0, xmm0
        vcvtsi2sdq      xmm0, xmm0, rax
        test    rax, rax
        jns     .L75
        mov     rcx, rax
        and     eax, 1
        vxorpd  xmm0, xmm0, xmm0
        shr     rcx
        or      rcx, rax
        vcvtsi2sdq      xmm0, xmm0, rcx
        vaddsd  xmm0, xmm0, xmm0
.L75:
        vaddsd  xmm0, xmm0, QWORD PTR [rdx]
        vcomisd xmm0, xmm1
        vcvttsd2si      rax, xmm0
        jb      .L77
        vsubsd  xmm0, xmm0, xmm1
        vcvttsd2si      rax, xmm0
        xor     rax, rdi
.L77:
        add     rdx, 8
        cmp     rsi, rdx
        jne     .L78

Note that there is a conditional jump, jns, in the code whose target address is somewhere within the block. MCA will just assume that the jump falls through. If that is not the case in an actual run of the code, MCA will unnecessarily add the overhead of 7 instructions. There is another jump, jb, but I think this one is not important for large vectors and will fall through most of the time. The last jump, jne, is also the last instruction, so MCA will assume that the next instruction is the top one again; for a sufficiently large number of iterations, this assumption is perfectly fine.

Overall, it's obvious that the first code is much smaller than the second one, so it's probably much faster, and your measurements confirm this. You also don't really need a microarchitecture analysis tool to understand why: the second code simply does a lot more computation. So you can quickly conclude that passing 0.0 is better in terms of both performance and code size on all architectures.
