How will unrolling affect the cycles per element count (CPE)?


Problem description


  1. how do I calculate CPE (cycles per element) with these code snippets?
  2. what is the difference in the CPE between the 2 given code snippets?


I have this piece of code

void randomFunction(float a[],float Tb[],float c[],long int n){
        int i,j,k;
        for(i=0;i<n;++i)
            for(j=0;j<n;++j)
                for(k=0;k<n-1;k++){ 
                                temp+= a[i*n+k] * Tb[j*n+k];
                }
    
}

This is the assembly for the inner-most loop, from GCC10.3 -O2 (https://godbolt.org/z/cWE16rW1r), for a version of the function with a local float temp=0; that it returns so the loop won't be optimized away:

.L4:
    movss   (%rcx,%rax), %xmm0
    mulss   (%rdx,%rax), %xmm0
    addq    $4, %rax
    addss   %xmm0, %xmm1
    cmpq    %rax, %rsi
    jne     .L4


Now I am trying to 'optimise it' using unrolling.

void randomUnrollingFunction(float a[],float Tb[],float c[],long int n){
    int i,j,k;
    for(i=0;i<n;++i)
        for(j=0;j<n;++j)
            for(k=0;k<n-1;k+=2){//this is the unrolled portion by 2
                            temp+= a[i*n+k] * Tb[j*n+k];
                            temp+= a[i*n+k+1]* Tb[j*n+k+1];
            }

}

I am wondering what is the estimated CPE that will be achieved with this unrolling by factor 2.
CPE is the number of cycles / the number of elements processed.

This is the information for the latency:

Thank you for any help in advance!

Solution

Your loop entirely bottlenecks on addss latency (float add), at 3 cycles per element. Out-of-order exec lets the other work run in the "shadow" of this, assuming your Haswell-like CPU has the memory bandwidth to keep up (footnote 1).

This way of unrolling doesn't help at all; it doesn't change the serial dependency chain, unless you compile with -ffast-math or something similar to let the compiler turn it into

temp1 += a[...+0]* Tb[...+0];
temp2 += a[...+1]* Tb[...+1];

Or add pairs before feeding them into that serial dependency, like

temp +=  a[]*Tb[] + a[+1]*Tb[+1];

One long serial dependency chain is the worst, and also numerically not great: pairwise summation (or especially just a step in that direction using multiple accumulators) would be numerically better and also perform better. (Simd matmul program gives different numerical results).

(If your multiple accumulators are 4 elements of one SIMD vector, you can do 4x the amount of work with the same pattern of asm instructions. But you then need to unroll with multiple vectors, because addps has the same performance characteristics as addss on modern x86 CPUs.)
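For illustration, a minimal sketch of the multiple-accumulator idea applied by hand to the inner loop (my example, not the compiler's output; it reuses the names from the question, assumes the k trip count is a multiple of 4, and reassociates the sum, so a compiler would only do this for you under -ffast-math):

    float t0 = 0, t1 = 0, t2 = 0, t3 = 0;
    for (k = 0; k < n-1; k += 4) {          /* four independent addss dependency chains */
        t0 += a[i*n+k]   * Tb[j*n+k];
        t1 += a[i*n+k+1] * Tb[j*n+k+1];
        t2 += a[i*n+k+2] * Tb[j*n+k+2];
        t3 += a[i*n+k+3] * Tb[j*n+k+3];
    }
    temp += (t0 + t1) + (t2 + t3);          /* combine the partial sums once, outside the loop */

Each tN chain still has a 3-cycle addss recurrence, but four of them can be in flight at once, which is the instruction-level parallelism the single-accumulator version lacks.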

Footnote 1: Two sequential read streams of 4 bytes each per 3 cycles; certainly a desktop Haswell could keep up, and probably even a Xeon competing with many other cores for memory bandwidth. But the reads from a[] probably hit in cache, because a[i*n+k] is the same row repeatedly until we move on to the next outer-loop iteration. So only 1 row of a[] has to stay hot in cache (to get hits next middle iteration) while we scan through a row of Tb. So a[] has to come in from DRAM once total, if n isn't huge, but we loop over the whole Tb[] in order n times.
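For a rough sense of the bandwidth demand (my arithmetic, assuming a typical Haswell clock around 3.5 GHz): 2 loads x 4 bytes per 3 cycles is about 2.7 bytes per cycle, i.e. on the order of 9-10 GB/s of streaming reads, which is well within what a single desktop core can sustain from DRAM.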


More detailed version

See What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? - look for the dependency chains (in this case the addss into %xmm1). Also A whirlwind introduction to dataflow graphs.

Then look for throughput bottlenecks in the back-end and front-end. In your case, latency dominates. (Assuming the front-end matches this Haswell back-end, although it wouldn't take much to keep up with this back-end latency bottleneck. Also, I hate that they number their "functional units" from 1, instead of following Intel's numbering of ports 0,1,5,6 having ALUs. Haswell's FP adder is on port 1, and ports 2/3 are load/store-AGU units, etc.)

ADDSS has 3 cycle latency, so temp += ... can only execute once per 3 cycles. The load / MULSS operations just independently prepare inputs for ADDSS, and the loop overhead is also independent.
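Putting a number on the CPE question (a rough, latency-only estimate, assuming the addss chain is the only limiter as argued above): each element contributes exactly one addss to the loop-carried chain, so the inner loop settles at about 3 cycles per element, i.e. CPE ≈ 3. The same arithmetic applies to your unroll-by-2, because its two addss per iteration are still back-to-back on the same chain: 2 x 3 cycles / 2 elements = 3 cycles per element.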


Note that if not for the latency bottleneck, your loop would bottleneck on the front-end on a real Haswell (4 fused-domain uops per cycle), not on back-end functional units. The loop is 5 fused-domain uops, assuming macro-fusion of the cmp/jne, and that Haswell can keep the memory-source mulss micro-fused despite the indexed addressing mode. (Sandybridge would un-laminate it.)
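That count comes from the six instructions of the inner loop shown earlier (my accounting, under the fusion assumptions just stated):

    movss  (%rcx,%rax), %xmm0    : 1 uop  (load)
    mulss  (%rdx,%rax), %xmm0    : 1 uop  (load + multiply, micro-fused)
    addq   $4, %rax              : 1 uop  (integer ALU)
    addss  %xmm0, %xmm1          : 1 uop  (FP add, the loop-carried chain)
    cmpq   %rax, %rsi
    jne    .L4                   : 1 uop  (compare-and-branch, macro-fused)

5 fused-domain uops at 4 per cycle is about 1.25 cycles of front-end work per iteration, comfortably under the 3-cycle latency bound.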

In the general case, knowing the back-end functional units is not sufficient. The front-end is also a common bottleneck, especially in loops that have to store something, like an actual matmul.

But thanks to the serial dependency on ADDSS (which actually carries across outer loops), the only thing that matters is that dependency chain.

Even a branch mispredict on the last iteration of the inner loop (when the branch is not-taken instead of normally taken) will just give the back end more time to chew through those pending ADDSS operations while the front-end sorts itself out and starts in on the next inner loop.

Since you unrolled in a way that doesn't change the serial dependency, it makes zero difference to performance except for tiny n. (For tiny n, the whole thing can potentially overlap some of this work with independent work in the caller before/after the call. In that case, saving instructions could be helpful, also allowing out-of-order exec to "see farther". Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths is a case where OoO exec can (partially) overlap two independent imul dep chains that in program-order are one after the other.)

Of course at that point, you're considering code outside what you're showing. And even for n=10, that's 10^3 = 1000 inner iterations, and Haswell's ROB is only 192 uops large, with an RS of 60 entries. (https://www.realworldtech.com/haswell-cpu/3/).


Unrolling in a useful way

See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) re: unrolling in ways that do create more instruction-level parallelism, hiding FP dependency chains.

Unrolling differently, only summing once into temp per loop iteration, will keep the same cycles per iteration while still doubling the number of elements you process.

            for(k=0;k<n-1;k+=2){//this is the unrolled portion by 2
                            temp +=  (a[i*n+k] * Tb[j*n+k]  +
                                      a[i*n+k+1]* Tb[j*n+k+1]);
            }

Obviously you can keep doing this until you run into front-end or back-end throughput limits, like one add per clock. The above version does two adds per 3 clocks.
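In CPE terms (again a rough estimate, not a measurement): the pair-summed version retires 2 elements per 3-cycle trip around the addss chain, so CPE ≈ 3/2 = 1.5. Widening the per-iteration tree of adds keeps shrinking the latency-bound estimate, but the throughput limits just mentioned take over: this loop shape still does about one FP add and two loads per element, so with one FP-add port and two load ports on Haswell you bottom out somewhere around CPE ≈ 1 without vectorizing.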

Your "functional unit" table doesn't list FMA (fused multiply-add), but real Haswell has it, with identical performance to FP mul. Iit wouldn't help much if any because your current loop structure does 2 loads per mul+add, so reducing that to 2 loads and one FMA would still bottleneck on loads. Might help with front-end throughput?

What might help reduce loads is unrolling over one of the outer loops, using both a[i*n+k] and a[(i+1)*n+k] with one Tb[j*n+k]. That of course changes the order of the calculation, so isn't legal for a compiler without -ffast-math because FP math isn't strictly associative.
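A sketch of what that outer-loop unroll could look like (my illustration only, reusing the question's names; the function name is made up, it assumes the i trip count is even, and it reassociates the sum into two accumulators, exactly the kind of rewrite a compiler needs -ffast-math to do for you):

    /* hypothetical hand-unrolled version: two rows of a[] share each Tb[] load */
    float unrollOuter(const float a[], const float Tb[], long int n){
        float temp0 = 0.0f, temp1 = 0.0f;
        long int i, j, k;
        for (i = 0; i < n; i += 2)
            for (j = 0; j < n; ++j)
                for (k = 0; k < n-1; k++) {
                    float tb = Tb[j*n+k];            /* one load feeds two mul+adds (or FMAs) */
                    temp0 += a[i*n+k]     * tb;      /* row i   */
                    temp1 += a[(i+1)*n+k] * tb;      /* row i+1 */
                }
        return temp0 + temp1;                        /* combine the two accumulators once */
    }

This does 3 loads per 2 products instead of 4, and as a side benefit the two accumulators provide a little of the instruction-level parallelism discussed above.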


This is a reduction of a matmul, allowing much better optimization

(err wait, your code didn't show where temp was re-initialized, or what the c[] arg was for. I just assumed it was global or something, but probably you actually butchered a normal matmul function that stores a separate temp to c[] after every inner loop, by taking that out. In that case, OoO exec between separate middle-loop iterations is relevant for medium-sized n. But you don't show the scheduler / ROB sizes, and that's not something you can easily model; you need to actually benchmark. So this section is probably only applicable to the question I invented, not what you meant to ask!)

Your loops appear to be summing the elements of a matmul result, but still structured like a matmul. i.e. do a row x column dot product, but instead of storing that into an N x N result[...] matrix, you're just summing the result.

That amounts to summing every product the matmul would have computed, without caring which row x column dot product each one belonged to. Since we no longer need to keep the individual dot products separate, that enables a lot of optimizations! (There's nothing special or ideal about this order of summation; other orders will have different but likely not worse rounding error.)

For example, you don't need to transpose b into Tb, just use b (unless it was naturally already transposed, in which case that's fine). Your matrices are square so it doesn't matter at all.

Furthermore, you can simply load one or a couple elements from a[] and loop over Tb[], doing those products with FMA operations, one load per FMA, into 10 or 12 vector accumulators. (Or of course cache-block this to loop over a contiguous part of Tb that can stay hot in L1d cache.)

That could approach Haswell's max FLOPS throughput of 2x 256-bit FMA per clock = 8 (float elements per YMM vector) x 2 FMAs / clock x 2 FLOP/FMA = 32 FLOP / clock.
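For illustration only, here is a rough sketch of that kind of kernel with AVX2 + FMA intrinsics (my own, untested and untuned: it uses b directly instead of Tb as suggested above, only 4 vector accumulators instead of the 10 or 12 you'd really want, no cache blocking, assumes n is a multiple of 32, and sums all n products per row rather than the question's n-1):

    #include <immintrin.h>

    /* sum over i,j,k of a[i*n+k] * b[k*n+j]  (b untransposed), vectorized over j */
    float sum_all_products(const float *a, const float *b, long int n){
        __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
        __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
        for (long int i = 0; i < n; i++)
            for (long int k = 0; k < n; k++) {
                __m256 av = _mm256_set1_ps(a[i*n+k]);       /* broadcast one element of a */
                for (long int j = 0; j < n; j += 32) {      /* 4 vectors of 8 floats per step */
                    acc0 = _mm256_fmadd_ps(av, _mm256_loadu_ps(&b[k*n+j]),    acc0);
                    acc1 = _mm256_fmadd_ps(av, _mm256_loadu_ps(&b[k*n+j+8]),  acc1);
                    acc2 = _mm256_fmadd_ps(av, _mm256_loadu_ps(&b[k*n+j+16]), acc2);
                    acc3 = _mm256_fmadd_ps(av, _mm256_loadu_ps(&b[k*n+j+24]), acc3);
                }
            }
        /* reduce the 4 vector accumulators to one scalar at the very end */
        __m256 s = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
        float lanes[8];
        _mm256_storeu_ps(lanes, s);
        return lanes[0]+lanes[1]+lanes[2]+lanes[3]+lanes[4]+lanes[5]+lanes[6]+lanes[7];
    }

Each FMA does exactly one load (from b[]), matching the "one load per FMA" pattern; a[] only contributes a broadcast that is hoisted out of the inner loop.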
