Why is there a large performance impact when looping over an array with 240 or more elements?

Question

When running a sum loop over an array in Rust, I noticed a huge performance drop when CAPACITY >= 240. CAPACITY = 239 is about 80 times faster.

Is there special compilation optimization Rust is doing for "short" arrays?

Compiled with rustc -C opt-level=3.

use std::time::Instant;

const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

fn main() {
    let mut arr = [0; CAPACITY];
    for i in 0..CAPACITY {
        arr[i] = i;
    }
    let mut sum = 0;
    let now = Instant::now();
    for _ in 0..IN_LOOPS {
        let mut s = 0;
        for i in 0..arr.len() {
            s += arr[i];
        }
        sum += s;
    }
    println!("sum:{} time:{:?}", sum, now.elapsed());
}

Answer

Summary: below 240, LLVM fully unrolls the inner loop and that lets it notice it can optimize away the repeat loop, breaking your benchmark.


You found a magic threshold above which LLVM stops performing certain optimizations. The threshold is 8 bytes * 240 = 1920 bytes (your array is an array of usizes, therefore the length is multiplied by 8 bytes, assuming an x86-64 CPU). In this benchmark, one specific optimization – only performed for length 239 – is responsible for the huge speed difference. But let's start slowly:

(All code in this answer is compiled with -C opt-level=3.)

pub fn foo() -> usize {
    let arr = [0; 240];
    let mut s = 0;
    for i in 0..arr.len() {
        s += arr[i];
    }
    s
}

This simple code will produce roughly the assembly one would expect: a loop adding up elements. However, if you change 240 to 239, the emitted assembly differs quite a lot. See it on Godbolt Compiler Explorer. Here is a small part of the assembly:

movdqa  xmm1, xmmword ptr [rsp + 32]
movdqa  xmm0, xmmword ptr [rsp + 48]
paddq   xmm1, xmmword ptr [rsp]
paddq   xmm0, xmmword ptr [rsp + 16]
paddq   xmm1, xmmword ptr [rsp + 64]
; more stuff omitted here ...
paddq   xmm0, xmmword ptr [rsp + 1840]
paddq   xmm1, xmmword ptr [rsp + 1856]
paddq   xmm0, xmmword ptr [rsp + 1872]
paddq   xmm0, xmm1
pshufd  xmm1, xmm0, 78
paddq   xmm1, xmm0

This is what's called loop unrolling: LLVM pastes the loop body a bunch of times to avoid having to execute all those "loop management" instructions, i.e. incrementing the loop variable, checking whether the loop has ended, and jumping back to the start of the loop.
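
To make the transformation concrete, here is a hand-written sketch of the same idea in Rust (the unroll factor of 4 and the function name are illustrative assumptions; LLVM chooses its own factor and emits SIMD instructions instead):

pub fn sum_unrolled_by_4(arr: &[usize]) -> usize {
    // Sketch only: manual unrolling by a factor of 4.
    let mut s = 0;
    let chunks = arr.len() / 4;
    for c in 0..chunks {
        let i = c * 4;
        // Four element additions per increment/compare/jump of the loop.
        s += arr[i];
        s += arr[i + 1];
        s += arr[i + 2];
        s += arr[i + 3];
    }
    // Remaining elements when the length is not a multiple of 4.
    for i in chunks * 4..arr.len() {
        s += arr[i];
    }
    s
}

The point is simply that each pass through the loop now does four useful additions per increment/compare/jump, so the loop overhead is amortized.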

In case you're wondering: the paddq and similar instructions are SIMD instructions which allow summing up multiple values in parallel. Moreover, two 16-byte SIMD registers (xmm0 and xmm1) are used in parallel so that instruction-level parallelism of the CPU can basically execute two of these instructions at the same time. After all, they are independent of one another. In the end, both registers are added together and then horizontally summed down to the scalar result.
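
As a rough Rust-level model of this (an assumption about how the generated code behaves, not a literal translation of the assembly), keeping two independent partial sums and combining them at the end looks like this:

pub fn sum_two_accumulators(arr: &[usize]) -> usize {
    // Sketch: two independent accumulators, analogous to xmm0/xmm1 above
    // (each XMM register actually holds two usize lanes).
    let mut acc0 = 0;
    let mut acc1 = 0;
    let mut i = 0;
    while i + 1 < arr.len() {
        acc0 += arr[i];
        acc1 += arr[i + 1];
        i += 2;
    }
    // Leftover element for odd lengths.
    if i < arr.len() {
        acc0 += arr[i];
    }
    // "Horizontal" combination into the scalar result.
    acc0 + acc1
}

Because acc0 and acc1 have no data dependency on each other, the CPU can keep both addition chains in flight at the same time, which is exactly what the two XMM accumulators achieve.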

Modern mainstream x86 CPUs (not low-power Atom) really can do 2 vector loads per clock when they hit in L1d cache, and paddq throughput is also at least 2 per clock, with 1 cycle latency on most CPUs. See https://agner.org/optimize/ and also this Q&A about multiple accumulators to hide latency (of FP FMA for a dot product) and bottleneck on throughput instead.

LLVM does unroll small loops somewhat even when it's not fully unrolling, and it still uses multiple accumulators. So usually, front-end bandwidth and back-end latency bottlenecks aren't a huge problem for LLVM-generated loops even without full unrolling.

But loop unrolling is not responsible for a performance difference of factor 80! At least not loop unrolling alone. Let's take a look at the actual benchmarking code, which puts one loop inside another:

const CAPACITY: usize = 239;
const IN_LOOPS: usize = 500000;

pub fn foo() -> usize {
    let mut arr = [0; CAPACITY];
    for i in 0..CAPACITY {
        arr[i] = i;
    }

    let mut sum = 0;
    for _ in 0..IN_LOOPS {
        let mut s = 0;
        for i in 0..arr.len() {
            s += arr[i];
        }
        sum += s;
    }

    sum
}

(On the Godbolt Compiler Explorer)

The assembly for CAPACITY = 240 looks normal: two nested loops. (At the start of the function there is quite some code just for initializing, which we will ignore.) For 239, however, it looks very different! We see that the initializing loop and the inner loop got unrolled: so far so expected.

The important difference is that for 239, LLVM was able to figure out that the result of the inner loop does not depend on the outer loop! As a consequence, LLVM emits code that basically first executes only the inner loop (calculating the sum) and then simulates the outer loop by adding up sum a bunch of times!

First we see almost the same assembly as above (the assembly representing the inner loop). Afterwards we see this (I added comments to explain the assembly; the comments with * are especially important):

        ; at the start of the function, `rbx` was set to 0

        movq    rax, xmm1     ; result of SIMD summing up stored in `rax`
        add     rax, 711      ; add up missing terms from loop unrolling
        mov     ecx, 500000   ; * init loop variable outer loop
.LBB0_1:
        add     rbx, rax      ; * rbx += rax
        add     rcx, -1       ; * decrement loop variable
        jne     .LBB0_1       ; * if loop variable != 0 jump to LBB0_1
        mov     rax, rbx      ; move rbx (the sum) back to rax
        ; two unimportant instructions omitted
        ret                   ; the return value is stored in `rax`

As you can see here, the result of the inner loop is taken, added up as many times as the outer loop would have run, and then returned. LLVM can only perform this optimization because it understood that the inner loop is independent of the outer one.
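
In Rust terms, the optimized program behaves roughly like the following hand-written sketch (an assumed model of the transformation, not the compiler's actual output):

const CAPACITY: usize = 239;
const IN_LOOPS: usize = 500000;

// Model of what LLVM effectively did for CAPACITY = 239: run the inner
// loop once, then repeatedly add its result instead of recomputing it.
pub fn foo_model(arr: &[usize; CAPACITY]) -> usize {
    // Inner loop executed a single time.
    let mut s = 0;
    for i in 0..arr.len() {
        s += arr[i];
    }
    // The outer loop collapses to adding the same value IN_LOOPS times
    // (which could be simplified even further to s * IN_LOOPS).
    let mut sum = 0;
    for _ in 0..IN_LOOPS {
        sum += s;
    }
    sum
}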

This means the runtime changes from CAPACITY * IN_LOOPS to CAPACITY + IN_LOOPS. And this is responsible for the huge performance difference.
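
Concretely, with the constants from the benchmark above, that is 239 * 500000 = 119,500,000 element additions before the transformation versus roughly 239 + 500000 = 500,239 additions afterwards.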

An additional note: can you do anything about this? Not really. LLVM has to have such magic thresholds, as without them LLVM optimizations could take forever to complete on certain code. But we can also agree that this code was highly artificial. In practice, I doubt that such a huge difference would occur. The difference due to full loop unrolling is usually not even a factor of 2 in these cases. So there is no need to worry about real use cases.

As a last note about idiomatic Rust code: arr.iter().sum() is a better way to sum up all elements of an array. And changing this in the second example does not lead to any notable differences in emitted assembly. You should use short and idiomatic versions unless you measured that it hurts performance.
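
For illustration, a minimal sketch of the summation written with iterators (the function name is mine; per the note above, it produces essentially the same assembly as the indexed loop at opt-level=3):

pub fn sum_idiomatic(arr: &[usize]) -> usize {
    // Iterator-based sum; as noted above, this compiles to essentially
    // the same code as the explicit index loop.
    arr.iter().sum()
}

In the benchmark, the body of the inner loop would simply become: let s: usize = arr.iter().sum();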
