How can I prevent the Rust benchmark library from optimizing away my code?

Question

I have a simple idea I'm trying to benchmark in Rust. However, when I go to measure it using test::Bencher, the base case that I'm trying to compare against:

#![feature(test)]
extern crate test;

#[cfg(test)]
mod tests {

    use test::black_box;
    use test::Bencher;

    const ITERATIONS: usize = 100_000;

    struct CompoundValue {
        pub a: u64,
        pub b: u64,
        pub c: u64,
        pub d: u64,
        pub e: u64,
    }

    #[bench]
    fn bench_in_place(b: &mut Bencher) {
        let mut compound_value = CompoundValue {
            a: 0,
            b: 2,
            c: 0,
            d: 5,
            e: 0,
        };

        let val: &mut CompoundValue = &mut compound_value;

        let result = b.iter(|| {
            let mut f : u64 = black_box(0);
            for _ in 0..ITERATIONS {
                f += val.a + val.b + val.c + val.d + val.e;
            }
            f = black_box(f);
            return f;
        });
        assert_eq!((), result);
    }
}

is optimized away entirely by the compiler, resulting in:

running 1 test
test tests::bench_in_place ... bench:           0 ns/iter (+/- 1)

As you can see in the gist, I have tried to employ the suggestions set forth in the documentation, namely:

  • Making use of the test::black_box method to hide implementation details from the compiler.
  • Returning the calculated value from the closure passed to the iter method.

Are there any other tricks I can try?

Answer

The problem here is that the compiler can see that the result of the loop is the same every time iter calls the closure (it just adds some constant to f), because val never changes.

Looking at the assembly (by passing --emit asm to the compiler) demonstrates this:

_ZN5tests14bench_in_place20h6a2d53fa00d7c649yaaE:
    ; ...
    movq    %rdi, %r14
    leaq    40(%rsp), %rdi
    callq   _ZN3sys4time5inner10SteadyTime3now20had09d1fa7ded8f25mjwE@PLT
    movq    (%r14), %rax
    testq   %rax, %rax
    je  .LBB0_3
    leaq    24(%rsp), %rcx
    movl    $700000, %edx
.LBB0_2:
    movq    $0, 24(%rsp)
    #APP
    #NO_APP
    movq    24(%rsp), %rsi
    addq    %rdx, %rsi
    movq    %rsi, 24(%rsp)
    #APP
    #NO_APP
    movq    24(%rsp), %rsi
    movq    %rsi, 24(%rsp)
    #APP
    #NO_APP
    decq    %rax
    jne .LBB0_2
.LBB0_3:
    leaq    24(%rsp), %rbx
    movq    %rbx, %rdi
    callq   _ZN3sys4time5inner10SteadyTime3now20had09d1fa7ded8f25mjwE@PLT
    leaq    8(%rsp), %rdi
    leaq    40(%rsp), %rdx
    movq    %rbx, %rsi
    callq   _ZN3sys4time5inner30_$RF$$u27$a$u20$SteadyTime.Sub3sub20h940fd3596b83a3c25kwE@PLT
    movups  8(%rsp), %xmm0
    movups  %xmm0, 8(%r14)
    addq    $56, %rsp
    popq    %rbx
    popq    %r14
    retq

The section between .LBB0_2: and jne .LBB0_2 is what the call to iter compiles down to: it repeatedly runs the code in the closure that you passed to it. The #APP #NO_APP pairs are the black_box calls. You can see that the iter loop doesn't do much: movq just moves data between registers and the stack, and addq/decq just add to and decrement some integers.

Looking above that loop, there's movl $700000, %edx: this loads the constant 700_000 into the edx register... and, suspiciously, 700000 = ITERATIONS * (0 + 2 + 0 + 5 + 0). (The other stuff in the code isn't so interesting.)

The way to disguise this is to black_box the input, e.g. I might start off with the benchmark written like:

#[bench]
fn bench_in_place(b: &mut Bencher) {
    let mut compound_value = CompoundValue {
        a: 0,
        b: 2,
        c: 0,
        d: 5,
        e: 0,
    };

    b.iter(|| {
        let mut f : u64 = 0;
        let val = black_box(&mut compound_value);
        for _ in 0..ITERATIONS {
            f += val.a + val.b + val.c + val.d + val.e;
        }
        f
    });
}

In particular, val is black_box'd inside the closure, so that the compiler can't precompute the addition and reuse it for each call.

However, this is still optimised to be very fast: 1 ns/iter for me. Checking the assembly again reveals the problem (I've trimmed the assembly down to just the loop that contains the APP/NO_APP pairs, i.e. the calls to iter's closure):

.LBB0_2:
    movq    %rcx, 56(%rsp)
    #APP
    #NO_APP
    movq    56(%rsp), %rsi
    movq    8(%rsi), %rdi
    addq    (%rsi), %rdi
    addq    16(%rsi), %rdi
    addq    24(%rsi), %rdi
    addq    32(%rsi), %rdi
    imulq   $100000, %rdi, %rsi
    movq    %rsi, 56(%rsp)
    #APP
    #NO_APP
    decq    %rax
    jne .LBB0_2

Now the compiler has seen that val doesn't change over the course of the for loop, so it has correctly transformed the loop into just summing all the elements of val (that's the movq load followed by the sequence of 4 addqs) and then multiplying that by ITERATIONS (the imulq).
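
As a rough illustration of that transformation, the sketch below writes the loop and its closed form as two plain functions and checks that they agree for the values used in the question. The names looped and hoisted are made up for this example, and ITERATIONS is typed as u64 here (the question uses usize) purely to avoid a cast.

```rust
const ITERATIONS: u64 = 100_000;

struct CompoundValue {
    a: u64,
    b: u64,
    c: u64,
    d: u64,
    e: u64,
}

/// The loop as written in the benchmark closure.
fn looped(val: &CompoundValue) -> u64 {
    let mut f: u64 = 0;
    for _ in 0..ITERATIONS {
        f += val.a + val.b + val.c + val.d + val.e;
    }
    f
}

/// What the optimizer effectively emits: sum the fields once, then multiply.
fn hoisted(val: &CompoundValue) -> u64 {
    let sum = val.a + val.b + val.c + val.d + val.e; // the load plus 4 addq
    sum * ITERATIONS // the imulq
}

fn main() {
    let val = CompoundValue { a: 0, b: 2, c: 0, d: 5, e: 0 };
    // Both forms agree as long as nothing overflows, as is the case here.
    assert_eq!(looped(&val), hoisted(&val));
    println!("{}", hoisted(&val)); // prints 700000
}
```

Because the two forms are equivalent whenever val is known not to change inside the loop, the optimizer is free to pick the cheap one, which is why the benchmark reports about 1 ns/iter.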

To fix this, we can do the same thing: move the black_box deeper, so that the compiler can't reason about the value between different iterations of the loop:

#[bench]
fn bench_in_place(b: &mut Bencher) {
    let mut compound_value = CompoundValue {
        a: 0,
        b: 2,
        c: 0,
        d: 5,
        e: 0,
    };

    b.iter(|| {
        let mut f : u64 = 0;
        for _ in 0..ITERATIONS {
            let val = black_box(&mut compound_value);
            f += val.a + val.b + val.c + val.d + val.e;
        }
        f
    });
}

This version now takes 137,142 ns/iter for me, although the repeated calls to black_box probably cause non-trivial overhead (having to repeatedly write to the stack, and then read it back).

We can look at the asm, just to be sure:

.LBB0_2:
    movl    $100000, %ebx
    xorl    %edi, %edi
    .align  16, 0x90
.LBB0_3:
    movq    %rdx, 56(%rsp)
    #APP
    #NO_APP
    movq    56(%rsp), %rax
    addq    (%rax), %rdi
    addq    8(%rax), %rdi
    addq    16(%rax), %rdi
    addq    24(%rax), %rdi
    addq    32(%rax), %rdi
    decq    %rbx
    jne .LBB0_3
    incq    %rcx
    movq    %rdi, 56(%rsp)
    #APP
    #NO_APP
    cmpq    %r8, %rcx
    jne .LBB0_2

Now the call to iter is two loops: the outer loop that calls the closure many times (.LBB0_2: to jne .LBB0_2), and the for loop inside the closure (.LBB0_3: to jne .LBB0_3). The inner loop is indeed doing a call to black_box (APP/NO_APP) followed by 5 additions. The outer loop is setting f to zero (xorl %edi, %edi), running the inner loop, and then black_boxing f (the second APP/NO_APP).
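
For readers without a nightly toolchain, the same idea (black_box inside the inner loop) can be sketched on stable Rust with std::hint::black_box, available since Rust 1.66. This is only an assumption-laden illustration, not the answer's original code: the helper name sum_with_black_box and the single Instant-based timing are invented for the example, a shared reference is used instead of &mut, and one wall-clock measurement is far less rigorous than what test::Bencher does.

```rust
use std::hint::black_box;
use std::time::Instant;

const ITERATIONS: usize = 100_000;

struct CompoundValue {
    a: u64,
    b: u64,
    c: u64,
    d: u64,
    e: u64,
}

/// Sums the fields ITERATIONS times, passing the reference through
/// black_box on every pass so the sum cannot be hoisted out of the loop.
fn sum_with_black_box(value: &CompoundValue) -> u64 {
    let mut f: u64 = 0;
    for _ in 0..ITERATIONS {
        let val = black_box(value);
        f += val.a + val.b + val.c + val.d + val.e;
    }
    f
}

fn main() {
    let compound_value = CompoundValue { a: 0, b: 2, c: 0, d: 5, e: 0 };

    let start = Instant::now();
    // black_box the result as well, so the call itself is not discarded.
    let result = black_box(sum_with_black_box(&compound_value));
    let elapsed = start.elapsed();

    println!("result = {result}, elapsed = {elapsed:?}");
}
```

The standard library documents black_box as a best-effort hint, so checking the generated assembly, as done above, remains the reliable way to confirm what is actually being measured.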

(Benchmarking exactly what you want to benchmark can be tricky!)
