Why does my code run slower when I remove bounds checks?



I'm writing a linear algebra library in Rust.

I have a function to get a reference to a matrix cell at a given row and column. This function starts with a pair of assertions that the row and column are within bounds:

#[inline(always)]
pub fn get(&self, row: usize, col: usize) -> &T {
    assert!(col < self.num_cols.as_nat());
    assert!(row < self.num_rows.as_nat());
    unsafe {
        self.get_unchecked(row, col)
    }
}

In tight loops, I thought it might be faster to skip the bounds check, so I provide a get_unchecked method:

#[inline(always)]
pub unsafe fn get_unchecked(&self, row: usize, col: usize) -> &T {
    self.data.get_unchecked(self.row_col_index(row, col))
}

The bizarre thing is, when I use these methods to implement matrix multiplication (via row and column iterators), my benchmarks show that it actually goes about 33% faster when I check the bounds. Why is this happening?
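
For illustration, a multiplication in this style might look like the following minimal sketch. The `Matrix` struct, the row-major `Vec<f64>` layout, and the triple loop are my assumptions for the example, not the library's actual code; the point is that every element access goes through a bounds-checked `get`, which is where the surprising speed difference shows up.

```rust
// Minimal sketch (assumed layout, not the library's real API): a row-major
// matrix where every element access runs the two asserts from the question.
struct Matrix {
    data: Vec<f64>,
    rows: usize,
    cols: usize,
}

impl Matrix {
    fn get(&self, row: usize, col: usize) -> f64 {
        assert!(row < self.rows);
        assert!(col < self.cols);
        self.data[row * self.cols + col]
    }

    fn mul(&self, other: &Matrix) -> Matrix {
        assert_eq!(self.cols, other.rows);
        let mut out = Matrix {
            data: vec![0.0; self.rows * other.cols],
            rows: self.rows,
            cols: other.cols,
        };
        for i in 0..self.rows {
            for j in 0..other.cols {
                let mut acc = 0.0;
                for k in 0..self.cols {
                    // Two bounds-checked accesses per multiply-add.
                    acc += self.get(i, k) * other.get(k, j);
                }
                out.data[i * out.cols + j] = acc;
            }
        }
        out
    }
}
```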

I've tried this on two different computers, one running Linux and the other OSX, and both show the effect.

The full code is on GitHub. The relevant file is lib.rs. Functions of interest are:

  • get at line 68
  • get_unchecked at line 81
  • next at line 551
  • mul at line 796
  • matrix_mul (benchmark) at line 1038

Note that I'm using type-level numbers to parameterise my matrices (with the option for dynamic sizes too via dummy tagged types), so the benchmark is multiplying two 100x100 matrices.
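
The type-level-number idea can be sketched roughly as below. The trait name `Nat`, the marker type `N100`, and the field names are hypothetical stand-ins chosen to match the `self.num_cols.as_nat()` calls in the question; the real library's encoding may differ.

```rust
// Hypothetical sketch of type-level dimensions: each size is a zero-sized
// marker type that reports its value through a trait method.
trait Nat {
    fn as_nat(&self) -> usize;
}

struct N100;
impl Nat for N100 {
    fn as_nat(&self) -> usize {
        100
    }
}

// The matrix is parameterised over its row and column types, so a 100x100
// matrix has a different type from, say, a 50x50 one.
struct Matrix<R: Nat, C: Nat> {
    data: Vec<f64>,
    num_rows: R,
    num_cols: C,
}

impl<R: Nat, C: Nat> Matrix<R, C> {
    fn get(&self, row: usize, col: usize) -> f64 {
        assert!(col < self.num_cols.as_nat());
        assert!(row < self.num_rows.as_nat());
        self.data[row * self.num_cols.as_nat() + col]
    }
}
```

A dynamically sized matrix would then use a marker type whose `as_nat` reads a runtime field instead of returning a constant.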

UPDATE:

I've significantly simplified the code, removing stuff not directly used in the benchmark and removing generic parameters. I also wrote an implementation of multiplication without using iterators, and that version doesn't cause the same effect. See here for this version of the code. Cloning the minimal-performance branch and running cargo bench will benchmark the two different implementations of multiplication (note that the assertions are commented out to start with in that branch).

Also of note is that if I change the get* functions to return copies of the data instead of references (f64 instead of &f64), the effect disappears (but the code is slightly slower all round).
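
The two variants from that last observation can be sketched side by side (the `row_col_index` helper and field names are assumptions for the example). The difference is that the copy-returning version loads the `f64` immediately, while the reference-returning version hands the caller a borrow into `data`, which can change what the optimizer is allowed to assume about aliasing.

```rust
// Sketch of the two signature variants mentioned in the update
// (assumed struct layout, not the library's actual code).
struct Matrix {
    data: Vec<f64>,
    num_cols: usize,
}

impl Matrix {
    fn row_col_index(&self, row: usize, col: usize) -> usize {
        row * self.num_cols + col
    }

    // Reference-returning variant: the caller holds a borrow into `data`.
    fn get_ref(&self, row: usize, col: usize) -> &f64 {
        &self.data[self.row_col_index(row, col)]
    }

    // Copy-returning variant: the value is loaded at the call site and
    // no borrow escapes the method.
    fn get_copy(&self, row: usize, col: usize) -> f64 {
        self.data[self.row_col_index(row, col)]
    }
}
```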

Solution

This is not a complete answer, because I haven't tested my claims, but it might explain what you're seeing. Either way, the only way to know for sure is to generate the LLVM IR and the assembler output. If you need a manual for LLVM IR, you can find it here: http://llvm.org/docs/LangRef.html .

Anyway, enough about that. Let's say you have this code:

#[inline(always)]
pub unsafe fn get_unchecked(&self, row: usize, col: usize) -> &T {
    self.data.get_unchecked(self.row_col_index(row, col))
}

The compiler turns this into an indirect load, which will probably be optimized well in a tight loop. Note, though, that each such load can still go wrong: if the computed index is invalid, the access is out of bounds.

In the case with the bounds check combined with the tight loop, LLVM does a little trick. Because the load is in a tight loop (a matrix multiplication) and because the result of the bounds check depends on the bounds of the loop, it will remove the bounds check from the loop and put it around the loop. In other words, the loop itself will remain exactly the same, but with an extra bounds check.
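
The hoisting described above can be demonstrated with a standalone sketch (my own illustration, not code from the question): an up-front `assert!` gives LLVM the fact it needs to prove every index in the loop is in range, so the per-iteration checks can be elided while the loop body itself stays the same.

```rust
// Sketch of bounds-check hoisting: the assert establishes `n <= data.len()`
// once, before the loop, so LLVM can prove `i < data.len()` for every
// iteration and remove the per-element check inside the loop.
fn sum_first_n(data: &[f64], n: usize) -> f64 {
    assert!(n <= data.len()); // hoisted bounds check
    let mut total = 0.0;
    for i in 0..n {
        total += data[i]; // provably in bounds: no check needed here
    }
    total
}
```

Without the assert, each `data[i]` carries its own check (and a potential panic edge), which can also block vectorization.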

So the two versions end up nearly identical: the loop body is the same, and only the placement of the check differs.

So what changed? Two things:

  1. If we have the additional bounds check, there's no possibility anymore for an out-of-bounds load. This might trigger an optimization that wasn't possible before. Still, considering how these checks are usually implemented, this wouldn't be my guess.

  2. Another thing to consider is that the keyword 'unsafe' might trigger extra behavior in some languages, such as adding a condition, pinning data, or disabling a GC. I'm not sure about the exact behavior in Rust; the only way to find out these details is to look at the LLVM IR.
