Is there any way to allocate a standard Rust array directly on the heap, skipping the stack entirely?


Question

There are several questions already on Stack Overflow about allocating an array (say [i32]) on the heap. The general recommendation is boxing, e.g. Box<[i32]>. But while boxing works fine enough for smaller arrays, the problem is that the array being boxed has to first be allocated on the stack.

So if the array is too large (say 10 million elements), you will - even with boxing - get a stack overflow (one is unlikely to have a stack that large).

The suggestion then is using Vec<T> instead, that is Vec<i32> in our example. And while that does do the job, it does have a performance impact.

Consider the following program:

fn main() {
    const LENGTH: usize = 10_000;
    let mut a: [i32; LENGTH] = [0; LENGTH];
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
}

time tells me that this program takes about 2.9 seconds to run. I use 10'000 in this example, so I can allocate that on the stack, but I really want one with 10 million.

Now consider the same program but with Vec<T> instead:

fn main() {
    const LENGTH: usize = 10_000;
    let mut a: Vec<i32> = vec![0; LENGTH];
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
}

time tells me that this program takes about 5 seconds to run. Now time isn't super exact, but the difference of about 2 seconds for such a simple program is not an insignificant impact.

Storage is storage: the program with the array is just as fast when the array is boxed. So it's not the heap slowing the Vec<T> version down, but the Vec<T> structure itself.

I also tried with a HashMap (specifically HashMap<usize, i32> to mimic an array structure), but that's far slower than the Vec<T> solution.

If my LENGTH had been 10 million, the first version wouldn't even have run.

If that's not possible, is there a structure that behaves like an array (and Vec<T>) on the heap, but can match the speed and performance of an array?

Answer

Summary: your benchmark is flawed; just use a Vec (as described here), possibly with into_boxed_slice, as it is incredibly unlikely to be slower than a heap allocated array.
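
A minimal sketch of that recommendation (my illustration, not code from the original answer): vec! builds its buffer directly on the heap, and into_boxed_slice then sheds the unused-capacity word, yielding a Box<[i32]> without the 40 MB array ever existing on the stack:

```rust
fn main() {
    const LENGTH: usize = 10_000_000;
    // The buffer behind `vec!` is heap-allocated from the start,
    // so no [i32; 10_000_000] is ever placed on the stack.
    let v: Vec<i32> = vec![0; LENGTH];
    // Optionally convert to a boxed slice, dropping the capacity field.
    let b: Box<[i32]> = v.into_boxed_slice();
    assert_eq!(b.len(), LENGTH);
}
```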

Unfortunately, your benchmarks are flawed. First of all, you probably didn't compile with optimizations (--release for cargo, -O for rustc). If you had, the Rust compiler would have removed all of your code. See the assembly here. Why? Because you never observe the vector/array, so there is no need to do all that work in the first place.

Also, your benchmark is not testing what you actually want to test. You are comparing a stack-allocated array with a heap-allocated vector. You should compare the Vec<T> to a heap-allocated array.

Don't feel bad though: writing benchmarks is incredibly hard for many reasons. Just remember: if you don't know a lot about writing benchmarks, it's better not to trust your own benchmarks without asking others first.

I fixed your benchmark and included all three possibilities: Vec, array on stack and array on heap. You can find the full code here. The results are:

running 3 tests
test array_heap  ... bench:   9,600,979 ns/iter (+/- 1,438,433)
test array_stack ... bench:   9,232,623 ns/iter (+/- 720,699)
test vec_heap    ... bench:   9,259,912 ns/iter (+/- 691,095)
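
On stable Rust, the dead-code-elimination pitfall described above can be avoided without a benchmark harness by routing the result through std::hint::black_box (stable since Rust 1.66). This is a rough sketch under that assumption, not the harness the answer actually used, and it uses a smaller LENGTH so it finishes quickly even unoptimized:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    const LENGTH: usize = 1_000;
    let start = Instant::now();
    let mut a = vec![0i32; LENGTH];
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
    // black_box makes the compiler treat `a` as observed,
    // so the loops cannot be removed as dead code.
    black_box(&a);
    println!("{:?}", start.elapsed());
}
```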

Surprise: the difference between the versions is far smaller than the variance of the measurement. So we can assume they are all pretty much equally fast.

Note that this benchmark is still pretty bad. The two loops can just be replaced by one loop setting all array elements to LENGTH - 1. From taking a quick look at the assembly (and from the rather long time of 9ms), I think that LLVM is not smart enough to actually perform this optimization. But things like this are important and one should be aware of that.
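
To make that observation concrete, here is what the nested loops reduce to (a sketch using the Vec variant): only the writes from the last outer iteration survive, so the whole body is equivalent to filling the buffer with LENGTH - 1 once.

```rust
fn main() {
    const LENGTH: usize = 10_000;
    // Equivalent to the double loop: every element ends up as LENGTH - 1.
    let a = vec![(LENGTH - 1) as i32; LENGTH];
    assert!(a.iter().all(|&x| x == LENGTH as i32 - 1));
}
```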

Finally, let's discuss why both solutions should be equally fast and whether there are actually differences in speed.

The data section of a Vec<T> has exactly the same memory layout as a [T]: just many Ts contiguously in memory. Super simple. This also means both exhibit the same caching behavior (specifically, being very cache-friendly).

The only difference is that a Vec might have more capacity than elements. So Vec itself stores (pointer, length, capacity). That is one word more than a simple (boxed) slice (which stores (pointer, length)). A boxed array doesn't need to store the length, as it's already in the type, so it is just a simple pointer. Whether or not we store one, two or three words is not really important when you will have millions of elements anyway.
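
These word counts can be verified directly with std::mem::size_of; the concrete byte values below assume a typical 64-bit target (one word = 8 bytes):

```rust
use std::mem::size_of;

fn main() {
    // Vec<i32>:       (pointer, length, capacity) = 3 words = 24 bytes
    // Box<[i32]>:     (pointer, length)           = 2 words = 16 bytes
    // Box<[i32; 10]>: pointer only (length is in the type) = 8 bytes
    println!("{}", size_of::<Vec<i32>>());
    println!("{}", size_of::<Box<[i32]>>());
    println!("{}", size_of::<Box<[i32; 10]>>());
}
```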

Accessing one element is the same for all three: we do a bounds check first and then calculate the target pointer via base_pointer + size_of::<T>() * index. But it's important to note that the array storing its length in the type means that the bounds check can be removed more easily by the optimizer! This can be a real advantage.

However, bounds checks are usually already removed by the smart optimizer. In the benchmark code I posted above, there are no bounds checks in the assembly. So while a boxed array could be a bit faster thanks to removed bounds checks, (a) this will be a minor performance difference and (b) it's very unlikely that you will encounter many situations where the bounds check is removed for the array but not for the Vec/slice.
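
As a practical aside (my addition, not part of the original answer): writing the inner loop with iterators instead of indexing sidesteps per-element bounds checks for any of the three containers, without relying on the optimizer to elide them:

```rust
fn main() {
    const LENGTH: usize = 10_000;
    let mut a = vec![0i32; LENGTH];
    for j in 0..LENGTH {
        // iter_mut visits each element directly, so no index
        // needs to be bounds-checked on every access.
        for x in a.iter_mut() {
            *x = j as i32;
        }
    }
    assert_eq!(a[LENGTH - 1], LENGTH as i32 - 1);
}
```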
