Is there any way to allocate a standard Rust array directly on the heap, skipping the stack entirely?

Question

There are several questions already on Stack Overflow about allocating an array (say [i32]) on the heap. The general recommendation is boxing, e.g. Box<[i32]>. But while boxing works fine enough for smaller arrays, the problem is that the array being boxed has to first be allocated on the stack.

So if the array is too large (say 10 million elements), you will - even with boxing - get a stack overflow (one is unlikely to have a stack that large).

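For illustration, this is the kind of code that runs into the problem described above (my example, not from the original question): the array literal is constructed on the stack before it is moved into the box, so with a large LENGTH a debug build will typically abort with a stack overflow.

fn main() {
    const LENGTH: usize = 10_000_000;
    // The [0; LENGTH] temporary is built on the stack first (at least
    // without optimizations), blowing through the default ~8 MiB stack.
    let a: Box<[i32; LENGTH]> = Box::new([0; LENGTH]);
    println!("{}", a[LENGTH - 1]);
}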

The suggestion then is using Vec<T> instead, that is Vec<i32> in our example. And while that does do the job, it does have a performance impact.

Consider the following program:

fn main() {
    const LENGTH: usize = 10_000;
    // A plain array, allocated on the stack.
    let mut a: [i32; LENGTH] = [0; LENGTH];
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
}

time tells me that this program takes about 2.9 seconds to run. I use 10'000 in this example, so I can allocate that on the stack, but I really want one with 10 million.

Now consider the same program but with Vec<T> instead:

fn main() {
    const LENGTH: usize = 10_000;
    // The same workload, but with a heap-allocated Vec instead.
    let mut a: Vec<i32> = vec![0; LENGTH];
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
}

time tells me that this program takes about 5 seconds to run. Now time isn't super exact, but the difference of about 2 seconds for such a simple program is not an insignificant impact.

Storage is storage: the program with the array is just as fast when the array is boxed. So it's not the heap slowing the Vec<T> version down, but the Vec<T> structure itself.

I also tried with a HashMap (specifically HashMap<usize, i32> to mimic an array structure), but that's far slower than the Vec<T> solution.

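For completeness, such a HashMap attempt presumably looks roughly like this (my reconstruction, not the asker's actual code); every access has to hash the key, which is why it cannot compete with contiguous storage:

use std::collections::HashMap;

fn main() {
    const LENGTH: usize = 10_000;
    // Keys play the role of indices; each write pays for hashing.
    let mut a: HashMap<usize, i32> = HashMap::with_capacity(LENGTH);
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a.insert(i, j as i32);
        }
    }
}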

If my LENGTH had been 10 million, the first version wouldn't even have run.

If that's not possible, is there a structure that behaves like an array (and Vec<T>) on the heap, but can match the speed and performance of an array?

Answer

Summary: your benchmark is flawed; just use a Vec (as described here), possibly with into_boxed_slice, as it is incredibly unlikely to be slower than a heap-allocated array.

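As a minimal sketch of that recommendation (my example, not taken from the answer): build the data in a Vec, which allocates on the heap from the start, and call into_boxed_slice if you want to drop the capacity field afterwards.

fn main() {
    const LENGTH: usize = 10_000_000;
    // vec! allocates directly on the heap; nothing this large ever
    // touches the stack.
    let v: Vec<i32> = vec![0; LENGTH];
    // Optional: shed the capacity word and keep a plain boxed slice.
    let mut a: Box<[i32]> = v.into_boxed_slice();
    a[LENGTH - 1] = 42;
    println!("{}", a[LENGTH - 1]);
}

Whether you keep the Vec or convert it only changes the small handle on the stack, not where the ten million i32s live.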

Unfortunately, your benchmarks are flawed. First of all, you probably didn't compile with optimizations (--release for cargo, -O for rustc). Because if you had, the Rust compiler would have removed all of your code. See the assembly here. Why? Because you never observe the vector/array, so there is no need to do all that work in the first place.

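As a hedged aside (std::hint::black_box is my addition here, not something the answer uses): compile with --release and route the result through a black box so the optimizer cannot delete the whole benchmark body as unobserved.

use std::hint::black_box; // stable since Rust 1.66

fn main() {
    const LENGTH: usize = 10_000;
    let mut a: Vec<i32> = vec![0; LENGTH];
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
    // Pretend the result is observed, so the loops cannot be removed.
    black_box(&a);
}

Building with cargo run --release (or rustc -O) then gives the optimized code the answer is referring to.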

Also, your benchmark is not testing what you actually want to test. You are comparing a stack-allocated array with a heap-allocated vector. You should compare the Vec to a heap-allocated array.

Don't feel bad though: writing benchmarks is incredibly hard for many reasons. Just remember: if you don't know a lot about writing benchmarks, better not to trust your own benchmarks without asking others first.

I fixed your benchmark and included all three possibilities: Vec, array on stack and array on heap. You can find the full code here. The results are:

running 3 tests
test array_heap  ... bench:   9,600,979 ns/iter (+/- 1,438,433)
test array_stack ... bench:   9,232,623 ns/iter (+/- 720,699)
test vec_heap    ... bench:   9,259,912 ns/iter (+/- 691,095)

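The full benchmark is only linked, not quoted, in the original answer; the ns/iter output above matches the format of the nightly test crate's #[bench] harness, so the heap-array case presumably looked something like this sketch (my approximation, not the answer's exact code):

#![feature(test)] // requires a nightly toolchain
extern crate test;

use test::{black_box, Bencher};

const LENGTH: usize = 10_000;

#[bench]
fn array_heap(b: &mut Bencher) {
    // The array lives on the heap as a boxed slice.
    let mut a: Box<[i32]> = vec![0; LENGTH].into_boxed_slice();
    b.iter(|| {
        for j in 0..LENGTH {
            for i in 0..LENGTH {
                a[i] = j as i32;
            }
        }
        // Keep the result observable so the work isn't optimized away.
        black_box(&a);
    });
}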

Surprise: the difference between the versions is far less than the variance of the measurement. So we can assume they are all pretty much equally fast.

Note that this benchmark is still pretty bad. The two loops can just be replaced by one loop setting all array elements to LENGTH - 1. From a quick look at the assembly (and from the rather long time of 9ms), I think that LLVM is not smart enough to actually perform this optimization. But things like this are important and one should be aware of that.

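In other words (my illustration of that point, not code from the answer), the nested loops are observably equivalent to a single fill, using slice::fill:

fn main() {
    const LENGTH: usize = 10_000;
    let mut a: Vec<i32> = vec![0; LENGTH];
    // Only the last outer iteration (j == LENGTH - 1) is ever visible,
    // so the whole nested loop collapses to one pass.
    a.fill((LENGTH - 1) as i32);
}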

Finally, let's discuss why both solutions should be equally fast and whether there are actually differences in speed.

The data section of a Vec<T> has exactly the same memory layout as a [T]: just many Ts contiguously in memory. Super simple. This also means both exhibit the same caching-behavior (specifically, being very cache-friendly).

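A small consequence of that shared layout (my example): a Vec<i32>, a boxed slice, and a plain array all hand out the same &[i32] view, so the same code can consume any of them without copying.

// One slice-taking function serves all three containers, because the
// element storage is identical: contiguous i32s in memory.
fn sum(data: &[i32]) -> i64 {
    data.iter().map(|&x| x as i64).sum()
}

fn main() {
    let on_stack: [i32; 4] = [1, 2, 3, 4];
    let in_vec: Vec<i32> = vec![1, 2, 3, 4];
    let boxed: Box<[i32]> = vec![1, 2, 3, 4].into_boxed_slice();
    println!("{} {} {}", sum(&on_stack), sum(&in_vec), sum(&boxed));
}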

The only difference is that a Vec might have more capacity than elements. So Vec itself stores (pointer, length, capacity). That is one word more than a simple (boxed) slice (which stores (pointer, length)). A boxed array doesn't need to store the length, as it's already in the type, so it is just a simple pointer. Whether or not we store one, two or three words is not really important when you will have millions of elements anyway.

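That word count can be checked directly; a quick sketch I added (not from the answer), assuming a 64-bit target where a word is 8 bytes:

use std::mem::size_of;

fn main() {
    println!("Vec<i32>:          {}", size_of::<Vec<i32>>());          // 24: ptr + len + cap
    println!("Box<[i32]>:        {}", size_of::<Box<[i32]>>());        // 16: ptr + len
    println!("Box<[i32; 1000]>:  {}", size_of::<Box<[i32; 1000]>>());  //  8: just a ptr
}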

Accessing one element is the same for all three: we do a bounds check first and then calculate the target pointer via base_pointer + size_of::<T>() * index. But it's important to note that the array storing its length in the type means that the bounds check can be removed more easily by the optimizer! This can be a real advantage.

However, bounds checks are usually removed by the smart optimizer already. In the benchmark code I posted above, there are no bounds checks in the assembly. So while a boxed array could be a bit faster thanks to removed bounds checks, (a) this will be a minor performance difference, and (b) it's very unlikely that you will have a lot of situations where the bounds check is removed for the array but not for the Vec/slice.

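One related idiom, beyond what the answer covers (a hedged aside, my addition): iterating instead of indexing sidesteps per-element bounds checks altogether, and it works identically for Vec, boxed slices, and arrays.

fn fill_with(data: &mut [i32], value: i32) {
    // The iterator walks the slice without an index, so there is no
    // per-element bounds check to remove in the first place.
    for x in data.iter_mut() {
        *x = value;
    }
}

fn main() {
    let mut v = vec![0; 10_000];
    fill_with(&mut v, 7);
    println!("{}", v[9_999]);
}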