std :: vector中的每个元素访问是否都是缓存未命中? [英] Is every element access in std::vector a cache miss?

查看:156
本文介绍了std :: vector中的每个元素访问是否都是缓存未命中?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

众所周知,std::vector将其数据保存在堆上,因此向量本身的实例和第一个元素具有不同的地址.另一方面,std::array是围绕原始数组的轻量级包装,其地址等于第一个元素的地址.

It's known that std::vector hold its data on the heap so the instance of the vector itself and the first element have different addresses. On the other hand, std::array is a lightweight wrapper around a raw array and its address is equal to the first element's address.

让我们假设集合的大小足够容纳一个int32缓存行.在我的具有384kB L1缓存的计算机上,它是98304数字.

Let's assume that the sizes of collections is big enough to hold one cache line of int32. On my machine with 384kB L1 cache it's 98304 numbers.

如果我迭代std::vector,事实证明,我总是首先访问向量本身的地址,然后访问下一个访问元素的地址.可能此地址不在同一高速缓存行中.因此,每个元素访问都是缓存未命中.

If I iterate the std::vector it turns out that I always access first the address of the vector itself and next access element's address. And probably this addresses are not in the same cache line. So every element access is a cache miss.

但是如果我迭代std::array地址在同一缓存行中,那么它应该更快.

But if I iterate std::array addresses are in the same cache line so it should be faster.

我在VS2013上进行了全面优化测试,std::array的速度提高了约20%.

I tested with VS2013 with full optimization and std::array is approx 20% faster.

我的假设对吗?

更新:为了不创建第二个类似的主题.在这段代码中,我有一个数组和一些局部变量:

Update: in order to not create the second similar topic. In this code I have an array and some local variable:

void test(array<int, 10>& arr)
{
    int m{ 42 };

    for (int i{ 0 }; i < arr.size(); ++i)
    {
        arr[i] = i * m;
    }
}

在循环中,我要访问数组和堆栈变量,它们在内存中的位置彼此远离.这是否意味着每次迭代我都会访问不同的内存并错过缓存?

In the loop I'm accessing both an array and a stack variable which are placed far from each other in memory. Does that mean that every iteration I'll access different memory and miss the cache?

推荐答案

您说的许多事情都是正确的,但我不认为您会以自己认为的速度看到缓存未命中.相反,我认为您会看到编译器优化的其他效果.

Many of the things you've said are correct, but I do not believe that you're seeing cache misses at the rate that you believe you are. Rather, I think you're seeing other effects of compiler optimizations.

正确的是,当您在std::vector中查找元素时,有两个内存读取:首先,对于指向元素的指针的内存读取.第二,读取元素本身的内存.但是,如果您对std::vector进行了多个顺序读取,则很可能您第一次读取的内容将在元素上发生高速缓存未命中,但是所有连续的读取要么在高速缓存中,要么是不可避免的.内存高速缓存针对引用的位置进行了优化,因此,每当将单个地址拉入高速缓存时,大量相邻内存地址也将被拉入高速缓存.结果,如果您遍历std::vector的元素,则大多数时候根本不会发生任何高速缓存未命中的情况.该性能应该看起来与常规阵列的性能非常相似.还值得记住的是,缓存存储了多个不同的内存位置,而不仅仅是一个,因此您正在读取堆栈中的某些内容(std::vector内部指针)和堆中的某些内容(元素),或者两个堆栈中的不同元素,不会立即导致缓存未命中.

You are right that when you look up an element in a std::vector, that there are two memory reads: first, a memory read for the pointer to the elements; second, a memory read for the element itself. However, if you do multiple sequential reads on the std::vector, then chances are that the very first read you do will have a cache miss on the elements, but all successive reads will either be in cache or be unavoidable. Memory caches are optimized for locality of reference, so whenever a single address is pulled into cache a large number of adjacent memory addresses are pulled into the cache as well. As a result, if you iterate over the elements of a std::vector, most of the time you won't have any cache misses at all. The performance should look quite similar to that for a regular array. It's also worth remembering that the cache stores multiple different memory locations, not just one, so the fact that you're reading both something on the stack (the std::vector internal pointer) and something in the heap (the elements), or two different elements on the stack, won't immediately cause a cache miss.

需要记住的是,与高速缓存命中相比,高速缓存未命中的代价非常高(通常会慢10倍),因此,如果您确实在std::vector的每个元素上看到高速缓存未命中,就不会看不到性能差距只有20%.您会看到更接近2倍或更大性能差距的东西.

Something to keep in mind is that cache misses are extremely expensive compared to cache hits - often 10x slower - so if you were indeed seeing a cache miss on each element of the std::vector you wouldn't see a gap of only 20% in performance. You'd see something a lot closer to a 2x or greater performance gap.

那么,为什么您会看到性能差异?您尚未考虑的一大因素是,如果使用std::array<int, 10>,则编译器可以在编译时告知该数组中恰好有十个元素,并且可以展开或优化循环,您必须消除不必要的检查.实际上,编译器原则上可以用10个顺序的代码块替换循环,这些代码块都写入特定的数组元素,这可能比在循环中向后反复分支要快得多.另一方面,对于使用std::vector的等效代码,编译器无法始终提前知道循环将运行多少次,因此,它可能无法生成与为其生成的代码一样好的代码.数组.

So why, then, are you seeing a difference in performance? One big factor that you haven't yet accounted for is that if you use a std::array<int, 10>, then the compiler can tell at compile-time that the array has exactly ten elements in it and can unroll or otherwise optimize the loop you have to eliminate unnecessary checks. In fact, the compiler could in principle replace the loop with 10 sequential blocks of code that all write to a specific array element, which might be a lot faster than repeatedly branching backwards in the loop. On the other hand, with equivalent code that uses std::vector, the compiler can't always know in advance how many times the loop will run, so chances are it can't generate code that's as good as the code it generated for the array.

那么事实是,您在此处编写的代码是如此之小,以至于任何计时的尝试都会产生大量噪音.很难评估这种方法的可靠程度,因为与冷"运行该方法相比,将其放入for循环这样简单的操作只会使高速缓存的行为混乱.

Then there's the fact that the code you've written here is so small that any attempt to time it is going to have a ton of noise. It would be difficult to assess how fast this is reliably, since something as simple as just putting it into a for loop would mess up the cache behavior compared to a "cold" run of the method.

总的来说,我不会将其归因于缓存未命中,因为我怀疑它们的数量明显不同.相反,我认为这是对大小静态已知的数组的编译器优化,而对std::vector大小只能动态知道的数组的优化.

Overall, I wouldn't attribute this to cache misses, since I doubt there's any appreciably different number of them. Rather, I think this is compiler optimization on arrays whose sizes are known statically compared with optimization on std::vectors whose sizes can only be known dynamically.

这篇关于std :: vector中的每个元素访问是否都是缓存未命中?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆