boost::flat_map and its performance compared to map and unordered_map

Problem Description

It is common knowledge in programming that memory locality improves performance a lot due to cache hits. I recently found out about boost::flat_map which is a vector based implementation of a map. It doesn't seem to be nearly as popular as your typical map/unordered_map so I haven't been able to find any performance comparisons. How does it compare and what are the best use cases for it?

Thanks!

Recommended Answer

I have run a benchmark on different data structures very recently at my company so I feel I need to drop a word. It is very complicated to benchmark something correctly.

On the web we rarely find (if ever) a well engineered benchmark. Until today I only found benchmarks that were done the journalist way (pretty quickly and sweeping dozens of variables under the carpet).

1) You need to consider cache warming

Most people running benchmarks are afraid of timer discrepancy, so they run their stuff thousands of times and take the whole time; they are just careful to take the same thousands of runs for every operation, and then consider that comparable.

The truth is, in the real world this makes little sense, because your cache will not be warm, and your operation will likely be called just once. Therefore you need to benchmark using RDTSC, and time things calling them once only. Intel has made a paper describing how to use RDTSC (using a cpuid instruction to flush the pipeline, and calling it at least 3 times at the beginning of the program to stabilize it): http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf

2) RDTSC accuracy measurement

I also advise doing this:

#include <algorithm>  // std::min, std::max
#include <climits>    // ULLONG_MAX
#include <cstdio>     // printf
#ifdef _MSC_VER
#include <intrin.h>   // __cpuid, __rdtsc
#endif

typedef unsigned long long u64;  // not defined in the original snippet; assumed to be this

u64 g_correctionFactor;  // number of clocks to offset after each measurement to remove the overhead of the measurer itself.
u64 g_accuracy;

static u64 const errormeasure = ~((u64)0);

#ifdef _MSC_VER
#pragma intrinsic(__rdtsc)
inline u64 GetRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000);  // flush OOO instruction pipeline
    return __rdtsc();
}

inline void WarmupRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000);  // warmup cpuid.
    __cpuid(a, 0x80000000);
    __cpuid(a, 0x80000000);

    // measure the measurer overhead with the measurer (crazy, eh..)
    u64 minDiff = ULLONG_MAX;  // (was LLONG_MAX; a u64 sentinel needs the full unsigned range)
    u64 maxDiff = 0;   // this is going to help calculate our PRECISION ERROR MARGIN
    for (int i = 0; i < 80; ++i)
    {
        u64 tick1 = GetRDTSC();
        u64 tick2 = GetRDTSC();
        minDiff = std::min(minDiff, tick2 - tick1);   // make many takes, keep the smallest that ever came
        maxDiff = std::max(maxDiff, tick2 - tick1);
    }
    g_correctionFactor = minDiff;

    printf("Correction factor %llu clocks\n", g_correctionFactor);

    g_accuracy = maxDiff - minDiff;
    printf("Measurement Accuracy (in clocks) : %llu\n", g_accuracy);
}
#endif

This is a discrepancy measurer, and it takes the minimum of all measured values, to avoid getting a -10**18 value (among the first negative 64-bit values) from time to time.

Notice the use of intrinsics and not inline assembly. Inline assembly is rarely supported by compilers nowadays; much worse, the compiler creates a full ordering barrier around inline assembly because it cannot statically analyze the inside, so this is a problem when benchmarking real-world stuff, especially when calling things just once. An intrinsic is suited here because it doesn't break the compiler's free reordering of instructions.
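
As a usage sketch (assuming the helpers above, compiled with MSVC; the std::map insertion is just a hypothetical stand-in for whatever operation you want to time), a single cold measurement then looks like this:

#include <map>
#include <utility>

// Relies on u64, GetRDTSC(), WarmupRDTSC(), g_correctionFactor and
// g_accuracy from the snippet above (MSVC-only as written).
int main()
{
    WarmupRDTSC();

    std::map<int, int> m;
    u64 t0 = GetRDTSC();
    m.insert(std::make_pair(42, 1));  // the operation under test, called exactly once
    u64 t1 = GetRDTSC();

    u64 cost = (t1 - t0) - g_correctionFactor;  // subtract the measurer's own overhead
    printf("cold insert: %llu clocks (+/- %llu)\n", cost, g_accuracy);
    return 0;
}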

3) Parameters

The last problem is that people usually test too few variations of the scenario. A container's performance is affected by:


1. the allocator
2. the size of the contained type
3. the cost of the copy operation, assignment operation, move operation, and construction operation of the contained type
4. the number of elements in the container (the size of the problem)
5. whether the type has trivial 3-operations
6. whether the type is POD

Point 1 is important because containers allocate from time to time, and it matters a lot whether they allocate using the CRT "new" or some user-defined operation, like pool allocation, a freelist, or other...
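
To make that concrete, here is a minimal bump-arena allocator sketch (my illustration, C++17, not code from the benchmark); plugging it into std::map changes where every node lands, and therefore the cache behavior:

#include <cstddef>
#include <map>
#include <new>

// Toy bump allocator: hands out memory from one contiguous 1 MiB buffer
// and never frees. Illustrative only; real pool allocators recycle memory.
template <class T>
struct ArenaAlloc
{
    using value_type = T;
    static inline char buffer[1 << 20];
    static inline std::size_t offset = 0;

    ArenaAlloc() = default;
    template <class U> ArenaAlloc(const ArenaAlloc<U>&) {}

    T* allocate(std::size_t n)
    {
        std::size_t start = (offset + alignof(T) - 1) & ~(alignof(T) - 1);  // align up
        if (start + n * sizeof(T) > sizeof(buffer)) throw std::bad_alloc{};
        offset = start + n * sizeof(T);
        return reinterpret_cast<T*>(buffer + start);
    }
    void deallocate(T*, std::size_t) {}  // bump allocators don't free
};
template <class T, class U> bool operator==(const ArenaAlloc<T>&, const ArenaAlloc<U>&) { return true; }
template <class T, class U> bool operator!=(const ArenaAlloc<T>&, const ArenaAlloc<U>&) { return false; }

// Same std::map interface, completely different allocation pattern:
using PooledMap = std::map<int, int, std::less<int>,
                           ArenaAlloc<std::pair<const int, int>>>;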

(for people interested in pt 1, join the mystery thread on gamedev about the system allocator's performance impact: http://www.gamedev.net/topic/669459-massive-allocation-perf-difference-between-win8-win7-and-linux/)

Point 2 is because some containers (say A) will lose time copying stuff around, and the bigger the type, the bigger the overhead. The problem is that when comparing to another container B, A may win over B for small types, and lose for larger types.

Point 3 is the same as point 2, except it multiplies the cost by some weighting factor.

Point 4 is a question of big O mixed with cache issues. Containers with bad asymptotic complexity can largely outperform low-complexity containers for small numbers of elements (like vector vs. map: the vector's cache locality is good, while map fragments the memory). Then at some crossing point they lose, because the contained overall size starts to "leak" into main memory and causes cache misses; on top of that, the asymptotic complexity can start to be felt.

Point 5 is about compilers being able to elide stuff that is empty or trivial at compile time. This can greatly optimize some operations, because the containers are templated, so each type will have its own performance profile.

Point 6 is the same as point 5: PODs can benefit from the fact that copy construction is just a memcpy, and some containers have a specific implementation for these cases, using partial template specializations, or SFINAE to select algorithms according to traits of T.
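
A minimal sketch of that kind of dispatch (using C++17 if constexpr as a stand-in for the partial specialization / SFINAE the answer mentions):

#include <cstddef>
#include <cstring>
#include <type_traits>

// Select a memcpy fast path for trivially copyable types, an element-wise
// copy otherwise; containers do the same selection per instantiation.
template <class T>
void copy_range(const T* src, T* dst, std::size_t n)
{
    if constexpr (std::is_trivially_copyable_v<T>)
        std::memcpy(dst, src, n * sizeof(T));  // POD path: one flat copy
    else
        for (std::size_t i = 0; i < n; ++i)    // general path: calls T's copy assignment
            dst[i] = src[i];
}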

Apparently the flat map is a sorted vector wrapper, like Loki's AssocVector, but with some supplementary modernizations coming with C++11, exploiting move semantics to accelerate insertion and deletion of single elements.
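
For illustration, here is a minimal sketch of that sorted-vector idea (my simplification; boost::container::flat_map is the real, complete implementation):

#include <algorithm>
#include <utility>
#include <vector>

template <class K, class V>
struct FlatMapSketch
{
    std::vector<std::pair<K, V>> data;  // kept sorted by key, contiguous in memory

    V* find(const K& key)
    {
        auto it = std::lower_bound(data.begin(), data.end(), key,
            [](const std::pair<K, V>& kv, const K& k) { return kv.first < k; });
        return (it != data.end() && it->first == key) ? &it->second : nullptr;
    }

    void insert(K key, V value)  // O(N): shifts the tail, but stays cache friendly
    {
        auto it = std::lower_bound(data.begin(), data.end(), key,
            [](const std::pair<K, V>& kv, const K& k) { return kv.first < k; });
        if (it != data.end() && it->first == key)
            it->second = std::move(value);  // overwrite existing key
        else
            data.emplace(it, std::move(key), std::move(value));  // C++11 moves shift the tail
    }
};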

This is still an ordered container. Most people usually don't need the ordering part; hence the existence of unordered...

Have you considered that maybe you need a flat_unorderedmap? Which would be something like google::sparse_map, that is, an open-addressing hash map.
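
For readers unfamiliar with the term, here is a minimal linear-probing sketch of what "open addressing" means: one contiguous slot array, no per-node allocation (my illustration, not google::sparse_map's actual layout; it assumes the table is never full, since a real map rehashes per the criterion discussed below):

#include <cstddef>
#include <functional>
#include <vector>

struct OpenAddrSketch
{
    struct Slot { int key = 0; int value = 0; bool used = false; };
    std::vector<Slot> slots = std::vector<Slot>(16);  // one flat array of slots

    int* find(int key)
    {
        std::size_t i = std::hash<int>{}(key) % slots.size();
        while (slots[i].used)                       // probe forward until an empty slot
        {
            if (slots[i].key == key) return &slots[i].value;
            i = (i + 1) % slots.size();
        }
        return nullptr;                             // hit an empty slot: not present
    }
};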

The problem with open-addressing hash maps is that at rehash time they have to copy everything around to the new extended flat land, whereas a standard unordered map just has to recreate the hash index while the allocated data stays where it is. The disadvantage, of course, is that the memory is fragmented like hell.

The criterion for a rehash in an open-addressing hash map is when the capacity exceeds the size of the bucket vector multiplied by the load factor.

A typical load factor is 0.8; therefore, you need to care about that. If you can pre-size your hash map before filling it, always pre-size to: intended_filling * (1/0.8) + epsilon. This gives you a guarantee of never having to spuriously rehash and recopy everything during filling.
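
As a concrete illustration with the standard container (the same arithmetic applies to open-addressing maps; intended_filling is a made-up count):

#include <cstddef>
#include <unordered_map>

int main()
{
    std::unordered_map<int, int> m;
    const std::size_t intended_filling = 100000;  // hypothetical element count
    m.max_load_factor(0.8f);                      // match the 0.8 from the text

    // reserve() already folds in max_load_factor: it sizes the bucket array
    // for intended_filling elements, i.e. roughly intended_filling * (1/0.8) buckets.
    m.reserve(intended_filling);

    for (std::size_t i = 0; i < intended_filling; ++i)
        m.emplace(static_cast<int>(i), 0);        // no rehash happens during this fill
    return 0;
}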

The advantage of closed-addressing maps (std::unordered..) is that you don't have to care about those parameters.

But boost::flat_map is an ordered vector; therefore, it will always have log(N) asymptotic complexity, which is worse than the open-addressing hash map (amortized constant time). You should consider that as well.

This is a test involving different maps (with an int key and __int64/somestruct as the value) and std::vector.

Information about the tested types:

typeid=__int64 .  sizeof=8 . ispod=yes
typeid=struct MediumTypePod .  sizeof=184 . ispod=yes

Insertion

EDIT:

My previous results included a bug: they actually tested ordered insertion, which exhibited a very fast behavior for the flat maps.
I left those results further down this page because they are interesting.
This is the correct test:

I have checked the implementation: there is no such thing as a deferred sort implemented in the flat maps here. Each insertion sorts on the fly, therefore this benchmark exhibits the asymptotic tendencies:

map : N * log(N)
hashmaps : amortized N
vector and flatmaps : N * N

Warning: hereafter, the tests for std::map and both flat_maps are buggy and actually test ordered insertion:

We can see that ordered insertion results in back pushing, and is extremely fast. However, from the non-charted results of my benchmark, I can also say that this is not near the absolute optimality for a back-insertion. Perfect back-insertion optimality is obtained on a pre-reserved vector, which gives us 3 million cycles; we observe 4.8M here for boost (therefore 160% of the optimal).
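
A sketch of the two baselines being compared (my reconstruction, not the benchmark's actual code; long long stands in for the tests' MSVC __int64):

#include <boost/container/flat_map.hpp>
#include <cstddef>
#include <utility>
#include <vector>

void ordered_fill(std::size_t n)
{
    // The "absolute optimum": pure back-insertion into a pre-reserved vector.
    std::vector<std::pair<int, long long>> v;
    v.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        v.emplace_back(static_cast<int>(i), 0);

    // Ordered insertion into flat_map: the end() hint makes each insert a
    // comparison plus a push_back, hence close to (but above) the optimum.
    boost::container::flat_map<int, long long> fm;
    fm.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        fm.emplace_hint(fm.end(), static_cast<int>(i), 0);
}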

Random search of 3 elements (clocks renormalized to 1)

in size = 100

in size = 10000

Iteration

over size 100 (only MediumPod type)

over size 10000 (only MediumPod type)

Final grain of salt

In the end, I wanted to come back to "Benchmarking §3 Pt1" (the system allocator). In a recent experiment around the performance of an open-addressing hash map I am developing (http://sourceforge.net/projects/cgenericopenaddresshashmap/), I measured a performance gap of more than 3000% between Windows 7 and Windows 8 on some std::unordered_map use cases (discussed here: http://www.gamedev.net/topic/669459-massive-allocation-perf-difference-between-win8-win7-and-linux/).
This makes me want to warn the reader about the above results (they were made on Win7); so, your mileage may vary.

Regards
