Efficient implementation of binary heaps


Problem description



    I'm looking for information on how to implement binary heaps efficiently. I feel like there should be a nice article somewhere about implementing heaps efficiently, but I haven't found one. In fact I've been unable to find any resources on the matter of efficient implementation beyond the basics like storing the heap in an array. I'm looking for techniques for making a fast binary heap beyond the ones I describe below.

    I've already written a C++ implementation that is faster than Microsoft Visual C++'s and GCC's std::priority_queue or using std::make_heap, std::push_heap and std::pop_heap. The following are the techniques I've already got covered in my implementation. I only came up with the last 2 myself, though I doubt that those are new ideas:

    (Edit: added section on memory optimization)

  • Start indexes at 1
    Look at the Wikipedia implementation notes for binary heaps. If the heap's root is placed at index 0, then the formulas for parent, left-child and right-child of the node at index n are respectively (n-1)/2, 2n+1 and 2n+2. If you use a 1-based array then the formulas become the simpler n/2, 2n and 2n + 1. So parent and left-child are more efficient when using a 1-based array. If p points to a 0-based array and q = p - 1 then we can access p[0] as q[1] so there is no overhead in using a 1-based array.
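As a quick sketch, the 1-based formulas look like this (the function names are illustrative, not from any particular implementation):

```cpp
#include <cassert>
#include <cstddef>

// 1-based heap index arithmetic: the root lives at index 1.
inline std::size_t parent(std::size_t n)      { return n / 2; }
inline std::size_t left_child(std::size_t n)  { return 2 * n; }
inline std::size_t right_child(std::size_t n) { return 2 * n + 1; }
```

Note that forming q = p - 1 for a 0-based array is technically undefined behavior in standard C++, even though it works on common implementations; a portable alternative is to allocate one extra element and leave index 0 unused.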

  • Make pop/removal move element to bottom of heap before replacement with leaf
    Pop on a heap is frequently described as replacing the top element with the left-most bottom leaf and then moving that leaf down until the heap property is restored. This requires 2 comparisons per level that we pass, and we are likely to go far down the heap since we moved a leaf to the top of the heap. So we should expect a little less than 2 log n comparisons.

    Instead, we can leave a hole in the heap where the top element was. Then we move that hole down the heap by iteratively moving the larger child up. This requires only 1 comparison per level that we go past. In this way the hole will become a leaf. At this point we can move the right-most bottom leaf into the position of the hole and move that value up until the heap property is restored. Since the value we moved was a leaf we don't expect it to move very far up the tree. So we should expect a little more than log n comparisons, which is better than before.
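A minimal sketch of this hole-based pop for a max-heap of ints, using 0-based indexing for brevity (hole_pop is an illustrative name, not from the author's implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pop the max by moving a hole down to a leaf (1 key comparison per
// level, between the two children), then dropping the last leaf into
// the hole and bubbling it up.
int hole_pop(std::vector<int>& h) {
    int top = h[0];
    int v = h.back();              // right-most bottom leaf, to re-insert
    h.pop_back();
    std::size_t n = h.size();      // size after removal
    if (n == 0) return top;
    std::size_t hole = 0;
    // Phase 1: push the hole to the bottom, promoting the larger child.
    for (;;) {
        std::size_t l = 2 * hole + 1;
        if (l >= n) break;
        std::size_t big = (l + 1 < n && h[l + 1] > h[l]) ? l + 1 : l;
        h[hole] = h[big];
        hole = big;
    }
    // Phase 2: place the saved leaf in the hole and bubble it up.
    while (hole > 0 && h[(hole - 1) / 2] < v) {
        h[hole] = h[(hole - 1) / 2];
        hole = (hole - 1) / 2;
    }
    h[hole] = v;
    return top;
}
```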

  • Support replace-top
    Suppose you want to remove the max element and also insert a new element. Then you can do either of the removal/pop implementations described above, but instead of moving the right-most bottom leaf, you use the new value you wish to insert/push. (When most operations are of this kind I've found that a tournament tree is better than a heap, but otherwise the heap is slightly better.)
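Replace-top can be sketched as a plain sift-down seeded with the new value, stopping early once the new value dominates both children (replace_top is an illustrative name; the hole-based variant above works just as well):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pop the max and push a new value in a single pass: sift the new
// value down from the root instead of re-inserting a bottom leaf.
int replace_top(std::vector<int>& h, int v) {
    int top = h[0];
    std::size_t n = h.size(), hole = 0;
    for (;;) {
        std::size_t l = 2 * hole + 1;
        if (l >= n) break;
        std::size_t big = (l + 1 < n && h[l + 1] > h[l]) ? l + 1 : l;
        if (h[big] <= v) break;    // new value already dominates; stop
        h[hole] = h[big];
        hole = big;
    }
    h[hole] = v;
    return top;
}
```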

  • Make sizeof(T) a power of 2
    The parent, left-child and right-child formulas work on indexes and they cannot be made to work directly on pointer values. So we are going to be working with indexes and that implies looking up values p[i] in an array p from an index i. If p is a T* and i is an integer, then

    &(p[i]) == reinterpret_cast<char*>(p) + sizeof(T) * i
    

    and the compiler has to perform this computation to get p[i]. sizeof(T) is a compile-time constant, and the multiplication can be done more efficiently if sizeof(T) is a power of two. My implementation got faster by adding 8 padding bytes to increase sizeof(T) from 24 to 32. Reduced efficiency of the cache probably means that this is not a win for sufficiently large data sets.
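One way to sketch the padding trick (the payload is hypothetical; alignas is used here to force sizeof up from 24 to 32):

```cpp
#include <cassert>

// Hypothetical 24-byte payload (on an LP64 platform): a key plus two
// pointers.
struct Node24 {
    double key;
    void*  left;
    void*  right;
};

// alignas(32) pads sizeof(Node32) up to 32, a power of two, so the
// compiler can turn the indexing multiply into a single shift.
struct alignas(32) Node32 {
    double key;
    void*  left;
    void*  right;
};

static_assert(sizeof(Node32) == 32, "sizeof must be a power of two");
```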

  • Pre-multiply indexes
    This was a 23% performance increase on my data set. The only thing we ever do with an index other than finding parent, left-child and right-child is to look the index up in an array. So if we keep track of j = sizeof(T) * i instead of an index i, then we could do a lookup p[i] without the multiplication that is otherwise implicit in evaluating p[i] because

    &(p[i]) == reinterpret_cast<char*>(p) + sizeof(T) * i == reinterpret_cast<char*>(p) + j
    

    Then the left-child and right-child formulas for j-values become respectively 2*j and 2*j + sizeof(T). The parent formula is a little more tricky, and I haven't found a way to do it other than converting the j-value to an i-value and back like so:

    parentOnJ(j) = parent(j/sizeof(T))*sizeof(T) == (j/(2*sizeof(T)))*sizeof(T)
    

    If sizeof(T) is a power of 2 then this will compile to 2 shifts. That is 1 operation more than the usual parent using indexes i. However we then save 1 operation on lookup. So the net effect is that finding the parent takes the same amount of time this way, while lookup of left-child and right-child becomes faster.
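A sketch of the j-value formulas, assuming sizeof(T) == 32 and the 1-based layout from earlier (the constant S and the function names are illustrative):

```cpp
#include <cassert>
#include <cstddef>

// Arithmetic on pre-multiplied byte offsets j = sizeof(T) * i.
// S stands in for sizeof(T) and is assumed to be a power of two, so
// the divisions and multiplications below compile to shifts.
constexpr std::size_t S = 32;

constexpr std::size_t left_j(std::size_t j)   { return 2 * j; }
constexpr std::size_t right_j(std::size_t j)  { return 2 * j + S; }
constexpr std::size_t parent_j(std::size_t j) { return j / (2 * S) * S; }
```

With i = 3 (so j = 96), left_j gives 192 (i = 6), right_j gives 224 (i = 7), and parent_j gives 32 (i = 1), matching the 1-based index formulas.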

  • Memory optimization

    The answers of TokenMacGuy and templatetypedef point out memory based optimizations that reduce cache misses. For very large data sets or infrequently used priority queues, parts of the queue can be swapped out to disk by the OS. In that case it is worth it to add a lot of overhead to make optimal use of the cache because swapping in from disk is very slow. My data easily fits in memory and is in continuous use, so no part of the queue will likely be swapped to disk. I suspect that this is the case for most uses of priority queues.

    There are other priority queues that are designed to make better use of the CPU cache. For example a 4-heap should have fewer cache misses and the amount of extra overhead is not that much. LaMarca and Ladner report in 1996 that they get 75% performance improvement from going to aligned 4-heaps. However, Hendriks reports in 2010 that:

    The improvements to the implicit heap suggested by LaMarca and Ladner [17] to improve data locality and reduce cache misses were also tested. We implemented a four-way heap, that indeed shows a slightly better consistency than the two-way heap for very skewed input data, but only for very large queue sizes. Very large queue sizes are better handled by the hierarchical heap.
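For reference, the index arithmetic behind a 0-based implicit 4-heap is a small extension of the binary case (a sketch; not taken from the cited papers):

```cpp
#include <cassert>
#include <cstddef>

// 0-based implicit 4-ary heap: node i has children 4i+1 .. 4i+4 and
// parent (i-1)/4. Four children per node shortens the tree and packs
// siblings into fewer cache lines, at the cost of more comparisons
// per level.
constexpr std::size_t child4(std::size_t i, std::size_t k) { // k in 0..3
    return 4 * i + 1 + k;
}
constexpr std::size_t parent4(std::size_t i) {
    return (i - 1) / 4;
}
```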

  • Question
    Are there more techniques than these?

    Solution

    An interesting paper/article on this topic considers the effect of caching/paging behavior on the overall layout of the heap; the idea is that paying for a cache miss or a page-in is vastly more costly than nearly any other part of a data structure's implementation. The paper discusses a heap layout that addresses this.

    You're Doing It Wrong by Poul-Henning Kamp

