How to avoid using _siftup or _siftdown in heapq

Question

I have no idea how to solve the following problem efficiently without using _siftup or _siftdown:

How to restore the heap invariant when one element is out of order?

In other words: update old_value in the heap to new_value, and keep the heap working. You can assume there is only one old_value in the heap. The function definition looks like:

def update_value_in_heap(heap, old_value, new_value):

Here is my real scenario; read it if you are interested.

  • You can imagine it as a small autocomplete system. I need to count the frequency of words and maintain the top k max-count words, ready to output at any moment. So I use a heap here. When a word's count increases, I need to update it if it is in the heap.

All the words and counts are stored in the trie-tree's leaves, and heaps are stored in the trie-tree's middle nodes. If you care about a word that falls out of the heap, don't worry, I can get it from the trie-tree's leaf node.

When a user types a word, it will first be read from the heap and then updated. For better performance, we could consider decreasing the update frequency by updating in batches.

So how do we update the heap when one particular word's count increases?

Here is a simple example of the _siftup / _siftdown version (not my scenario):

>>> from heapq import _siftup, _siftdown, heapify, heappop

>>> data = [10, 5, 18, 2, 37, 3, 8, 7, 19, 1]
>>> heapify(data)
>>> old, new = 8, 22              # increase the 8 to 22
>>> i = data.index(old)
>>> data[i] = new
>>> _siftup(data, i)
>>> [heappop(data) for i in range(len(data))]
[1, 2, 3, 5, 7, 10, 18, 19, 22, 37]

>>> data = [10, 5, 18, 2, 37, 3, 8, 7, 19, 1]
>>> heapify(data)
>>> old, new = 8, 4              # decrease the 8 to 4
>>> i = data.index(old)
>>> data[i] = new
>>> _siftdown(data, 0, i)
>>> [heappop(data) for i in range(len(data))]
[1, 2, 3, 4, 5, 7, 10, 18, 19, 37]
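
Wrapped up as the function asked for above, this _siftup/_siftdown approach would look roughly like the sketch below (assuming, as stated, that old_value occurs exactly once in the heap):

from heapq import _siftup, _siftdown

def update_value_in_heap(heap, old_value, new_value):
    i = heap.index(old_value)    # O(n): locate the value to change
    heap[i] = new_value          # overwrite it in place
    if new_value > old_value:
        _siftup(heap, i)         # value grew: it may now be larger than its children
    else:
        _siftdown(heap, 0, i)    # value shrank: it may now be smaller than its parent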

It costs O(n) to find the index and O(log n) to update. heapify is another solution, but it is less efficient than _siftup or _siftdown.
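
For comparison, a minimal sketch of that heapify-based alternative on the same data (the whole heap is rebuilt in O(n) instead of being repaired in O(log n)):

>>> data = [10, 5, 18, 2, 37, 3, 8, 7, 19, 1]
>>> heapify(data)
>>> i = data.index(8)            # O(n) to find the old value
>>> data[i] = 22                 # increase the 8 to 22
>>> heapify(data)                # O(n) repair instead of O(log n) _siftup
>>> [heappop(data) for i in range(len(data))]
[1, 2, 3, 5, 7, 10, 18, 19, 22, 37]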

But _siftup and _siftdown are protected members of heapq, so accessing them from outside is not recommended.

So is there a better and more efficient way to solve this problem? What is the best practice for this situation?

Thanks for reading, I really appreciate any help. :)

I have already looked at heapq python - how to modify values for which heap is sorted, but it has no answer to my problem.

Answer

TL;DR: use heapify.

One important thing that you have to keep in mind is that theoretical complexity and performance are two different things (even though they are related). In other words, implementation matters too. Asymptotic complexities give you lower bounds that you can see as guarantees; for example, an algorithm in O(n) ensures that in the worst-case scenario you will execute a number of instructions that is linear in the input size. There are two important things here:

  1. constants are ignored, but constants matter in real life;
  2. the worst case depends on the algorithm you consider, not only on the input.

Depending on the topic/problem you consider, the first point can be very important. In some domains, the constants hidden in asymptotic complexities are so big that you can't even build inputs that are bigger than the constants (or such inputs wouldn't be realistic to consider). That's not the case here, but it's something you always have to keep in mind.

Given these two observations, you can't really say: implementation B is faster than A because A is derived from an O(n) algorithm and B is derived from an O(log n) algorithm. Even if that's a good argument to start with in general, it's not always sufficient. Theoretical complexities are especially good for comparing algorithms when all inputs are equally likely to happen, in other words, when your algorithms are very generic.

In the case where you know what your use cases and inputs will be, you can just test performance directly. Using both the tests and the asymptotic complexity will give you a good idea of how your algorithm will perform (in both extreme cases and arbitrary practical cases).

That being said, let's run some performance tests on the following class, which implements three different strategies (there are actually four strategies here, but Invalidate and Reinsert doesn't seem right in your case, as you'd invalidate each item as many times as you see a given word). I'll include most of my code so you can double-check that I haven't messed up (you can even check the complete notebook):

from heapq import _siftup, _siftdown, heapify, heappop

class Heap(list):
    # Extra keyword arguments (such as lazzy_fix, which the test harness below
    # passes through to the constructor) are accepted and ignored.
    def __init__(self, values, sort_fix=False, heap_fix=False, **kwargs):
        super().__init__(values)
        heapify(self)
        self._broken = False
        self.sort_fix = sort_fix
        self.heap_fix = heap_fix or not sort_fix

    # Solution 1) repair using the knowledge we have after every update:
    def update(self, key, value):
        old, self[key] = self[key], value
        if value > old:
            _siftup(self, key)        # new value may be larger than its children
        else:
            _siftdown(self, 0, key)   # new value may be smaller than its parent

    # Solutions 2 and 3) repair using sort/heapify in a lazy way:
    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self._broken = True

    def __getitem__(self, key):
        if self._broken:
            self._repair()
            self._broken = False
        return super().__getitem__(key)

    def _repair(self):
        if self.sort_fix:
            self.sort()
        elif self.heap_fix:
            heapify(self)

    # … you'll also need to delegate all other heap functions, for example:
    def pop(self):
        self._repair()
        return heappop(self)

We can first check that all three methods work:

data = [10, 5, 18, 2, 37, 3, 8, 7, 19, 1]

heap = Heap(data[:])
heap.update(8, 22)
heap.update(7, 4)
print(heap)

heap = Heap(data[:], sort_fix=True)
heap[8] = 22
heap[7] = 4
print(heap)

heap = Heap(data[:], heap_fix=True)
heap[8] = 22
heap[7] = 4
print(heap)

Then we can run some performance tests using the following functions:

import time
import random

def rand_update(heap, lazzy_fix=False, **kwargs):
    # Pick a random position and raise it to a value above everything else.
    index = random.randint(0, len(heap) - 1)
    new_value = random.randint(max_int + 1, max_int * 2)
    if lazzy_fix:
        heap[index] = new_value        # solutions 2/3: just mark the heap as broken
    else:
        heap.update(index, new_value)  # solution 1: repair immediately with _siftup/_siftdown

def rand_updates(n, heap, lazzy_fix=False, **kwargs):
    for _ in range(n):
        rand_update(heap, lazzy_fix)

def run_perf_test(n, data, **kwargs):
    test_heap = Heap(data[:], **kwargs)
    t0 = time.time()
    rand_updates(n, test_heap, **kwargs)
    test_heap[0]                       # force the lazy repair before stopping the clock
    return (time.time() - t0) * 1e3    # duration in milliseconds

results = []
max_int = 500
nb_updates = 1

for i in range(3, 7):
    test_size = 10**i
    test_data = [random.randint(0, max_int) for _ in range(test_size)]

    perf = run_perf_test(nb_updates, test_data)
    results.append((test_size, "update", perf))
    
    perf = run_perf_test(nb_updates, test_data, lazzy_fix=True, heap_fix=True)
    results.append((test_size, "heapify", perf))

    perf = run_perf_test(nb_updates, test_data, lazzy_fix=True, sort_fix=True)
    results.append((test_size, "sort", perf))

Here are the results:

import pandas as pd
import seaborn as sns

dtf = pd.DataFrame(results, columns=["heap size", "method", "duration (ms)"])
print(dtf)

sns.lineplot(
    data=dtf, 
    x="heap size", 
    y="duration (ms)", 
    hue="method",
)

From these tests, we can see that heapify seems to be the most reasonable choice: it has a decent worst-case complexity of O(n) and performs better in practice. On the other hand, it's probably a good idea to investigate other options, such as a data structure dedicated to this particular problem; for example, using bins to drop words into and then moving them from one bin to the next looks like a possible track to investigate.
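
For what it's worth, here is a rough, hypothetical sketch of that bin idea (the BinCounter name and its interface are mine, not part of the original answer): keep a word -> count map plus a count -> words map, so that a count++ just moves one word to the neighbouring bin in O(1), and the top-k words can be read off by walking the bins from the highest count down.

from collections import defaultdict

class BinCounter:
    # Hypothetical sketch: counts are bucketed, so incrementing a word is O(1).
    def __init__(self):
        self.count = defaultdict(int)   # word -> current count
        self.bins = defaultdict(set)    # count -> words having exactly that count

    def increment(self, word):
        c = self.count[word]
        if c:
            self.bins[c].discard(word)  # leave the old bin
            if not self.bins[c]:
                del self.bins[c]
        self.count[word] = c + 1
        self.bins[c + 1].add(word)      # enter the next bin

    def top_k(self, k):
        out = []
        for c in sorted(self.bins, reverse=True):   # walk bins from the highest count down
            for word in self.bins[c]:
                out.append((word, c))
                if len(out) == k:
                    return out
        return out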

Important remark: this scenario (an update-to-read ratio of 1:1) is unfavourable to both the heapify and sort solutions. So if you manage to have a k:1 ratio, this conclusion will be even clearer (you can replace nb_updates = 1 with nb_updates = k in the above code).

Dataframe details:

    heap size   method  duration in ms
0        1000   update        0.435114
1        1000  heapify        0.073195
2        1000     sort        0.101089
3       10000   update        1.668930
4       10000  heapify        0.480175
5       10000     sort        1.151085
6      100000   update       13.194084
7      100000  heapify        4.875898
8      100000     sort       11.922121
9     1000000   update      153.587103
10    1000000  heapify       51.237106
11    1000000     sort      145.306110
