Ideal data structure with fast lookup, fast update and easy comparison/sorting

Problem description

I am looking for a good data structure to contain a list of tuples with (hash, timestamp) values. Basically, I want to use it in the following way:

  • Data comes in, check to see if it's already present in the data structure (hash equality, not timestamp).
  • If it is, update the timestamp to "now"
  • If not, add it to the set with timestamp "now"

Periodically, I wish to remove and return a list of tuples that are older than a specific timestamp (I need to update various other elements when they 'expire'). The timestamp does not have to be anything specific (it can be a unix timestamp, a Python datetime object, or some other easy-to-compare hash/string).

I am using this to receive incoming data, update it if it's already present and purge data older than X seconds/minutes.

Multiple data structures can be a valid suggestion as well (I originally went with a priority queue + set, but a priority queue is less-than-optimal for constantly updating values).

Other approaches to achieve the same thing are welcome as well. The end goal is to track when elements a) are new to the system, b) already exist in the system and c) expire.

Solution

This is a pretty well-trodden space. You need two structures. First, you need something to tell you whether your key (the hash, in your case) is known to the collection. For this, dict is a very good fit; we'll just map the hash to the timestamp so you can look up each item easily. Second, iterating over the items in order of timestamp is a task particularly suited to heaps, which are provided by the heapq module. Each time we see a key, we'll just add it to our heap as a tuple of (timestamp, hash).
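As a small illustration of the heap half of that design, the sketch below pushes (timestamp, hash) tuples and pops them oldest first. The sample hashes and integer timestamps are made up for the example.

```python
import heapq

# Hypothetical sample data: (timestamp, hash) tuples with made-up values.
heap = []
for entry in [(30, "c3f1"), (10, "a9b2"), (20, "7e4d")]:
    heapq.heappush(heap, entry)

# Tuples compare element by element, so the smallest timestamp is always at heap[0].
oldest_first = [heapq.heappop(heap) for _ in range(len(heap))]
print(oldest_first)  # [(10, 'a9b2'), (20, '7e4d'), (30, 'c3f1')]
```

Because the timestamp comes first in each tuple, the hash only matters as a tie-breaker when two timestamps are equal.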

Unfortunately, there's no efficient way to look into a heapified list and remove certain items (because, say, they have been updated to expire later). We'll get around that by simply ignoring entries in the heap whose timestamps differ from the value in the dict.

So here's a place to start; you can probably add methods to the wrapper class to support additional operations, or change the way data is stored:

import heapq


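class ExpiringCache(object):
    def __init__(self):
        self._dict = {}  # key -> most recent expiry
        self._heap = []  # (expiry, key) tuples; may hold stale entries

    def add(self, key, expiry):
        # Overwrite the expiry in the dict; any older heap entry for this
        # key is left in place and skipped later by collect().
        self._dict[key] = expiry
        heapq.heappush(self._heap, (expiry, key))

    def contains(self, key):
        return key in self._dict

    def collect(self, maxage):
        # Pop every heap entry whose expiry is at or before maxage.
        while self._heap and self._heap[0][0] <= maxage:
            expiry, key = heapq.heappop(self._heap)
            # Delete only if this entry is still the current one for the key.
            if self._dict.get(key) == expiry:
                del self._dict[key]

    def items(self):
        return self._dict.items()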

Create a cache and add some items:

>>> xc = ExpiringCache()
>>> xc.add('apples', 1)
>>> xc.add('bananas', 2)
>>> xc.add('mangoes', 3)

Re-add an item with an even later expiry:

>>> xc.add('apples', 4)

Collect everything "older" than two time units:

>>> xc.collect(2)
>>> xc.contains('apples')
True
>>> xc.contains('bananas')
False
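
The question also asks for the expired tuples to be returned and for timestamps to default to "now". The following is a minimal sketch of one way to extend the class above, assuming `time.time()` as the clock. The names `ReturningExpiringCache` and `touch`, and the return values, are illustrative additions rather than part of the original answer.

```python
import heapq
import time


class ReturningExpiringCache:
    """Variant of ExpiringCache whose collect() returns what it removed.

    Hypothetical extension for illustration; not part of the original answer.
    """

    def __init__(self):
        self._dict = {}   # key -> latest timestamp
        self._heap = []   # (timestamp, key); may contain stale entries

    def touch(self, key, now=None):
        """Record key as seen at `now`; return True if the key was new."""
        if now is None:
            now = time.time()  # assumption: unix timestamps as the clock
        is_new = key not in self._dict
        self._dict[key] = now
        heapq.heappush(self._heap, (now, key))
        return is_new

    def collect(self, maxage):
        """Remove and return (key, timestamp) pairs with timestamp <= maxage."""
        expired = []
        while self._heap and self._heap[0][0] <= maxage:
            timestamp, key = heapq.heappop(self._heap)
            # Skip stale heap entries: the key was updated or already removed.
            if self._dict.get(key) == timestamp:
                del self._dict[key]
                expired.append((key, timestamp))
        return expired


cache = ReturningExpiringCache()
cache.touch("apples", now=1)
cache.touch("bananas", now=2)
cache.touch("mangoes", now=3)
cache.touch("apples", now=4)   # refresh: the (1, "apples") heap entry is now stale
expired = cache.collect(2)
print(expired)  # [('bananas', 2)]
```

With real clock values, a periodic call such as `cache.collect(time.time() - 60)` would purge and return everything not seen in the last 60 seconds. Stale heap entries are only discarded when they reach the top of the heap, so keys that are refreshed very often will make the heap grow between collections.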
