高效的数据结构,用于存储长序列(大多数是连续的)整数 [英] Efficient data structure for storing a long sequence of (mostly consecutive) integers

查看:78
本文介绍了高效的数据结构,用于存储长序列(大多数是连续的)整数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想要一个数据结构来有效地存储一长串数字.数字应该始终是整数,例如Longs.

I'd like a data structure to efficiently store a long sequence of numbers. The numbers should always be whole integers, let's say Longs.

我想利用(声称效率")的输入的特征是多头将主要是连续的.可能缺少值.并且这些值可以不按顺序交互.

The feature of the inputs I'd like to leverage (to claim 'efficiency') is that the longs will be mostly consecutive. There can be missing values. And the values can be interacted with out of order.

我希望数据结构支持以下操作:

I'd like the data structure to support the following operations:

  • addVal(n):添加单个值n
  • addRange(n,m):将n和m之间的所有值相加(含)
  • delVal(n):删除单个值n
  • delRange(n,m):删除n和m之间的所有值,包括端值
  • containsVal(n):返回结构中是否存在单个值n
  • containsRange(n,m):返回结构中是否存在n和m之间的所有值(包括在内)

从本质上讲,这是一种更特定的Set数据结构,可以利用数据的连续性来使用少于O(n)的内存,其中n是存储的值的数量.

In essence this is a more specific kind of Set data structure that can leverage the continuity of the data to use less than O(n) memory, where n is the number of values stored.

需要明确的是,尽管我认为有效地实现这种数据结构将需要我们在内部存储时间间隔,这对于用户而言是不可见的或不相关的.有一些间隔树分别存储多个间隔,并允许操作查找与给定点或间隔重叠的间隔数.但是从用户的角度来看,它的行为应完全类似于集合(基于范围的操作除外,因此可以有效地处理批量添加和删除操作).

To be clear, while I think an efficient implementation of such a data structure will require that we store intervals internally, that isn't visible or relevant to the user. There are some interval trees that store multiple intervals separately and allow operations to find the number of intervals that overlap with a given point or interval. But from the user perspective this should behave exactly like a set (except for the range-based operations so bulk additions and deletions can be handled efficiently).

示例:

dataStructure = ...
dataStructure.addRange(1,100) // [(1, 100)]
dataStructure.addRange(200,300) // [(1, 100), (200, 300)]
dataStructure.addVal(199) // [(1, 100), (199, 300)]
dataStructure.delRange(50,250) // [(1, 49), (251, 300)]

我的假设是最好通过某些基于树的结构来实现,但是我对如何做到这一点还没有很好的印象. 我想了解是否有一些已经满足此用例的常用数据结构,因为我不想重蹈覆辙.如果没有,我想听听您如何认为最好的实施方式.

My assumption is this would best be implemented by some tree-based structure but I don't have a great impression about how to do that yet. I wanted to learn if there was some commonly used data structure that already satisfies this use case, as I'd rather not reinvent the wheel. If not, I'd like to hear how you think this could best be implemented.

推荐答案

如果您不关心重复项,则您的间隔是不重叠的.您需要创建一个维护该不变性的结构.如果您需要类似numIntervalsContaining(n)的查询,那就是另一个问题.

If you don't care about duplicates, then your intervals are non-overlapping. You need to create a structure that maintains that invariant. If you need a query like numIntervalsContaining(n) then that is a different problem.

您可以使用BST来存储端点对,就像在C ++ std::set<std::pair<long,long>>中一样.解释是每个条目对应于间隔[n,m].您需要一个弱排序-这是左端点上通常的整数排序.单个intlong n作为[n,n]插入.我们必须维护所有节点间隔都不重叠的属性.以下是对操作顺序的简要评估.由于您已经指定了n,因此我将N用作树的大小.

You could use a BST that stores pairs of endpoints, as in a C++ std::set<std::pair<long,long>>. The interpretation is that each entry corresponds to the interval [n,m]. You need a weak ordering - it is the usual integer ordering on the left endpoint. A single int or long n is inserted as [n,n]. We have to maintain the property that all node intervals are non-overlapping. A brief evaluation of the order of your operations is as follows. Since you've already designated n I use N for the size of the tree.

  • addVal(n):添加单个值n:O(log N),与std::set<int>相同.由于间隔是不重叠的,因此您需要找到n的前身,这可以在O(log n)时间内完成(按

  • addVal(n): add a single value, n : O(log N), same as a std::set<int>. Since the intervals are non-overlapping, you need to find the predecessor of n, which can be done in O(log n) time (break it down by cases as in https://www.quora.com/How-can-you-find-successors-and-predecessors-in-a-binary-search-tree-in-order). Look at the right endpoint of this predecessor, and extend the interval or add an additional node [n,n] if necessary, which by left-endpoint ordering will always be a right child. Note that if the interval is extended (inserting [n+1,n+1] into a tree with node [a,n] forming the node [a,n+1]) it may now bump into the next left endpoint, requiring another merge. So there are a few edge cases to consider. A little more complicated than a simple BST, but still O(log N).

addRange(n,m):O(log N),过程相似.如果插入的间隔与另一个间隔不平凡,请合并间隔,以保持非重叠属性.如下所示,最糟糕的情况是O(n),因为我们插入的子间隔所消耗的子间隔数没有上限.

addRange(n, m): O(log N), process is similar. If the interval inserted intersects nontrivially with another, merge the intervals so that the non-overlapping property is maintained. The worst case performance is O(n) as pointed out below as there is no upper limit on the number of subintervals consumed by the one we are inserting.

请注意,我们可以使用正确的add()和addRange()方法维护非重叠属性,该属性已经由delete方法维护.我们需要最差的O(n)存储空间.

Note that we can maintain the non-overlapping property with correct add() and addRange() methods, it is already maintained by the delete methods. We need O(n) storage at the worst.

请注意,所有操作都是O(log N),插入范围[n,m]O(m-n)O(log(m-n))或类似内容完全不同.

Note that all operations are O(log N), and inserting the range [n,m] is nothing like O(m-n) or O(log(m-n)) or anything like that.

我认为您不在乎重复项,而只是会员资格.否则,您可能需要间隔树或KD树之类的东西,但它们与浮点数据更相关...

I assume you don't care about duplicates, just membership. Otherwise you may need an interval tree or KD-tree or something, but those are more relevant for float data...

这篇关于高效的数据结构,用于存储长序列(大多数是连续的)整数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆