快速路口操作的数据结构? [英] Data structures for fast intersection operations?

查看:77
本文介绍了快速路口操作的数据结构?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

随机选择两组,两组都包含不同的键(一个键可能属于多个组,一组永远不能包含重复的键).

返回一个整数,代表两个组中键的数量.

例如,intersect({1,2,3,4},{3,4,5})返回2.

我只需要交叉点的大小即可.我不需要确切知道交叉点中有哪些键.

是否有任何数据结构在少于O(n)的时间内支持这种操作??

读出数据确实需要O(n)时间,但是这不会导致您不能在不到O(n)的时间内完成交叉操作的结论..

为这种情况成像:

我有N套,每套包含100个按键.我读了起来,这是N * 100个运算.现在,我想知道这对集合具有最大的交集,即O(N²)个交集运算.因此,我想减少交集运算的复杂性.我不是真的关心读取和构造集合需要花费多少时间,最多为N * 100,这与O(N²)相交运算没有什么不同.

请注意,通过执行少于O(N²)个交集操作,您不可能找到最大交集的对,我可以证明.您必须交叉口操作.

(他的基本思想是,让我们想象一个完整的图,具有N个顶点,每个顶点代表一个集合,Nx(N-1)/2个边,每个代表连接对的交点.现在给每个边您想要的非负权重(代表相交大小),我总是可以构造满足这些Nx(N-1)/2个边缘权重的N个集合.这证明了我的主张.)

解决方案

我建议您研究一下两种可能的替代方法,它们在实践中特别有效(尤其是在大型情况下,strong>).

1.布鲁姆过滤器数据结构

Bloom过滤器是一种空间高效(基于位数组)的概率数据结构,用于测试元素是否为集合的成员.错误的肯定匹配是可能的,但错误的否定匹配是不可能的.

在误报率和布隆过滤器的内存占用量之间需要权衡.因此,可以针对不同的用例来估计Bloom过滤器的合适大小.

每个集合都可以与其自己的Bloom Filter关联.很容易获得与不同集合的交集相对应的Bloom过滤器:可以使用按位 AND 操作.

具有与相交点相对应的Bloom筛选器,可以快速找到所有相交集中存在的项.

除此之外,可以估计交集的基数,而无需对整个集合进行实际迭代: 跳过列表数据结构

跳过列表"是一种数据结构,它允许在元素的有序序列中进行快速搜索和交集.通过维护子序列的链接层次结构,可以快速搜索和相交,每个子序列都跳过更少的元素.

简而言之,跳过列表"与普通的链表"数据结构非常相似,但是,跳过列表"的每个节点都具有几个指向项的其他指针,这些指针位于更远的位置(跳过"指针的指针).列表中的其他两个节点.

因此,为了获得交点-需要在所有相交的跳过列表"中维护指针.在跳过列表"的交集期间,指针会跳过所有交叉的跳过列表"中不存在的项目.因此,通常,相交操作的运行时复杂度要比 O(N).

跳过列表的交集的算法在书"Introduction to Information Retrieval"中描述.(由Christopher D. Manning,Prabhakar Raghavan,HinrichSchütze撰写): http://nlp.stanford.edu/IR-book/html/htmledition/faster-postings-list-intersection-via-skip-pointers-1.html

跳过列表已在高性能,功能齐全的文本搜索引擎库中得到积极使用: Apache Lucene (在反向索引"组件中使用了跳过列表).

这是有关Lucene中跳过列表用法的另一个Stackoverflow问题: 解决方案

I would advice you to take a look at the two possible alternatives, which work particularly well in practice (especially in case of the large sets).

1. The Bloom Filter data structure

A Bloom filter is a space-efficient (based on the bit array) probabilistic data structure, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not.

There is a trade-off between False Positive rate and the memory footprint of the Bloom Filter. Hence, it is possible to estimate the appropriate size of the Bloom Filter for different use cases.

Each set can be associated with its own Bloom Filter. It is very easy to obtain the Bloom Filter, which corresponds to the intersection of the different sets: all bit arrays, which correspond to the different Bloom Filters, can be combined using the bitwise AND operation.

Having the Bloom Filter, which corresponds to the intersection, it is possible to find quickly the items, which are present in all intersected sets.

Apart from that, it is possible to estimate the cardinality of the intersection without the actual iteration over the entire sets: https://en.wikipedia.org/wiki/Bloom_filter#The_union_and_intersection_of_sets

2. The Skip list data structure

A Skip List is a data structure that allows fast search and intersection within an ordered sequence of elements. Fast search and intersection are made possible by maintaining a linked hierarchy of subsequences, each skipping over fewer elements.

Succinctly saying, the Skip List is very similar to the plain Linked List data structure, however each node of the Skip List has a couple of additional pointers to items, which are located further (pointers, which "skips" over the couple of other nodes of the list).

Hence, in order to obtain the intersection - it is needed to maintain the pointers inside all Skip Lists, which are being intersected. During the intersection of the Skip Lists pointers are jumping over the items, which are not present in all intersected Skip Lists. Hence, usually the runtime complexity of the intersection operation is faster then O(N).

The algorithm of the intersection of Skip Lists is described in the book "Introduction to Information Retrieval" (written by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze): http://nlp.stanford.edu/IR-book/html/htmledition/faster-postings-list-intersection-via-skip-pointers-1.html

Skip Lists are actively used in a high-performance, full-featured text search engine library: Apache Lucene (Skip Lists are used inside the Inverted Index component).

Here is an additional Stackoverflow question about the usage of Skip Lists in Lucene: how lucene use skip list in inverted index?

这篇关于快速路口操作的数据结构?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆