数据结构匹配套 [英] Data structure for matching sets

查看:174
本文介绍了数据结构匹配套的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个应用程序,我公司拥有一批集。一组可能
{4,7,12,18}
独特的数字和所有小于50。

I have an application where I have a number of sets. A set might be
{4, 7, 12, 18}
unique numbers and all less than 50.

然后我有几个数据项:
1 {1,2,4,7,8,12,18,23,29}
2 {3,4,6,7,15,23,34,38}
3 {4,7,12,18}
4 {1,4,7,12,13,14,15,16,17,18}​​
5 {2,4,6,7,13,15}

I then have several data items:
1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
2 {3, 4, 6, 7, 15, 23, 34, 38}
3 {4, 7, 12, 18}
4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
5 {2, 4, 6, 7, 13, 15}

的数据项1,3和4的匹配的集合,因为它们包含在该组的所有项目。

Data items 1, 3 and 4 match the set because they contain all items in the set.

我需要设计一个数据结构,它是超快速查明一个数据项的是否是一组中的一员包括所有属于集的一部分的部件(因此该数据项是一个该组的超集)。此刻我最好的估计表明,将有不少于5万套。

I need to design a data structure that is super fast at identifying whether a data item is a member of a set includes all the members that are part of the set (so the data item is a superset of the set). My best estimates at the moment suggest that there will be fewer than 50,000 sets.

我目前实现了我的集合和数据为无符号64位整数,并保存在一个列表中的组。然后检查数据项我通过列表迭代做了((集&安培;数据)==组)比较。它的工作原理和它的空间利用率,但是它的速度慢(为O(n)),我很乐意为一些性能方面的一些内存。有没有人有关于如何组织这个更好的想法?

My current implementation has my sets and data as unsigned 64 bit integers and the sets stored in a list. Then to check a data item I iterate through the list doing a ((set & data) == set) comparison. It works and it's space efficient but it's slow (O(n)) and I'd be happy to trade some memory for some performance. Does anyone have any better ideas about how to organize this?

编辑: 非常感谢所有的答案。它看起来像我需要提供有关该问题的更多的信息。我得到的第一个集合,然后我得到的数据项目一个接一个。我需要检查数据项是否匹配台之一。
集是非常可能是'块状'例如对于给定的问题1,图3和9中可能包含95%的套;我可以predict这在一定程度上提前(但不是很好)。

Thanks very much for all the answers. It looks like I need to provide some more information about the problem. I get the sets first and I then get the data items one by one. I need to check whether the data item is matches one of the sets.
The sets are very likely to be 'clumpy' for example for a given problem 1, 3 and 9 might be contained in 95% of sets; I can predict this to some degree in advance (but not well).

有关那些暗示记忆化:这是这种数据结构用于memoized函数。集重新那些已经被计算present一般性的解决方案和数据项是新函数的输入。通过匹配一个数据项的通用解决方案,我可以避开一大堆的处理。

For those suggesting memoization: this is this the data structure for a memoized function. The sets represent general solutions that have already been computed and the data items are new inputs to the function. By matching a data item to a general solution I can avoid a whole lot of processing.

推荐答案

我看到另一种解决方案是双到你(即测试对每个组的数据项),并且使用的是二叉树,其中每个节点测试是否具体项目包括与否。

I see another solution which is dual to yours (i.e., testing a data item against every set) and that is using a binary tree where each node tests whether a specific item is included or not.

例如,如果你有集合A = {2,3}和B = {4}和C = {1,3}你有以下的树

For instance if you had the sets A = { 2, 3 } and B = { 4 } and C = { 1, 3 } you'd have the following tree

                      _NOT_HAVE_[1]___HAVE____
                      |                      |            
                _____[2]_____          _____[2]_____
                |           |          |           |
             __[3]__     __[3]__    __[3]__     __[3]__
             |     |     |     |    |     |     |     |
            [4]   [4]   [4]   [4]  [4]   [4]   [4]   [4]
            / \   / \   / \   / \  / \   / \   / \   / \
           .   B .   B .   B .   B    B C   B A   A A   A
                                            C     B C   B
                                                        C

让树后,你只需做50比较---或者,你可以怎么过许多项目在一组。

After making the tree, you'd simply need to make 50 comparisons---or how ever many items you can have in a set.

例如,对于{1,4},您通过分支树:右(下集有1个),左(没有2),左,右,你会得到[B],意思是只设置B被包含在{1,4}

For instance, for { 1, 4 }, you branch through the tree : right (the set has 1), left (doesn't have 2), left, right, and you get [ B ], meaning only set B is included in { 1, 4 }.

这基本上是所谓的二元决策图。如果你是在节点的冗余得罪了(因为你是应该的,因为2 ^ 50是很多节点...),那么你应该考虑的简化形式,它被称为精简,有序二元决策图和是一种常用的数据结构。在这个版本中,节点的时候,他们是多余的合并,你不再有一个二叉树,但有向无环图。

This is basically called a "Binary Decision Diagram". If you are offended by the redundancy in the nodes (as you should be, because 2^50 is a lot of nodes...) then you should consider the reduced form, which is called a "Reduced, Ordered Binary Decision Diagram" and is a commonly used data-structure. In this version, nodes are merged when they are redundant, and you no longer have a binary tree, but a directed acyclic graph.

在ROBBDs的维基百科页面可以为您提供更多信息,以及链接到实现这个数据 - 库结构不同的语言。

The Wikipedia page on ROBBDs can provide you with more information, as well as links to libraries which implement this data-structure for various languages.

这篇关于数据结构匹配套的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆