匹配集的数据结构 [英] Data structure for matching sets

查看:121
本文介绍了匹配集的数据结构的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个应用程序,我有一些集合。一组可能是

{4,7,12,18}

唯一编号,全部小于50。



然后我有几个数据项:

1 {1,2,4,7,8,12,18,23,29}

2 {3,4,6,7 ,15,23,34,38}

3 {4,7,12,18}

4 {1,4,7,12,13,14,15,16 ,17,18}

5 {2,4,6,7,13,15}



数据项1,3和4匹配因为它们包含集合中的所有项目。



我需要设计一个超快的数据结构,用于识别数据项是否是set 包括作为集合的一部分的所有成员(因此数据项是集合的超集)。我现在最好的估计是建议少于50,000套。



我目前的实现将我的集合和数据作为无符号的64位整数,并将集合存储在列表。然后检查一个数据项,我通过执行((set& data)== set)比较的列表进行迭代。它的作品,它的空间有效率,但它是缓慢的(O(n)),我很乐意交易一些记忆的一些表现。有没有人有任何更好的想法如何组织这个?



编辑:
非常感谢所有的答案。看来我需要提供一些关于这个问题的更多信息。我先获得集合,然后我逐个获取数据项。我需要检查数据项是否匹配其中一个集合。

对于给定的问题,集合很可能是块状1,3和9可能包含在95%的套;我可以在某种程度上提前预测(但不好)。



对于建议记忆的人:这是一个记忆功能的数据结构。这些集合表示已经计算的一般解决方案,数据项是该函数的新输入。通过将数据项与一般解决方案进行匹配,我可以避免大量处理。

解决方案

我看到另一个解决方案是双到你的(即,针对每个集合测试一个数据项),并且使用二进制树,其中每个节点测试是否包含特定项目。



例如如果你有A = {2,3}和B = {4}并且C = {1,3},你将有以下树

  _NOT_HAVE_ [1] ___ HAVE____ 
| |
_____ [2] _____ _____ [2] _____
| | | |
__ [3] __ __ [3] __ __ [3] __ __ [3] __
| | | | | | | |
[4] [4] [4] [4] [4] [4] [4] [4]
/ \ / \ / \ / \ / \ / \ / \ / \
。 B。 B。 B。 BBCBAAAA
CBCB
C

制作树后,您只需要进行50次比较---或者一组中可以有多少项目。



例如,对于{1,4},您可以通过树进行分支:右(集合有1),左(没有2),左,右,你得到[B],意思是只有集合B包含在{1,4}中。



这基本上被称为二进制决策图。如果你被节点中的冗余所冒犯(你应该是,因为2 ^ 50是很多节点...),那么你应该考虑简化的形式,这被称为简化的,有序的二进制决策图,是一种常用的数据结构。在这个版本中,当它们是冗余的时候,节点被合并,你不再有一个二叉树,而是一个有向无环图。



ROBBD上的维基百科页面可以为您提供更多信息,以及实现各种语言数据结构的库的链接。 / p>

I have an application where I have a number of sets. A set might be
{4, 7, 12, 18}
unique numbers and all less than 50.

I then have several data items:
1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
2 {3, 4, 6, 7, 15, 23, 34, 38}
3 {4, 7, 12, 18}
4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
5 {2, 4, 6, 7, 13, 15}

Data items 1, 3 and 4 match the set because they contain all items in the set.

I need to design a data structure that is super fast at identifying whether a data item is a member of a set includes all the members that are part of the set (so the data item is a superset of the set). My best estimates at the moment suggest that there will be fewer than 50,000 sets.

My current implementation has my sets and data as unsigned 64 bit integers and the sets stored in a list. Then to check a data item I iterate through the list doing a ((set & data) == set) comparison. It works and it's space efficient but it's slow (O(n)) and I'd be happy to trade some memory for some performance. Does anyone have any better ideas about how to organize this?

Edit: Thanks very much for all the answers. It looks like I need to provide some more information about the problem. I get the sets first and I then get the data items one by one. I need to check whether the data item is matches one of the sets.
The sets are very likely to be 'clumpy' for example for a given problem 1, 3 and 9 might be contained in 95% of sets; I can predict this to some degree in advance (but not well).

For those suggesting memoization: this is this the data structure for a memoized function. The sets represent general solutions that have already been computed and the data items are new inputs to the function. By matching a data item to a general solution I can avoid a whole lot of processing.

解决方案

I see another solution which is dual to yours (i.e., testing a data item against every set) and that is using a binary tree where each node tests whether a specific item is included or not.

For instance if you had the sets A = { 2, 3 } and B = { 4 } and C = { 1, 3 } you'd have the following tree

                      _NOT_HAVE_[1]___HAVE____
                      |                      |            
                _____[2]_____          _____[2]_____
                |           |          |           |
             __[3]__     __[3]__    __[3]__     __[3]__
             |     |     |     |    |     |     |     |
            [4]   [4]   [4]   [4]  [4]   [4]   [4]   [4]
            / \   / \   / \   / \  / \   / \   / \   / \
           .   B .   B .   B .   B    B C   B A   A A   A
                                            C     B C   B
                                                        C

After making the tree, you'd simply need to make 50 comparisons---or how ever many items you can have in a set.

For instance, for { 1, 4 }, you branch through the tree : right (the set has 1), left (doesn't have 2), left, right, and you get [ B ], meaning only set B is included in { 1, 4 }.

This is basically called a "Binary Decision Diagram". If you are offended by the redundancy in the nodes (as you should be, because 2^50 is a lot of nodes...) then you should consider the reduced form, which is called a "Reduced, Ordered Binary Decision Diagram" and is a commonly used data-structure. In this version, nodes are merged when they are redundant, and you no longer have a binary tree, but a directed acyclic graph.

The Wikipedia page on ROBBDs can provide you with more information, as well as links to libraries which implement this data-structure for various languages.

这篇关于匹配集的数据结构的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆