如何找到与在共同项目的最大数目的子集? [英] How to find the subset with the greatest number of items in common?

查看:124
本文介绍了如何找到与在共同项目的最大数目的子集?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

让我们说我有一些'知'集:

Let's say I have a number of 'known' sets:

1 {a, b, c, d, e}
2 {b, c, d, e}
3 {a, c, d}
4 {c, d}

我想一个函数,它采用一组作为输入,(例如 {A,C,D,E} ),并发现有一组最高数量的元件,并且在共同没有更多的其他项目。换言之,以最大的基数的子集。答案不必是一个适当的子集。在这种情况下,答案应该是 {A,C,D}

I'd like a function which takes a set as an input, (for example {a, c, d, e}) and finds the set that has the highest number of elements, and no more other items in common. In other words, the subset with the greatest cardinality. The answer doesn't have to be a proper subset. The answer in this case would be {a, c, d}.

编辑:上面的例子是错误的,现在修好了

我试图找到这样做的绝对是最有效的方式。

I'm trying to find the absolute most efficient way of doing this.

(在下面,我假设比较两组成本的 O(1)的为简单起见,该操作是我控制范围之内,所以没有一点思考它。实际上这将是被比较的两个集合的基数的函数。)

(In the below, I am assuming that the cost of comparing two sets is O(1) for the sake of simplicity. That operation is outside my control so there's no point thinking about it. In truth it would be a function of the cardinality of the two sets being compared.)

Candiate 1:

Candiate 1:

生成输入的所有子集,然后迭代已知集并返回最大的一个是一个子集。这样做的缺点是,复杂性将是这样的 O(N'次; M!的),其中的 N 的是输入设置和基数的 M 的是已知的子集的数量。

Generate all subsets of the input, then iterate over the known sets and return the largest one that is a subset. The downside to this is that the complexity will be something like O(n! × m), where n is the cardinality of the input set and m is the number of 'known' subsets.

候选人1A(感谢@bratbrat):

Candidate 1a (thanks @bratbrat):

遍历所有已知集和计算交会的cardinatlity,并采取具有最高值。这将是的 O(N)的,其中的 N 的是亚群的数量。

Iterate over all 'known' sets and calculate the cardinatlity of the intersection, and take the one with the highest value. This would be O(n) where n is the number of subsets.

候选2:

创建逆表并计算输入和已知组之间的欧氏距离。这可能是相当快的。我不清楚我怎么可能会限制这仅包括子集没有后续的 O(N)的过滤器。

Create an inverse table and calculate the euclidean distance between the input and the known sets. This could be quite quick. I'm not clear how I could limit this to include only subsets without a subsequent O(n) filter.

候选人3:

遍历所有已知的组和与之比较的输入。复杂性会的 O(N)的,其中的 N 的是知名台的数量。

Iterate over all known sets and compare against the input. The complexity would be O(n) where n is the number of known sets.

我已经在我手上内置到Python和Redis的设定功能。

I have at my disposal the set functions built into Python and Redis.

所有这些似乎特别大。想法?套的数量可能会很大(约10万在猜测)。

None of these seems particularly great. Ideas? The number of sets may get large (around 100,000 at a guess).

推荐答案

有没有可能的方式做到这一点,在不到O(n)时间...只是读取输入为O(N)。

There's no possible way to do this in less than O(n) time... just reading the input is O(n).

一对夫妇的想法:

排序组由大小(最大第一),并搜索是输入集合的子集的第一组。一旦你找到一个,你不必检查其余部分。

Sort the sets by size (biggest first), and search for the first set which is a subset of the input set. Once you find one, you don't have to examine the rest.

如果可能的项目可以是在组的数量是有限的,则可以重新由位矢量present它们。然后,你可以计算出一个查找表来告诉你一组给定的是否是输入集合的子集。 (走在位每个输入所考虑设置,一个字一个字,索引每个单词到相应的表中,如果你发现一个,来告诉你,这是不是一个子集,同样,你可以直接移动到下一个输入集。 )这是否会实际购买你的表现,取决于实现语言。我想这将是最有效的与原始的整数类型语言,如C或Java。

If the number of possible items which could be in the sets is limited, you could represent them by bit-vectors. Then you could calculate a lookup table to tell you whether a given set is a subset of the input set. (Walk down the bits for each input set under consideration, word by word, indexing each word into the appropriate table. If you find an entry telling you that it's not a subset, again, you can move on directly to the next input set.) Whether this would actually buy you performance, depends on the implementation language. I imagine it would be most effective in a language with primitive integral types, like C or Java.

这篇关于如何找到与在共同项目的最大数目的子集?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆