高效的算法来比较两组数字之间的相似性? [英] efficient algorithm to compare similarity between sets of numbers?

查看:165
本文介绍了高效的算法来比较两组数字之间的相似性?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有大量的组数字。每一集包含10个号码,我需要删除的有5个以上的号码(无序)与任何其他集匹配所有集合。

I have a large number of sets of numbers. Each set contains 10 numbers and I need to remove all sets that have 5 or more number (unordered) matches with any other set.

例如:

set 1: {12,14,222,998,1,89,43,22,7654,23}
set 2: {44,23,64,76,987,3,2345,443,431,88}
set 3: {998,22,7654,345,112,32,89,9842,31,23}

由于3组10号以上的组1和3套将被视为重复的,因为他们有5个符合数字。因此,在这种情况下,我会删除设置3(因为它被认为类似于设置1)。

Given the 3 sets of 10 numbers above sets 1 and sets 3 would be considered duplicates because they have 5 matching numbers. So, in this case I would remove set 3 (because it's considered similar to set 1).

我有超过10000套相比,我想这样做非常有效。我已经把这个了,我只是想不出来进行这种比较的有效方式(这将是巨大的,为此在单通)。

I have over 10000 sets to compare and I want to do this very efficiently. I've been turning this over and I just can't think of an efficient way to perform this comparison (it would be great to do this in a single pass).

什么想法?谢谢!

迈克

推荐答案

您应该重新考虑你的要求,因为它是,操作甚至没有一个明确的结果。例如,把这些集合:

You should rethink your requirements because as it is, the operation does not even have a well-defined result. For example, take these sets:

set 1: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} 
set 2: {6, 7, 8, 9, 10, 11, 12, 13, 14, 15} 
set 3: {11, 12, 13, 14, 15, 16, 17, 18, 19, 20}

如果你先考虑1和2是重复和消除设置1,那么2和3是也重复,你会留下只剩下一个组。但如果你消除了设置2,再1和3都没有比赛,你只剩下两套剩余。

If you first consider 1 and 2 to be "duplicates" and eliminate set 1, then 2 and 3 are also "duplicates" and you are left with only one remaining set. But if you instead eliminate set 2 first, then 1 and 3 have no matches and you are left with two sets remaining.

您可以轻松地扩展为完整的1万套这样,这将是可能的,这取决于它设置你比较并消除第一,你可能会只剩一组,或5000套。我不认为这是你想要的。

You can easily expand this to your full 10,000 sets so that it would be possible that depending on which sets you compare and eliminate first, you could be left with only a single set, or with 5,000 sets. I don't think that is what you want.

从数学上来说,你的问题是,你正在努力寻找等价类的,但相对于相似性你用它来定义它们不是等价关系。具体而言,它是不传递的。通俗地说,如果设置A是相似,设置B和集B是类似的设置C,那么你的定义并不保证A也是相似C,因此,你不能有意义消除类似的套。

Mathematically speaking, your problem is that you are trying to find equivalence classes, but the relation "similarity" you use to define them is not an equivalence relation. Specifically, it is not transitive. In layman's terms, if set A is "similar" to set B and set B is "similar" to set C, then your definition does not ensure that A is also "similar" to C, and therefore you cannot meaningfully eliminate similar sets.

您需要先明确你的需求担心的高效实现之前解决这个问题。要么找到一种方法来定义一个传递的相似性,或保留所有集合只有用比较的工作(或类似的套每单组列表)。

You need to first clarify your requirements to deal with this problem before worrying about an efficient implementation. Either find a way to define a transitive similarity, or keep all sets and work only with comparisons (or with a list of similar sets for each single set).

这篇关于高效的算法来比较两组数字之间的相似性?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆