Similarity between two data sets or arrays

Problem description

Let's say I have a dataset that looks like this:

{A:1, B:3, C:6, D:6}

I also have a list of other sets to compare my specific set against:

{A:1, B:3, C:6, D:6},  
{A:2, B:3, C:6, D:6},  
{A:99, B:3, C:6, D:6},  
{A:5, B:1, C:6, D:9},  
{A:4, B:2, C:2, D:6}

My entries could be visualized as a table with four columns: A, B, C, and D.

How can I find the set with the most similarity? For this example, row 1 is a perfect match and row 2 is a close second, while row 3 is quite far away.

I am thinking of calculating a simple delta, for example Abs(a1 - a2) + Abs(b1 - b2) + ..., and perhaps computing a correlation value for the entries with the best deltas.

Is this a valid approach? And what is this problem called?

Recommended answer

This type of problem is usually described in terms of "distance" or "similarity".

Simply calculating the sum of absolute differences, as you've done, should work fairly well. This is called the Manhattan distance. In mathematical terms, it would be: $\sum_{x \in \{a,b,c,d\}} |x_1 - x_2|$.
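As a minimal sketch of this measure applied to the data from the question, the snippet below computes the Manhattan distance from the target row to each candidate row and ranks the candidates. The names `manhattan_distance`, `target`, and `candidates` are illustrative choices, and the code assumes every row has the same keys.

```python
def manhattan_distance(a, b):
    """Sum of absolute differences over the keys of `a`.

    Assumes `b` contains the same keys as `a`.
    """
    return sum(abs(a[key] - b[key]) for key in a)


target = {"A": 1, "B": 3, "C": 6, "D": 6}
candidates = [
    {"A": 1, "B": 3, "C": 6, "D": 6},
    {"A": 2, "B": 3, "C": 6, "D": 6},
    {"A": 99, "B": 3, "C": 6, "D": 6},
    {"A": 5, "B": 1, "C": 6, "D": 9},
    {"A": 4, "B": 2, "C": 2, "D": 6},
]

distances = [manhattan_distance(target, row) for row in candidates]
# Row indices (0-based) ordered from most to least similar.
ranking = sorted(range(len(candidates)), key=lambda i: distances[i])

print(distances)  # [0, 1, 98, 9, 8]
print(ranking)    # [0, 1, 4, 3, 2]
```

A distance of 0 is a perfect match, so the first row ranks first, the second row follows closely, and the third row (with A:99) lands last.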

That said, the best measure really depends on the behaviour you want.

A ratio could potentially be a better idea.

Consider something like 1000000, 5, 5, 5 vs 999995, 5, 5, 5 and 1000000, 0, 5, 5.

According to the sum-of-absolute-differences formula, the first would be equally similar to both the second and the third (a distance of 5 in each case).

If this is not desired (as 999995 can be considered pretty close to 1000000, while 0 can be thought of as quite far from 5), you should divide each absolute difference by the larger of the two values when calculating each distance.

$$\sum_{x \in \{a,b,c,d\}} \frac{|x_1 - x_2|}{\max(x_1, x_2)}$$

For non-negative values, this puts every term between 0 and 1: each term is the difference between the two values expressed as a fraction of the larger one.

This means that, for our above example, we'd consider 1000000, 5, 5, 5 and 999995, 5, 5, 5 to be very similar (since the above sum will be |1000000-999995|/1000000 + 0 + 0 + 0 = 0.000005), while 1000000, 5, 5, 5 and 1000000, 0, 5, 5 will be considered much more different (since the sum will be 0 + |5-0|/5 + 0 + 0 = 1).
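The two sums worked out above can be checked with a short sketch. The function name `relative_distance` is hypothetical, and the code assumes non-negative values and sequences of equal length. Treating a pair of zeros as contributing 0 is an assumption made here to avoid dividing by zero; the answer does not specify that case.

```python
def relative_distance(xs, ys):
    """Sum of |x - y| / max(x, y) over paired values.

    Assumes non-negative numbers and equal-length inputs.
    """
    total = 0.0
    for x, y in zip(xs, ys):
        largest = max(x, y)
        if largest == 0:
            # Both values are 0: assumed to contribute no distance.
            continue
        total += abs(x - y) / largest
    return total


first = [1000000, 5, 5, 5]
second = [999995, 5, 5, 5]
third = [1000000, 0, 5, 5]

print(relative_distance(first, second))  # 5e-06
print(relative_distance(first, third))   # 1.0
```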

If negative values are possible, the formula would need to be updated appropriately. You'd need to decide how you want to handle that based on the problem you're trying to solve. Should 10 to 0 be more or less different than (or equivalent to) 5 to -5?
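To make that choice concrete, here is a sketch of two possible per-value terms for signed inputs. Neither comes from the answer itself; both `term_max_abs` and `term_sum_abs` are hypothetical helpers, shown only to illustrate how the choice of denominator answers the question differently.

```python
def term_max_abs(x, y):
    """|x - y| divided by the larger magnitude; ranges from 0 to 2."""
    denom = max(abs(x), abs(y))
    return 0.0 if denom == 0 else abs(x - y) / denom


def term_sum_abs(x, y):
    """|x - y| divided by the sum of magnitudes; ranges from 0 to 1."""
    denom = abs(x) + abs(y)
    return 0.0 if denom == 0 else abs(x - y) / denom


print(term_max_abs(10, 0), term_max_abs(5, -5))  # 1.0 2.0
print(term_sum_abs(10, 0), term_sum_abs(5, -5))  # 1.0 1.0
```

With the first option, 5 to -5 counts as twice as different as 10 to 0; with the second, the two are treated as equivalent. Which is appropriate depends on the problem.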

Consider something like A=1, B=2, C=3, D=4 and A=4, B=1, C=2, D=3.

While every individual element has changed, the set still consists of 1, 2, 3, 4 and each element is simply shifted by 1 position (apart from 4).

For some problems this isn't going to matter at all and the above wouldn't be all that different than going from A=1, B=11, C=21, D=31 to A=2, B=12, C=22, D=32. For other problems it could be quite relevant though.

For a sequence like a string or array, the idea of inserting, deleting or shifting elements could make sense. If so, you would want to look at edit distance, a common one of which would be Levenshtein distance. You might also want to think about modifying this to consider how much individual values differ by (but this would not be trivial).
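For reference, a minimal sketch of Levenshtein distance over arbitrary sequences follows, using the standard dynamic-programming formulation with unit costs for insertion, deletion, and substitution. The function name `levenshtein` is an illustrative choice, and the sketch does not weight substitutions by how much the values differ.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence `a` into sequence `b`."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j items of `b`.
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            substitution_cost = 0 if x == y else 1
            current.append(min(
                previous[j] + 1,                      # delete x
                current[j - 1] + 1,                   # insert y
                previous[j - 1] + substitution_cost,  # substitute x with y
            ))
        previous = current
    return previous[-1]


print(levenshtein([1, 2, 3, 4], [4, 1, 2, 3]))        # 2
print(levenshtein([1, 11, 21, 31], [2, 12, 22, 32]))  # 4
```

The shifted sequence is only two edits away (insert 4 at the front, delete 4 at the end), whereas the sequence with every value changed is four edits away, even though each of its values moved by only 1.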

For something like a set, there is no strict order on the elements ({1, 2, 3} is the same as {3, 1, 2}). If this is the case, the simplest approach might be to sort the values and then use edit distance. You may also be able to loop through both sorted collections at the same time in some way, which would allow you to more easily take the differences between values into account.
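One possible reading of "loop through both at the same time" is to sort both collections and sum the absolute differences of the paired values. This is only a sketch under a strong assumption, namely that both collections have the same number of elements. The name `sorted_pairwise_distance` is hypothetical.

```python
def sorted_pairwise_distance(xs, ys):
    """Sort both collections, then sum absolute differences pairwise.

    Assumes both collections have the same number of elements.
    """
    if len(xs) != len(ys):
        raise ValueError("collections must have the same size")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys)))


print(sorted_pairwise_distance({1, 2, 3}, {3, 1, 2}))        # 0
print(sorted_pairwise_distance([1, 2, 3, 4], [4, 1, 2, 3]))  # 0
print(sorted_pairwise_distance([1, 2, 3], [1, 2, 10]))       # 7
```

Collections of different sizes would need some alignment step first, which is where an edit-distance style approach becomes useful.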
