如何确定是否两个分区的数据点(聚类)是相同的? [英] How to determine if two partitions (clusterings) of data points are identical?

查看:196
本文介绍了如何确定是否两个分区的数据点(聚类)是相同的?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一些空间任意 N 数据点,我集群他们。
我聚类算法的结果是一个int载体psented分区重新$ P $ 长度 N 分配每个指向一个集群。 的值范围从0到(可能) N-1

I have n data points in some arbitrary space and I cluster them.
The result of my clustering algorithm is a partition represented by an int vector l of length n assigning each point to a cluster. Values of l ranges from 0 to (possibly) n-1.

例如:

l_1 = [ 1 1 1 0 0 2 6 ]

时的一个分区 N = 7 点到4类:前三个点都集中在一起,第四和第五在一起,最后两个点形成两个不同的单身集群。

Is a partition of n=7 points into 4 clusters: first three points are clustered together, the fourth and fifth are together and the last two points forms two distinct singleton clusters.

假设我有两个分区 L_1 L_2 我怎么能有效确定他们重新presents相同的分区?

Suppose I have two partitions l_1 and l_2 how can I efficiently determine if they represents identical partitions?

例如:

l_2 = [ 2 2 2 9 9 3 1 ]

L_1 ,因为它重新presents点相同的分区(尽管集群的数字/标签是不相同)。
另一方面

is identical to l_1 since it represents the same partitions of the points (despite the fact that the "numbers"/"labels" of the clusters are not identical).
On the other hand

l_3 = [ 2 2 2 9 9 3 3 ]

不再是相同的,因为它组合在一起的最后两分。

is no longer identical since it groups together the last two points.

我在寻找用C溶液++,Python或Matlab的。

I'm looking for a solution in either C++, python or Matlab.

一个幼稚的做法是比较共生矩阵

A naive approach would be to compare the co-occurrence matrix

c1 = bsxfun( @eq, l_1, l_1' );
c2 = bsxfun( @eq, l_2, l_2' );
l_1_l_2_are_identical = all( c1(:)==c2(:) );

该共生矩阵 C1 是大小 N X N 如果点 K M 在同一个集群,以其他方式(不论集群数量/标签)。
因此,如果共生矩阵 C1 C2 相同,则 L_1 L_2 重present相同的分区。

The co-occurrence matrix c1 is of size nxn with true if points k and m are in the same cluster and false otherwise (regardless of the cluster "number"/"label").
Therefore if the co-occurrence matrices c1 and c2 are identical then l_1 and l_2 represent identical partitions.

然而,由于点的数量, N ,可能是相当大的,我想避免O( N ^ 2 )的解决方案...

However, since the number of points, n, might be quite large I would like to avoid O(n^2) solutions...

任何想法?

谢谢!

推荐答案

在两个分区相同的?

也许,如果他们有相同的成员。

Probably if they have the exact same members.

所以,如果你只是想测试的身份,您可以执行以下操作:

So if you just want to test for identity, you can do the following:

替换每个分区ID用的最小的分区中的对象ID。

Substitute each partition ID with the smallest object ID in the partition.

那么双分区成为是相同的,当且仅当此重新presentation是相同的。

Then two partitionings are identical if and only if this representation is identical.

在你上面的例子,让我们假设矢量指数1。7是你的对象ID。然后,我会得到规范形式

In your example above, lets assume the vector index 1 .. 7 is your object ID. Then I would get the canonical form

[ 1 1 1 4 4 6 7 ]
  ^ first occurrence at pos 1 of 1 in l_1 / 2 in l_2
        ^ first occurrence at pos 4

有关L_1和L_2,而L_3 canonicalizes到

for l_1 and l_2, whereas l_3 canonicalizes to

[ 1 1 1 4 4 6 6 ]

为了使它更清楚,这里是另外一个例子:

To make it more clear, here is another example:

l_4 = [ A B 0 D 0 B A ]

canonicalizes到

canonicalizes to

      [ 1 2 3 4 3 2 1 ]

由于集群A中第一次出现是在位置1,B在位置2等。

since the first occurence of cluster "A" is at position 1, "B" at position 2 etc.

如果你想衡量的相似双聚类是,一个好办法就是看precision /召回/对象对F1,其中对(A,B)存在当且仅当a和b属于同一群集。

If you want to measure how similar two clusterings are, a good approach is to look at precision/recall/f1 of the object pairs, where the pair (a,b) exists if and only if a and b belong to the same cluster.

更新:由于有人声称,这是二次,我将进一步明确

Update: Since it was claimed that this is quadratic, I will further clarify.

要产生的规范形式,使用下面的方法(实际蟒蛇code):

To produce the canonical form, use the following approach (actual python code):

def canonical_form(li):
  """ Note, this implementation overwrites li """
  first = dict()
  for i in range(len(li)):
    v = first.get(li[i])
    if v is None:
      first[li[i]] = i
      v = i
    li[i] = v
  return li

print canonical_form([ 1, 1, 1, 0, 0, 2, 6 ])
# [0, 0, 0, 3, 3, 5, 6]
print canonical_form([ 2, 2, 2, 9, 9, 3, 1 ])
# [0, 0, 0, 3, 3, 5, 6]
print canonical_form([ 2, 2, 2, 9, 9, 3, 3 ])
# [0, 0, 0, 3, 3, 5, 5]
print canonical_form(['A','B',0,'D',0,'B','A'])
# [0, 1, 2, 3, 2, 1, 0]
print canonical_form([1,1,1,0,0,2,6]) == canonical_form([2,2,2,9,9,3,1])
# True
print canonical_form([1,1,1,0,0,2,6]) == canonical_form([2,2,2,9,9,3,3])
# False

这篇关于如何确定是否两个分区的数据点(聚类)是相同的?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆