How to do clustering using the matrix of correlation coefficients?

Problem Description

I have a correlation coefficient matrix (n*n). How to do clustering using the correlation coefficient matrix?

Can I use the linkage and fcluster functions in SciPy?

The linkage function expects an n*m matrix (according to the tutorial), but I want to use an n*n matrix.

My code is:

from scipy.cluster.hierarchy import linkage, fcluster

corre = mp_N.corr()                   # mp_N is the raw data (m*n matrix)
Z = linkage(corre, method='average')  # 'corre' is the correlation coefficient matrix
fcluster(Z, 2, 'distance')

Is this code right? If this code is wrong, how can I do clustering with a correlation coefficient matrix?

Recommended Answer

Clustering data using a correlation matrix is a reasonable idea, but the correlations have to be pre-processed first. First, the correlation matrix, as returned by numpy.corrcoef, is affected by machine arithmetic errors:


  1. It is not always exactly symmetric.

  2. The diagonal entries are not necessarily exactly 1 (both points are checked in the sketch below).
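
A quick way to see both artifacts on a concrete matrix is the following minimal check, assuming random data standing in for real observations; whether the exact equality fails depends on your NumPy/BLAS build:

import numpy as np

rng = np.random.default_rng(0)
check_data = rng.random((20, 10))             # 20 variables with 10 observations each
check_corr = np.corrcoef(check_data)

print(np.allclose(check_corr, check_corr.T))  # close to symmetric...
print((check_corr == check_corr.T).all())     # ...but possibly not bit-for-bit equal
print(np.abs(np.diag(check_corr) - 1).max())  # how far the diagonal is from exactly 1 (may be tiny but nonzero)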

These can be fixed by averaging the matrix with its transpose and filling the diagonal with 1:

import numpy as np
data = np.random.randint(0, 10, size=(20, 10))   # 20 variables with 10 observations each
corr = np.corrcoef(data)                         # 20 by 20 correlation matrix
corr = (corr + corr.T)/2                         # made symmetric
np.fill_diagonal(corr, 1)                        # put 1 on the diagonal

Second, the input to any clustering method, such as linkage, needs to measure the dissimilarity of objects, whereas correlation measures similarity. So it needs to be transformed so that a correlation of 0 maps to a large number and a correlation of 1 maps to 0.

This blog post discusses several ways of making such a transformation and recommends dissimilarity = 1 - abs(correlation). The idea is that a strong negative correlation is also an indication that the objects are related, just as a positive correlation is. Here is the continuation of the example:

from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

dissimilarity = 1 - np.abs(corr)
hierarchy = linkage(squareform(dissimilarity), method='average')
labels = fcluster(hierarchy, 0.5, criterion='distance')

Note that we don't feed the full distance matrix into linkage; it needs to be condensed with squareform first.
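
As a small illustration (not part of the original answer), squareform converts a symmetric n*n dissimilarity matrix with a zero diagonal into the condensed vector of its upper-triangular entries, which is the form linkage accepts for pre-computed distances, and it also converts back:

import numpy as np
from scipy.spatial.distance import squareform

# Tiny symmetric dissimilarity matrix for 3 objects (zero diagonal is required)
d = np.array([[0.0, 0.2, 0.9],
              [0.2, 0.0, 0.6],
              [0.9, 0.6, 0.0]])

condensed = squareform(d)     # upper-triangular entries, row by row: [0.2 0.9 0.6]
print(condensed)
print(squareform(condensed))  # round-trips back to the 3x3 square matrix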

Which clustering method and which threshold to use depends on the context of your problem; there are no universal rules. Often, 0.5 is a reasonable threshold for correlations, so that is what I used. With my 20 sets of random numbers I ended up with 7 clusters, encoded in labels as

[7, 7, 7, 1, 4, 4, 2, 7, 5, 7, 2, 5, 6, 3, 6, 1, 5, 1, 4, 2] 
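
If you prefer the result as explicit groups of variable indices rather than a flat label array, a short post-processing sketch (assuming only the labels array from the snippet above) could look like this:

from collections import defaultdict

# Group variable indices by their cluster label
clusters = defaultdict(list)
for index, label in enumerate(labels):
    clusters[label].append(index)

for label, members in sorted(clusters.items()):
    print(f"cluster {label}: variables {members}")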
