How to do clustering using the matrix of correlation coefficients?
Question
I have a correlation coefficient matrix (n*n). How can I do clustering using this correlation coefficient matrix?
Can I use the linkage and fcluster functions in SciPy? The linkage function needs an n*m matrix (according to the tutorial), but I want to use an n*n matrix.
My code is:
corre = mp_N.corr() # mp_N is raw data (m*n matrix)
Z = linkage(corre, method='average') # 'corre' is correlation coefficient matrix
fcluster(Z,2,'distance')
Is this code right? If it is wrong, how can I do clustering with the correlation coefficient matrix?
Answer
Clustering data using a correlation matrix is a reasonable idea, but the correlations have to be pre-processed first. First, the correlation matrix, as returned by numpy.corrcoef, is affected by machine-arithmetic errors:
- It is not always symmetric.
- The diagonal entries are not always exactly 1.
These can be fixed by averaging the matrix with its transpose and filling the diagonal with 1:
import numpy as np
data = np.random.randint(0, 10, size=(20, 10)) # 20 variables with 10 observations each
corr = np.corrcoef(data) # 20 by 20 correlation matrix
corr = (corr + corr.T)/2 # made symmetric
np.fill_diagonal(corr, 1) # put 1 on the diagonal
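A quick sanity check (a minimal sketch that repeats the cleanup above on fresh random data) confirms that after the fix the matrix is exactly symmetric with a unit diagonal:

```python
import numpy as np

# repeat the cleanup from above on fresh random data
data = np.random.randint(0, 10, size=(20, 10))
corr = np.corrcoef(data)
corr = (corr + corr.T) / 2   # averaging with the transpose gives bitwise symmetry
np.fill_diagonal(corr, 1)

# both properties now hold exactly, not just approximately
assert np.array_equal(corr, corr.T)
assert np.all(np.diag(corr) == 1)
```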
Second, the input to any clustering method, such as linkage, needs to measure the dissimilarity of objects, whereas correlation measures similarity. So it needs to be transformed in such a way that a correlation of 0 maps to a large number, while a correlation of 1 maps to 0.
This blog post discusses several ways of doing such a transformation, and recommends dissimilarity = 1 - abs(correlation). The idea is that a strong negative correlation indicates that the objects are related, just as a positive correlation does. Here is the continuation of the example:
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
dissimilarity = 1 - np.abs(corr)
hierarchy = linkage(squareform(dissimilarity), method='average')
labels = fcluster(hierarchy, 0.5, criterion='distance')
Note that we don't feed the full distance matrix into linkage; it needs to be compressed with squareform first.
Exactly which clustering method and which threshold to use depends on the context of your problem; there are no universal rules. Often 0.5 is a reasonable threshold for correlation, so that is what I used. With my 20 sets of random numbers I ended up with 7 clusters, encoded in labels as
[7, 7, 7, 1, 4, 4, 2, 7, 5, 7, 2, 5, 6, 3, 6, 1, 5, 1, 4, 2]
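To turn those labels into actual groups of variables, you can collect the indices that share each label. This sketch re-runs the whole pipeline from the answer with a fixed seed (the seed is an assumption for reproducibility; the resulting clusters will differ from the run quoted above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

np.random.seed(0)  # fixed seed, an assumption for reproducibility
data = np.random.randint(0, 10, size=(20, 10))

# pre-process the correlation matrix as described above
corr = np.corrcoef(data)
corr = (corr + corr.T) / 2
np.fill_diagonal(corr, 1)

# transform similarity to dissimilarity and cluster
dissimilarity = 1 - np.abs(corr)
hierarchy = linkage(squareform(dissimilarity), method='average')
labels = fcluster(hierarchy, 0.5, criterion='distance')

# group variable indices by their cluster label
clusters = {lab: np.flatnonzero(labels == lab).tolist()
            for lab in np.unique(labels)}
for lab, members in clusters.items():
    print(lab, members)
```

Every variable index appears in exactly one cluster, so the dictionary is a partition of the 20 variables.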