How to do clustering using the matrix of correlation coefficients?
Question
I have a correlation coefficient matrix (n*n). How can I do clustering using this correlation coefficient matrix?
Can I use the linkage and fcluster functions in SciPy? The linkage function needs an n*m matrix (according to the tutorial), but I want to use an n*n matrix.
My code is:
corre = mp_N.corr() # mp_N is raw data (m*n matrix)
Z = linkage(corre, method='average') # 'corre' is correlation coefficient matrix
fcluster(Z,2,'distance')
Is this code right? If it is wrong, how can I do clustering with the correlation coefficient matrix?
Answer
Clustering data using a correlation matrix is a reasonable idea, but the correlations have to be pre-processed first. First, the correlation matrix, as returned by numpy.corrcoef, is affected by machine-arithmetic errors:
- It is not always symmetric.
- The diagonal entries are not always exactly 1.
These can be fixed by averaging the matrix with its transpose and filling the diagonal with 1:
import numpy as np
data = np.random.randint(0, 10, size=(20, 10)) # 20 variables with 10 observations each
corr = np.corrcoef(data) # 20 by 20 correlation matrix
corr = (corr + corr.T)/2 # made symmetric
np.fill_diagonal(corr, 1) # put 1 on the diagonal
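A quick sanity check (a minimal sketch that repeats the cleanup above on fresh random data) confirms that after the fix the matrix is exactly symmetric with a unit diagonal:

```python
import numpy as np

# repeat the cleanup from above on fresh random data
data = np.random.randint(0, 10, size=(20, 10))
corr = np.corrcoef(data)
corr = (corr + corr.T) / 2   # averaging with the transpose gives bitwise symmetry
np.fill_diagonal(corr, 1)

# both properties now hold exactly, not just approximately
assert np.array_equal(corr, corr.T)
assert np.all(np.diag(corr) == 1)
```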
Second, the input to any clustering method, such as linkage, needs to measure the dissimilarity of objects, whereas correlation measures similarity. So it needs to be transformed in such a way that a correlation of 0 maps to a large number, while a correlation of 1 maps to 0.
This blog post discusses several ways of doing such a transformation, and recommends dissimilarity = 1 - abs(correlation). The idea is that a strong negative correlation indicates that the objects are related, just as a positive correlation does. Here is the continuation of the example:
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
dissimilarity = 1 - np.abs(corr)
hierarchy = linkage(squareform(dissimilarity), method='average')
labels = fcluster(hierarchy, 0.5, criterion='distance')
Note that we don't feed the full distance matrix into linkage; it needs to be compressed with squareform first.
Exactly which clustering method and which threshold to use depends on the context of your problem; there are no universal rules. Often 0.5 is a reasonable threshold for correlation, so that is what I used. With my 20 sets of random numbers I ended up with 7 clusters, encoded in labels as
[7, 7, 7, 1, 4, 4, 2, 7, 5, 7, 2, 5, 6, 3, 6, 1, 5, 1, 4, 2]
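To turn those labels into actual groups of variables, you can collect the indices that share each label. This sketch re-runs the whole pipeline from the answer with a fixed seed (the seed is an assumption for reproducibility; the resulting clusters will differ from the run quoted above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

np.random.seed(0)  # fixed seed, an assumption for reproducibility
data = np.random.randint(0, 10, size=(20, 10))

# pre-process the correlation matrix as described above
corr = np.corrcoef(data)
corr = (corr + corr.T) / 2
np.fill_diagonal(corr, 1)

# transform similarity to dissimilarity and cluster
dissimilarity = 1 - np.abs(corr)
hierarchy = linkage(squareform(dissimilarity), method='average')
labels = fcluster(hierarchy, 0.5, criterion='distance')

# group variable indices by their cluster label
clusters = {lab: np.flatnonzero(labels == lab).tolist()
            for lab in np.unique(labels)}
for lab, members in clusters.items():
    print(lab, members)
```

Every variable index appears in exactly one cluster, so the dictionary is a partition of the 20 variables.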