Clustering with scipy - clusters via distance matrix, how to get back the original objects


Question

I can't seem to find any simple enough tutorials or descriptions of clustering in scipy, so I'll try to explain my problem:

I am trying to cluster documents (hierarchical agglomerative clustering). I have created a vector for each document and produced a symmetric distance matrix. The vector_list contains (really long) vectors representing each document. The order of this list of vectors is the same as the order of my list of input documents, so that I'll (hopefully) be able to match the results of the clustering with the corresponding documents.

distances = distance.cdist(vector_list, vector_list, 'euclidean') 

This gives a matrix like this, where the diagonal is each document's distance to itself (always 0):

[0 5 4]
[5 0 4]
[5 4 0]

I feed this distance matrix to scipy's linkage() function.

clusters = hier.linkage(distances, method='centroid', metric='euclidean')

This returns something I'm not quite sure how to interpret, but it comes out as a numpy.ndarray. According to the docs, I can feed this into fcluster to get 'flat clusters'. I use half of the maximum distance in the distance matrix as the threshold.

idx = hier.fcluster(clusters, 0.5*distances.max(), 'distance')

This returns a numpy.ndarray that again does not make much sense to me. An example is [6 3 1 7 1 8 9 4 5 2]

So my question: what do I get from the linkage and fcluster functions, and how can I go from there back to the documents that I created the distance matrix for in the first place, to see whether the clusters make any sense? Am I doing this right?

Recommended answer

First off, you don't need to go through the entire process with cdist and linkage if you use fclusterdata instead of fcluster; you can feed that function an (n_documents, n_features) array of term counts, tf-idf values, or whatever your features are.
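As a rough illustration of the fclusterdata route, here is a minimal sketch. The toy array X, the threshold of 1.0, and the choice of average linkage are assumptions made up for this example and are not taken from the question; real document vectors would replace X.

```python
import numpy as np
from scipy.cluster.hierarchy import fclusterdata

# Hypothetical toy data: 4 "documents" with 2 features each.
# Rows 0 and 1 are close together, rows 2 and 3 are close together.
X = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [5.0, 5.0],
    [5.1, 5.0],
])

# One call computes the pairwise distances, builds the hierarchy,
# and flattens it at the given distance threshold.
labels = fclusterdata(X, t=1.0, criterion='distance',
                      metric='euclidean', method='average')

# labels[i] is the flat cluster number of row i of X.
print(labels)
```

If you would rather keep the explicit linkage/fcluster steps, note that linkage treats a 2-D array as raw observations, so a square matrix from cdist would be read as feature vectors; scipy.spatial.distance.pdist produces the condensed form that linkage expects when given distances.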

The output from fclusterdata is the same as that of fcluster: an array T such that "T[i] is the flat cluster number to which original observation i belongs." That is, the cluster.hierarchy module flattens the clustering according to a threshold, which you set at 0.5*distances.max(). In your case, the third and fifth documents are clustered together, but all the others form clusters of their own, so you might want to set the threshold higher or use a different criterion.
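Because T[i] refers to the i-th input row, getting back to the original documents is a matter of pairing the label array with the document list in the same order. The sketch below uses the label array quoted in the question; the document names are hypothetical placeholders standing in for the real input list.

```python
from collections import defaultdict

# Flat cluster labels as quoted in the question (output of fcluster).
labels = [6, 3, 1, 7, 1, 8, 9, 4, 5, 2]

# Hypothetical document names, in the same order as the input vectors.
documents = [f"doc_{i}" for i in range(len(labels))]

# Group documents by the cluster label assigned to their position.
clusters = defaultdict(list)
for doc, label in zip(documents, labels):
    clusters[label].append(doc)

for label in sorted(clusters):
    print(label, clusters[label])
```

Here cluster 1 contains doc_2 and doc_4, i.e. the third and fifth documents, while every other label appears only once.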
