scipy.cluster.hierarchy.fclusterdata + distance measure

Question

1) I am using scipy's hcluster module.

So the variable that I have control over is the threshold. How do I measure performance per threshold? In K-means, for example, this would be the sum of the distances from all points to their centroids. Of course, this has to be adjusted, since more clusters generally means smaller distances.

Is there a comparable measurement that I can make with hcluster?

2) I realize there are tons of metrics available for fclusterdata. I am clustering text documents based on the tf-idf of key terms. The catch is that some documents are longer than others, and I think cosine is a good way to "normalize" this length issue: however long a document is, its "direction" in an n-dimensional space SHOULD stay the same if its content is consistent. Are there any other methods someone can suggest? How can I evaluate them?

Thx

Recommended answer

One can calculate the average distance |x - cluster centre| for x in each cluster, just as for K-means. The script below does this by brute force. (There must be a built-in for this in scipy.cluster or scipy.spatial.distance, but I can't find it either.)

On your question 2, pass. Any links to good tutorials on hierarchical clustering would be welcome.
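On question 2, here is a minimal sketch rather than a definitive answer. It assumes a tiny made-up matrix `docs` standing in for real tf-idf vectors (the values are hypothetical). It clusters with the cosine metric, compares that with L2-normalising the rows and then using Euclidean distance, and computes the cophenetic correlation as one possible way to judge how well the tree preserves the pairwise distances.

```python
import numpy as np
import scipy.cluster.hierarchy as hier
import scipy.spatial.distance as dist

# Hypothetical tf-idf-like matrix: rows are documents, columns are terms.
# Rows 0/1 and rows 2/3 point in nearly the same direction but differ in length.
docs = np.array([
    [1.0, 2.0, 0.0, 0.0],
    [5.0, 11.0, 0.0, 0.0],   # a much "longer" document similar to row 0
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 30.0, 12.0],  # a much "longer" document similar to row 2
])

# Cosine distance depends only on direction, not on vector length.
Y_cos = dist.pdist(docs, metric="cosine")
Z_cos = hier.linkage(Y_cos, method="average")
labels_cos = hier.fcluster(Z_cos, t=2, criterion="maxclust")

# The same clustering in a single call through fclusterdata.
labels_fcd = hier.fclusterdata(docs, 2, criterion="maxclust",
                               metric="cosine", method="average")

# Alternative: L2-normalise the rows, then use Euclidean distance.
# For unit vectors, squared Euclidean distance = 2 * cosine distance.
unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
Y_euc = dist.pdist(unit, metric="euclidean")
labels_euc = hier.fcluster(hier.linkage(Y_euc, "average"), t=2,
                           criterion="maxclust")

# Cophenetic correlation: closer to 1 means the tree reflects the distances better.
c_cos, _ = hier.cophenet(Z_cos, Y_cos)

print("cosine labels:       ", labels_cos)
print("fclusterdata labels: ", labels_fcd)
print("unit + euclidean:    ", labels_euc)
print("cophenetic corr: %.3f" % c_cos)
```

Comparing the cophenetic correlation across metrics and linkage methods is only one rough check; it says how faithful the tree is to the distances, not whether the distances themselves suit the documents.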

#!/usr/bin/env python
""" cluster cities: pdist linkage fcluster plot
    util: clusters() avdist()
"""

from __future__ import division
import sys
import numpy as np
import scipy.cluster.hierarchy as hier  # $scipy/cluster/hierarchy.py
import scipy.spatial.distance as dist
import pylab as pl
from citiesin import citiesin  # 1000 US cities (local helper module, not included here)

__date__ = "27may 2010 denis"

def clusterlists(T):
    """ T = hier.fcluster( Z, t ) e.g. [a b a b a c]
        -> [ [0 2 4] [1 3] [5] ] sorted by len
    """
    clists = [ [] for j in range( max(T) + 1 )]
    for j, c in enumerate(T):
        clists[c].append( j )
    clists.sort( key=len, reverse=True )
    return clists[:-1]  # clip the []

def avdist( X, to=None ):
    """ av dist X vecs to "to", None: mean(X) """
    if to is None:
        to = np.mean( X, axis=0 )
    return np.mean( dist.cdist( X, [to] ))

#...............................................................................
Ndata = 100
method = "average"
t = 0
crit = "maxclust"
    # 'maxclust': Finds a minimum threshold `r` so that the cophenetic distance
    # between any two original observations in the same flat cluster
    # is no more than `r` and no more than `t` flat clusters are formed.
    # but t affects cluster sizes only weakly ?
    # t 25: [10, 9, 8, 7, 6
    # t 20: [12, 11, 10, 9, 7
plot = 0
seed = 1

exec( "\n".join( sys.argv[1:] ))  # command-line overrides: Ndata= t= ...
np.random.seed(seed)
np.set_printoptions( 2, threshold=100, edgeitems=10, suppress=True )  # .2f
me = __file__.split('/') [-1]

    # biggest US cities --
cities = np.array( citiesin( n=Ndata )[0] )  # N,2

if t == 0:  t = Ndata // 4

#...............................................................................
print( "# %s  Ndata=%d  t=%d  method=%s  crit=%s " % (me, Ndata, t, method, crit) )

Y = dist.pdist( cities )  # n*(n-1) / 2
Z = hier.linkage( Y, method )  # n-1
T = hier.fcluster( Z, t, criterion=crit )  # n

clusters = clusterlists(T)
print( "cluster sizes:", [len(c) for c in clusters] )
print( "# average distance to centre in the biggest clusters:" )
for c in clusters:
    if len(c) < len(clusters[0]) // 3:  break
    cit = cities[c].T
    print( "%.2g %s" % (avdist(cit.T), cit) )
    if plot:
        pl.plot( cit[0], cit[1] )

if plot:
    pl.title( "scipy.cluster.hierarchy of %d US cities, %s t=%d" % (
        Ndata, crit, t) )
    pl.grid(False)
    if plot >= 2:
        pl.savefig( "cities-%d-%d.png" % (Ndata, t), dpi=80 )
    pl.show()
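
Since the script above needs a `citiesin` module, here is a self-contained sketch of the same idea on synthetic data. It builds the linkage once, sweeps the `t` passed to `fcluster`, and reports the mean distance from each point to the centre of its own cluster. The helper names `mean_dist_to_centre` and `sweep`, and the three-blob data set, are made up for illustration.

```python
import numpy as np
import scipy.cluster.hierarchy as hier
import scipy.spatial.distance as dist


def mean_dist_to_centre(X, labels):
    """Mean Euclidean distance from each point to the centroid of its own cluster."""
    total = 0.0
    for lab in np.unique(labels):
        members = X[labels == lab]
        centre = members.mean(axis=0, keepdims=True)  # shape (1, n_features)
        total += dist.cdist(members, centre).sum()
    return total / len(X)


def sweep(X, ts, method="average", criterion="maxclust"):
    """Return a list of (t, number of clusters, mean distance to centre)."""
    Z = hier.linkage(dist.pdist(X), method)  # build the tree only once
    rows = []
    for t in ts:
        labels = hier.fcluster(Z, t, criterion=criterion)
        rows.append((t, len(np.unique(labels)), mean_dist_to_centre(X, labels)))
    return rows


# Hypothetical data: three well-separated 2-D blobs of 30 points each.
rng = np.random.default_rng(1)
blob_centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in blob_centres])

results = sweep(X, ts=[1, 2, 3, 5, 10])
for t, k, d in results:
    print("t=%2d  clusters=%2d  mean dist to centre=%.3f" % (t, k, d))
```

With data like this, the distance should drop sharply up to the true number of blobs and only slowly after that, so a bend in the curve is a rough guide to a sensible threshold. The figure is a diagnostic to compare across thresholds, not a score to minimize, because more clusters will generally keep lowering it.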
