Define cluster centers manually

Question

When doing k-means cluster analysis, how do I manually define the cluster centers? For example, I want to say that my cluster centers are [1,2,3] and [3,4,5], and then cluster my vectors to these predefined centers.

Something like kmeans.cluster_centers_ = [[1,2,3],[3,4,5]]?

To work around my problem, this is what I do at the moment:

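from sklearn.cluster import KMeans

# vec holds the sentence vectors; one cluster is requested per vector
number_of_clusters = len(vec)
kmeans = KMeans(number_of_clusters, init='k-means++', n_init=100)
kmeans.fit(vec)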

It basically defines one cluster for each vector. But it takes ages to compute, as I have thousands of vectors/sentences. There must be an option to set the vector coordinates directly as cluster coordinates without having to compute them with the k-means algorithm (the center outputs are basically the vector coordinates after I run the algorithm anyway).

Edit, to be more specific about my task:

I have tons of vectors (generated from sentences) that I want to cluster. Imagine I have two columns of sentences and always want to match a column B sentence to a column A sentence, never column A sentences to each other. That's why I want to set the cluster centers to the column A vectors and afterwards assign each B vector to its closest center. Hope that makes sense.

I am using sklearn's KMeans at the moment.

Recommended answer

I think I know what you want to do. You want to manually select the centroids for k-means from some known examples and then perform the clustering so that the closest data points are assigned to your predefined centroids.

The parameter you are looking for is the k-means initialization, named init; see the documentation. Besides the string options, it accepts an array of initial centers of shape (n_clusters, n_features).

I have prepared a small example that does exactly this.

import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial import distance_matrix

# 5 datapoints with 3 features
data = [[1, 0, 0],
        [1, 0.2, 0],
        [0, 0, 1],
        [0, 0, 0.9],
        [1, 0, 0.1]]

X = np.array(data)

distance_matrix(X,X)

The pairwise distance matrix shows which examples are closest to each other.

> array([[0.        , 0.2       , 1.41421356, 1.3453624 , 0.1       ],
>       [0.2       , 0.        , 1.42828569, 1.36014705, 0.2236068 ],
>       [1.41421356, 1.42828569, 0.        , 0.1       , 1.3453624 ],
>       [1.3453624 , 1.36014705, 0.1       , 0.        , 1.28062485],
>       [0.1       , 0.2236068 , 1.3453624 , 1.28062485, 0.        ]])

You can select certain data points to be used as your initial centroids:

centroid_idx = [0,2] # let data point 0 and 2 be our centroids
centroids = X[centroid_idx,:]
print(centroids) # [[1. 0. 0.]
                 # [0. 0. 1.]]

kmeans = KMeans(n_clusters=2, init=centroids, n_init=1, max_iter=1) # run a single k-means iteration from the given centroids; note that fit still updates cluster_centers_ once

kmeans.fit(X)
kmeans.labels_

>>> array([0, 0, 1, 1, 0], dtype=int32)

As you can see, k-means labels the data points as expected. Omit the max_iter parameter if you want the centroids to keep updating until convergence.
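If the centers should stay exactly where you put them (the column A vectors in the question), one alternative is to skip fitting altogether and look up the nearest center for each vector directly. The following is only a sketch and not part of the original answer: the arrays centers_a and vectors_b are hypothetical stand-ins for the column A and column B vectors, and Euclidean distance is assumed because that is what k-means uses.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_argmin

# Hypothetical stand-ins for the column A vectors (the fixed centers)
centers_a = np.array([[1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0]])

# Hypothetical stand-ins for the column B vectors to be assigned
vectors_b = np.array([[1.0, 0.2, 0.0],
                      [0.0, 0.0, 0.9],
                      [1.0, 0.0, 0.1]])

# For each row of vectors_b, the index of the closest row in centers_a
# (Euclidean distance by default); no centers are recomputed
nearest_a = pairwise_distances_argmin(vectors_b, centers_a)
print(nearest_a)  # [0 1 0]
```

Because nothing is fitted, this does a single pass over the distances, which may matter when there are thousands of centers.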
