Partially define initial centroids for scikit-learn K-Means clustering


Question

The scikit-learn documentation states that:

Method for initialization:

'k-means++': selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details.

If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

My data has 10 (predicted) clusters and 7 features. However, I would like to pass an array of shape 10 by 6, i.e. I want to predefine 6 dimensions of each centroid myself and let the 7th dimension be chosen freely by k-means++. (In other words, I do not want to specify the full initial centroids, but rather control 6 dimensions and leave only one dimension free to vary in the initial clusters.)

I tried passing a 10x6 array, hoping it would work, but it just raises an error.
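
For reference, the failure described above can be reproduced with a small sketch like the following. The data and centroids are random placeholders (an assumption, since the actual data is not shown), and `n_init=1` is set only because a single explicit initialization is being supplied. A full 10x7 array is accepted, while the 10x6 array is rejected because its number of columns does not match the number of features in the data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 7))            # placeholder data: 1000 samples, 7 features
full_init = rng.normal(size=(10, 7))      # shape (n_clusters, n_features): accepted
partial_init = rng.normal(size=(10, 6))   # one column short: rejected

# A full-shape init array works
km_ok = KMeans(n_clusters=10, init=full_init, n_init=1).fit(X)

# The partial init array fails shape validation during fit()
try:
    KMeans(n_clusters=10, init=partial_init, n_init=1).fit(X)
    shape_error = None
except ValueError as exc:
    shape_error = exc

print(type(shape_error).__name__)
```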

Recommended answer

Sklearn does not allow this kind of fine-grained control over the initialization.

The only option is to supply a value for the 7th feature yourself, either a random one or one close to what k-means++ would have produced.

So basically you can estimate a good value for it as follows:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

nb_clust = 10
# your data (random placeholder here)
X = np.random.randn(1000, 7)

# your 6-column partial centroids (random placeholder here)
cent_6cols = np.random.randn(nb_clust, 6)

# assign each point to its nearest partial centroid, using only the
# first 6 features (setting cluster_centers_ on an unfitted KMeans and
# calling predict() does not work reliably across scikit-learn versions)
initial_prediction = pairwise_distances_argmin(X[:, 0:6], cent_6cols)

# for the 7th column, use the average 7th-feature value of the points
# assigned to each partial centroid
cent_7cols = np.zeros((nb_clust, 7))
cent_7cols[:, 0:6] = cent_6cols
for i in range(nb_clust):
    members = initial_prediction == i
    # fall back to the global mean if no point was assigned to this centroid
    init_7th = X[members, 6].mean() if members.any() else X[:, 6].mean()
    cent_7cols[i, 6] = init_7th

# the 7th column is now initialized from the data, so cent_7cols
# can be used as the full set of initial centroids
truekm = KMeans(n_clusters=nb_clust, init=cent_7cols, n_init=1)
truekm.fit(X)
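
The other option mentioned in the answer, a random value for the 7th feature, could be sketched roughly as below. This is only an illustration under assumptions: the data and partial centroids are random placeholders, and drawing the 7th coordinate from observed values of the 7th feature is one possible choice of "random", not something the answer prescribes.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_clusters = 10
X = rng.normal(size=(1000, 7))                  # placeholder data
cent_6cols = rng.normal(size=(n_clusters, 6))   # placeholder partial centroids

# Assumption: sample the 7th coordinate from values actually observed
# in the 7th feature, so the starting points stay within the data range
seventh = rng.choice(X[:, 6], size=n_clusters, replace=False)

# Append the sampled column to obtain a full (n_clusters, n_features) array
init_random = np.column_stack([cent_6cols, seventh])

km_random = KMeans(n_clusters=n_clusters, init=init_random, n_init=1).fit(X)
print(init_random.shape, km_random.cluster_centers_.shape)
```

Note that either way the array only sets the starting point: once `fit` runs, all 7 coordinates of each centroid are updated, including the 6 that were predefined.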
