K-means: Initial centers are not distinct


Problem description

I am using the GA package, and my aim is to find the optimal initial centroid positions for the k-means clustering algorithm. My data is a sparse matrix of words with TF-IDF scores and is downloadable here. Below are the stages I have implemented so far:

0. Libraries and dataset

library(clusterSim)           ## for index.DB()
library(GA)                   ## for ga() 

corpus <- read.csv("Corpus_EnglishMalay_tfidf.csv")     ## a dataset of 5000 x 1168

1. Binary encoding and generating the initial population.

k_min <- 15

initial_population <- function(object) {
    ## generate a population to turn-on 15 cluster bits
    init <- t(replicate(object@popSize, sample(rep(c(1, 0), c(k_min, object@nBits - k_min))), TRUE))
    return(init)
}

2. The fitness function minimizes the Davies-Bouldin (DB) index. I evaluate the DBI for each solution generated from initial_population. Because ga() maximizes fitness, the function returns the negated index.

DBI2 <- function(x) {
    ## x is a vector of solution of nBits 
    ## exclude first column of corpus
    initial_centroid <- corpus[x==1, -1]
    cl <- kmeans(corpus[-1], initial_centroid)
    dbi <- index.DB(corpus[-1], cl=cl$cluster, centrotypes = "centroids")
    score <- -dbi$DB
    return(score) 
}

3. Running the GA with these settings.

g2<- ga(type = "binary", 
    fitness = DBI2, 
    population = initial_population,
    selection = ga_rwSelection,
    crossover = gabin_spCrossover,
    pcrossover = 0.8,
    pmutation = 0.1,
    popSize = 100, 
    nBits = nrow(corpus),
    seed = 123)

4. The problem. Error in kmeans(corpus[-1], initial_centroid) : initial centers are not distinct.

I found a similar problem here, where the user also had to use a parameter to pass in the number of clusters dynamically. It was solved by hard-coding the number of clusters. In my case, however, I really need to pass in the number of clusters dynamically, since it comes from a randomly generated binary vector in which the 1s represent the initial centroids.

Checking the kmeans() source code, I noticed that the error is raised when the centers are duplicated:

if(any(duplicated(centers)))
        stop("initial centers are not distinct")
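
For reference, duplicated() applied to a matrix compares whole rows by their contents, not by their row names or indices. The following is a minimal sketch on made-up toy data (the matrix m and its row names are hypothetical, not taken from the corpus) showing that two rows with different names can still trip this check:

```r
## Toy data (hypothetical): rows "a" and "c" have different names
## but identical contents.
m <- rbind(a = c(0, 0, 1),
           b = c(0, 2, 0),
           c = c(0, 0, 1))

## For a matrix, duplicated() flags rows whose contents repeat an earlier row
dup_rows <- as.vector(duplicated(m))
print(dup_rows)
print(rownames(m)[dup_rows])

## Passing both rows as initial centers makes kmeans() stop;
## tryCatch() captures the error message instead of aborting
res <- tryCatch(kmeans(m, centers = m[c("a", "c"), ]),
                error = function(e) conditionMessage(e))
print(res)
```

In a sparse TF-IDF matrix, rows with identical contents (for example, all-zero rows) are at least plausible, so distinct row indices alone may not rule out this error.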

I edited the kmeans function with trace to print out the duplicated centers. The output:

 [1] "206"  "520"  "564"  "1803" "2059" "2163" "2652" "2702" "3195" "3206" "3254" "3362" "3375"
[14] "4063" "4186"

This shows no duplication in the randomly selected initial centroids, and I have no idea why this error keeps occurring. Is there anything else that would lead to this error?

P.S.: I do understand that some may suggest GA + K-means is not a good idea, but I do hope to finish what I have started. It is better to view this as a K-means problem (at least as far as solving the initial centers are not distinct error goes).

Recommended answer

Genetic algorithms are not well suited to optimizing k-means, by the nature of the problem: the initialization seeds interact too much, so a GA will not do better than taking a random sample of all possible seeds.
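
As a rough illustration of the random-sampling alternative, kmeans() can draw the seeds itself when centers is given as a number, and its nstart argument repeats that draw several times and keeps the best run. This is only a sketch on synthetic data (the matrix x and the choices k = 3 and nstart = 25 are assumptions for illustration), and note that kmeans() judges "best" by the total within-cluster sum of squares, not by the Davies-Bouldin index used in the question:

```r
set.seed(123)

## Toy data (hypothetical): 60 points in 2 dimensions, three loose groups
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2),
           matrix(rnorm(40, mean = 8), ncol = 2))

k <- 3

## centers = k lets kmeans() pick k rows at random as seeds;
## nstart = 25 tries 25 random seed sets and keeps the best result
fit <- kmeans(x, centers = k, nstart = 25)

print(fit$tot.withinss)
print(table(fit$cluster))
```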

So my main advice is to not use genetic algorithms at all here!

If you insist, what you need to do is detect the bad parameters and simply return a bad score for a bad initialization, so that it doesn't "survive".
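
One way this could look is sketched below. The helper name safe_fitness, the penalty value, and the toy data are all hypothetical, and the score uses the within-cluster sum of squares as a stand-in so the example runs with base R only; in the question's setting the last line would instead return -index.DB(...)$DB from clusterSim. The guard treats a chromosome as bad when it selects fewer than two seeds or when the selected rows have identical contents, and it also catches any other kmeans() error:

```r
## Hypothetical helper: score a binary chromosome, penalizing unusable seeds.
safe_fitness <- function(bits, data, penalty = -1e6) {
    data <- as.matrix(data)
    centers <- data[bits == 1, , drop = FALSE]

    ## Bad parameters: fewer than 2 seeds, or seeds with identical contents
    if (nrow(centers) < 2 || anyDuplicated(centers) > 0) {
        return(penalty)
    }

    ## Any remaining kmeans() failure is also treated as a bad initialization
    cl <- tryCatch(kmeans(data, centers), error = function(e) NULL)
    if (is.null(cl)) {
        return(penalty)
    }

    ## Stand-in score (assumption); the question uses the negated DB index
    -cl$tot.withinss
}

## Toy data (made up): rows 1 and 3 are identical
toy <- rbind(c(0, 0), c(5, 5), c(0, 0), c(9, 1), c(1, 9))

bad_score  <- safe_fitness(c(1, 0, 1, 0, 0), toy)   # duplicated seeds
good_score <- safe_fitness(c(1, 1, 0, 1, 0), toy)   # distinct seeds
print(bad_score)
print(good_score)
```

Since ga() maximizes fitness, a large negative penalty keeps such chromosomes from being selected, while the run itself continues instead of stopping on the error.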
