Optimizing K-means clustering using a Genetic Algorithm
Question
I have the following dataset (obtained from here):
  item          survivalpoints  weight
1 pocketknife               10       1
2 beans                     20       5
3 potatoes                  15      10
4 unions                     2       1
5 sleeping bag              30       7
6 rope                      10       5
7 compass                   30       1
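For anyone who wants to reproduce the examples without the original download, here is a minimal sketch of how the table could be entered by hand. It assumes the column names match the table header and that the object is called `dataset`, as in the code that follows.

```r
## Rebuild the table by hand; names and values are copied from the rows above.
dataset <- data.frame(
  item = c("pocketknife", "beans", "potatoes", "unions",
           "sleeping bag", "rope", "compass"),
  survivalpoints = c(10, 20, 15, 2, 30, 10, 30),
  weight = c(1, 5, 10, 1, 7, 5, 1)
)

## Quick look at the structure: one character column, two numeric columns.
str(dataset)
```

Only the two numeric columns are passed to kmeans(), which is why the later calls drop the first column with `dataset[, -1]`.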
I can cluster this dataset into three clusters with kmeans(), using a binary string to mark my initial choice of centers. For example:
## 1 marks a row used as an initial center
chromosome <- c(1, 1, 1, 0, 0, 0, 0)
## exclude the first column (kmeans only supports continuous data)
cl <- kmeans(dataset[, -1], dataset[chromosome == 1, -1])
## check the memberships
cl$cluster
# [1] 1 3 3 1 2 1 2
Building on this basic idea, I tried using the GA package to run the search, with the goal of minimizing the Davies-Bouldin (DB) index:
library(GA)         ## for ga() function
library(clusterSim) ## for index.DB() function
## defining my fitness function (Davies-Bouldin)
DBI <- function(x) {
  ## converting matrix to vector to access each row
  binary_rep <- split(x, row(x))
  ## evaluate the fitness of each chromosome
  for (each in 1:nrow(x)) {
    cl <- kmeans(dataset, dataset[binary_rep[[each]] == 1, -1])
    dbi <- index.DB(dataset, cl$cluster, centrotypes = "centroids")
    ## minimizing db
    return(-dbi)
  }
}
g <- ga(type = "binary", fitness = DBI, popSize = 100, nBits = nrow(dataset))
Unsurprisingly (I have no idea what is happening), I got the following error message:
Warning messages:
Error in row(x) : a matrix-like object is required as argument to 'row'
Here are my questions:
- How can I correctly use the GA package to solve my problem?
- How can I make sure that the randomly generated chromosomes contain the same number of 1s as k, the number of clusters (e.g. if k=3, the chromosome must contain exactly three 1s)?
Recommended answer
I can't comment on whether combining k-means with a GA makes sense, but I can point out that your fitness function has a problem: ga() passes one chromosome at a time to the fitness function as a plain numeric vector, so row(x) has no matrix to work on and no loop over rows is needed. In addition, errors occur when all genes are on or all are off, so the fitness is only computed when neither is the case:
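DBI <- function(x) {
  ## x is a single chromosome (a 0/1 vector with one gene per row of dataset)
  if (sum(x) == nrow(dataset) || sum(x) == 0) {
    ## all genes on or all off: skip clustering and assign a score of 0
    score <- 0
  } else {
    ## rows flagged with 1 are the initial centers
    cl <- kmeans(dataset[, -1], dataset[x == 1, -1])
    dbi <- index.DB(dataset[, -1], cl = cl$cluster, centrotypes = "centroids")
    score <- dbi$DB
  }
  return(score)
}
g <- ga(type = "binary", fitness = DBI, popSize = 100, nBits = nrow(dataset))
plot(g)
g@solution
g@fitnessValue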
It looks like several gene combinations produce the same "best" fitness value.
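The second question (forcing exactly k ones) is not covered by the function above. Also, ga() maximizes its fitness function, so a DB index that should be minimized would need to be negated. The following is only a sketch of one possible approach, not a definitive implementation: chromosomes whose number of 1s differs from k receive a large penalty, and valid ones are scored by the negated DB index. The name `DBI_k`, the choice `k <- 3`, the penalty value `-1e6`, and the hand-built `dataset` are all assumptions made for illustration.

```r
## Assumed setup: the dataset from the question, rebuilt by hand.
dataset <- data.frame(
  item = c("pocketknife", "beans", "potatoes", "unions",
           "sleeping bag", "rope", "compass"),
  survivalpoints = c(10, 20, 15, 2, 30, 10, 30),
  weight = c(1, 5, 10, 1, 7, 5, 1)
)
k <- 3  ## assumed number of clusters

## Hypothetical fitness function: penalize chromosomes that do not have
## exactly k ones; otherwise return the negated Davies-Bouldin index,
## because ga() maximizes fitness.
DBI_k <- function(x) {
  if (sum(x) != k) return(-1e6)   ## arbitrary large penalty (assumption)
  num <- dataset[, -1]            ## numeric columns only
  cl <- kmeans(num, num[x == 1, ])
  -clusterSim::index.DB(num, cl$cluster, centrotypes = "centroids")$DB
}

## Run the search only if both packages are installed.
if (requireNamespace("GA", quietly = TRUE) &&
    requireNamespace("clusterSim", quietly = TRUE)) {
  g <- GA::ga(type = "binary", fitness = DBI_k, popSize = 100,
              nBits = nrow(dataset), monitor = FALSE)
  print(g@solution)       ## best chromosome(s) found
  print(-g@fitnessValue)  ## DB index of the best chromosome found
}
```

A penalty is the simplest way to express the constraint, since the default binary operators can still produce chromosomes with the wrong number of 1s. Custom population, crossover, and mutation functions that preserve the count would be an alternative, at the cost of more code.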