K-Means: Lloyd, Forgy, MacQueen, Hartigan-Wong

Question

I'm working with the k-means algorithm in R, and I want to figure out the differences between the four algorithms Lloyd, Forgy, MacQueen, and Hartigan-Wong that are available for the kmeans() function in the stats package.

However, I was not able to get a sufficient answer to this question.

I only found sparse information (see http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/K-Means).

From this description, Lloyd, Forgy, and Hartigan-Wong seem the same to me: minimizing the within-cluster sum of squares and minimizing the Euclidean distance amount to the same thing.

If I'm right, MacQueen differs in that it updates both of the clusters involved whenever an object is moved to another cluster.

Nevertheless, I still do not see in which respects these algorithms differ.

Answer

R provides Lloyd's algorithm as an option to kmeans(); the default algorithm, by Hartigan and Wong (1979), is much smarter. Like MacQueen's algorithm (MacQueen, 1967), it updates the centroids any time a point is moved; it also makes clever (time-saving) choices in checking for the closest cluster. On the other hand, Lloyd's k-means algorithm is the first and simplest of all these clustering algorithms.
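For a quick hands-on look (a minimal sketch; the toy data and the names xx and algs are made up for illustration and are not part of the original answer), the four variants can be run through the same call and compared on their total within-cluster sum of squares:

# run the same data through all four kmeans() algorithms and compare the
# total within-cluster sum of squares (lower is better for a fixed k)
set.seed(1)
xx <- rbind(cbind(rnorm(100), rnorm(100)),
            cbind(rnorm(100, 3), rnorm(100, 3)))
algs <- c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")
sapply(algs, function(a) kmeans(xx, centers = 2, nstart = 5, algorithm = a)$tot.withinss)

Note that ?kmeans documents "Lloyd" and "Forgy" as alternative names for one and the same algorithm in this implementation.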

Lloyd's algorithm (Lloyd, 1957) takes a set of observations or cases (think: rows of an n x p matrix, or points in R^p) and clusters them into k groups. It tries to minimize the within-cluster sum of squares
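The formula itself appeared as an image in the original answer; written out in LaTeX notation, the objective is the standard within-cluster sum of squares,

\operatorname*{arg\,min}_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - u_i \rVert^2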

where u_i is the mean of all the points in cluster S_i. The algorithm proceeds as follows (I'll spare you the formality of the exhaustive notation):
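The individual steps were likewise shown as images in the original; as a rough stand-in, here is a minimal R sketch of the two alternating steps (illustrative only: it is not the stats::kmeans source, it does not handle empty clusters, and lloyd_sketch is a made-up name):

# minimal sketch of Lloyd's iteration: assign each point to its nearest
# centroid, then recompute each centroid as the mean of its points, and repeat
lloyd_sketch <- function(x, k, iter.max = 10) {
  centers <- x[sample(nrow(x), k), , drop = FALSE]       # random initial centroids
  for (iter in seq_len(iter.max)) {
    # assignment step: squared Euclidean distance from every point to every centroid
    d2 <- sapply(seq_len(k), function(i) colSums((t(x) - centers[i, ])^2))
    cluster <- max.col(-d2)                               # index of the nearest centroid
    # update step: each centroid becomes the mean of the points assigned to it
    new.centers <- t(sapply(seq_len(k), function(i)
      colMeans(x[cluster == i, , drop = FALSE])))
    if (all(abs(new.centers - centers) < 1e-8)) break     # no movement: converged
    centers <- new.centers
  }
  list(cluster = cluster, centers = centers)
}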

There is a problem with R's implementation, however, and the problem arises when considering multiple starting points. I should note that it's generally prudent to consider several different starting points, because the algorithm is guaranteed to converge, but is not guaranteed to converge to a global optimum. This is particularly true for large, high-dimensional problems. I'll start with a simple example (large, but not particularly difficult).

(Here I will paste some images, as we cannot write mathematical formulas with LaTeX.)

Note that the solution is very similar to the one achieved earlier, although the ordering of the clusters is arbitrary. More importantly, the job took only 0.199 seconds in parallel! Surely this is too good to be true: using 3 processor cores should, at best, take one third of the time of our first (sequential) run. Is this a problem? It sounds like a free lunch. There is no problem with a free lunch once in a while, is there?

This doesn't always work with R functions, but sometimes we have a chance to look directly at the code. This is one of those times. I'm going to put this code into a file, mykmeans.R, and edit it by hand, inserting cat() statements in various places. Here's a clever way to do this, using sink() (although this doesn't seem to work in Sweave, it will work interactively):

> sink("mykmeans.R")
> kmeans
> sink()

Now edit the file, changing the function name and adding cat() statements. Note that you also have to delete a trailing line.
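As a side note that is not in the original answer, the copy-and-rename step can also be scripted; a small sketch (the cat() tracing statements would still be inserted into mykmeans.R by hand):

# write a renamed copy of kmeans() to a file, without the printed
# <bytecode>/<environment> footer that sink() would capture
src <- deparse(stats::kmeans)
src[1] <- paste("mykmeans <-", src[1])          # give the copy its new name
writeLines(src, "mykmeans.R")                   # now add cat() statements by hand
source("mykmeans.R")
environment(mykmeans) <- asNamespace("stats")   # so internal stats symbols still resolve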

We can then repeat our explorations, but using mykmeans():

> source("mykmeans.R")
> start.kmeans <- proc.time()[3]
> ans.kmeans <- mykmeans(x, 4, nstart = 3, iter.max = 10, algorithm = "Lloyd")
JJJ statement 1: 0 elapsed time.
JJJ statement 5: 2.424 elapsed time.
JJJ statement 6: 2.425 elapsed time.
JJJ statement 7: 2.52 elapsed time.
JJJ statement 6: 2.52 elapsed time.
JJJ statement 7: 2.563 elapsed time.

Now we're in business: most of the time was consumed before statement 5 (I knew this of course, which is why statement 5 was 5 rather than 2)... You can keep on playing with it.

Here is the code:

#######################################################################
# kmeans()

# simulate N two-dimensional points drawn from four Gaussian clusters
# (every 4th row of x belongs to the same component)
N <- 100000
x <- matrix(0, N, 2)
x[seq(1,N,by=4),] <- rnorm(N/2)
x[seq(2,N,by=4),] <- rnorm(N/2, 3, 1)
x[seq(3,N,by=4),] <- rnorm(N/2, -3, 1)
x[seq(4,N,by=4),1] <- rnorm(N/4, 2, 1)
x[seq(4,N,by=4),2] <- rnorm(N/4, -2.5, 1)
# time a sequential run: 3 random starts of Lloyd's algorithm
start.kmeans <- proc.time()[3]
ans.kmeans <- kmeans(x, 4, nstart=3, iter.max=10, algorithm="Lloyd")
ans.kmeans$centers
end.kmeans <- proc.time()[3]
end.kmeans - start.kmeans

# plot a random sample of 10,000 points with the fitted centers overlaid
these <- sample(1:nrow(x), 10000)
plot(x[these,1], x[these,2], pch=".")
points(ans.kmeans$centers, pch=19, cex=2, col=1:4)

# parallel version: three single-start runs, one per core, via foreach/doMC
library(foreach)
library(doMC)
registerDoMC(3)
start.kmeans <- proc.time()[3]
ans.kmeans.par <- foreach(i=1:3) %dopar% {
  return(kmeans(x, 4, nstart=1, iter.max=10, algorithm="Lloyd"))
}
# keep the run with the smallest total within-cluster sum of squares
TSS <- sapply(ans.kmeans.par, function(a) return(sum(a$withinss)))
ans.kmeans.par <- ans.kmeans.par[[which.min(TSS)]]
ans.kmeans.par$centers
end.kmeans <- proc.time()[3]
end.kmeans - start.kmeans

sink("mykmeans.Rfake")
kmeans
sink()

source("mykmeans.R")
start.kmeans <- proc.time()[3]
ans.kmeans <- mykmeans(x, 4, nstart=3, iter.max=10, algorithm="Lloyd")
ans.kmeans$centers
end.kmeans <- proc.time()[3]
end.kmeans - start.kmeans

#######################################################################
# Diving

# read the Diving2000 data and summarize its variables with YaleToolkit::whatis()
x <- read.csv("Diving2000.csv", header=TRUE, as.is=TRUE)
library(YaleToolkit)
whatis(x)

x[1:14,c(3,6:9)]

# panel mean: average the 7 judges' scores for each dive and replicate the
# result so it lines up with the original rows of x
meancol <- function(scores) {
  temp <- matrix(scores, length(scores)/7, ncol=7)  # reshape: 7 scores per row
  means <- apply(temp, 1, mean)                     # mean of each group of 7
  ans <- rep(means,7)                               # replicate to the original length
  return(ans)
}
x$panelmean <- meancol(x$JScore)

x[1:14,c(3,6:9,11)]

# the same function with browser() inserted, so each call can be stepped
# through interactively for debugging
meancol <- function(scores) {
  browser()
  temp <- matrix(scores, length(scores)/7, ncol=7)
  means <- apply(temp, 1, mean)
  ans <- rep(means,7)
  return(ans)
}

x$panelmean <- meancol(x$JScore)

Here is the description of the dataset:

Number of cases: 10,787 scores from 1,541 dives (7 judges score each
dive) performed in four events at the 2000 Olympic Games in Sydney,
Australia.

Number of variables: 10.

Description: A full description and analysis is available in an
article in The American Statistician (publication details to be
announced).

Variables:

Event       Four events, men's and women's 3M and 10m.
Round       Preliminary, semifinal, and final rounds.
Diver       The name of the diver.
Country     The country of the diver.
Rank        The final rank of the diver in the event.
DiveNo      The number of the dive in sequence within round.
Difficulty  The degree of difficulty of the dive.
JScore      The score provided by the judge on this dive.
Judge       The name of the judge.
JCountry    The country of the judge.

And here is the dataset to experiment with: https://www.dropbox.com/s/urgzagv0a22114n/Diving2000.csv
