My own K-means algorithm in R


Question

I am a beginner at R programming, and I am doing this exercise as an introduction to programming in R. I have written my own K-means implementation in R, but I have been stuck at one point for a while: I need the algorithm to converge, that is, to iterate until it finds the optimal center of each cluster.

This is the raw algorithm without iteration. It just takes k random data points from the whole data set as the initial centers.

# Pick k random rows of the data as the initial centers
Centroid_test=data[sample(nrow(data), k), ]
x = Centroid_test
y = data
# m[i, j] is the Euclidean distance from center i to data point j
m=apply(data,1,function(data)
  (apply(Centroid_test,1,function(Centroid_test,y)
    dist(rbind(Centroid_test,data)),data)))
colnames(m)=rownames(y)
# For each data point (column of m), the index of its nearest center
minByCol <- apply(m, MARGIN=2, FUN=which.min)
minByColdf=as.data.frame(minByCol)
MasterDataframe=data.frame(data,minByColdf)
# Sort the points by their assigned cluster (third column)
Sort_Master=MasterDataframe[ order(MasterDataframe[,3]), ]
res=data.frame(Sort_Master)
cen=Centroid_test
rownames(cen)=1:k
res
cen

So I have some cluster centers and the data points assigned to each cluster, but these are not the optimal centers. How can I find good centers?

My attempt is below. I know that I have to iterate the above code, say up to kmax times, until a condition is met that stops the iteration and thus gives the best clustering of the data:

for (n in 1:kmax){

  if (condition)
    break;
}

But how do I define the condition? After reading a bit about k-means, one idea was to find a center whose value is closest to the mean of its group. I wrote this bit of code:

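kn=1
# Points assigned to cluster kn (the third column of res holds the cluster index)
group=subset(res, res[,3] == kn)
mean(group$x)
mean(group$y)
# cen is a matrix, so index it by column name rather than with $
cen[kn,"x"]
cen[kn,"y"]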

But I do not know how to express "the closer to the mean, the better" in code. Another idea I found was to find the cluster that has the minimum distance from each point. I could not work out how to write this in code either.
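One way to put "how close is each center to the mean of its group" into code is to compute every group mean at once and measure how far each center sits from the mean of its own group. The following is only a sketch under assumptions: the names `pts`, `labels`, `centers_now`, and `tol` are made up for illustration and do not appear in the code above, and the toy data is chosen so the result is easy to check by hand.

```r
# Hypothetical toy data: 6 points in 2-D, already assigned to 2 clusters
pts <- cbind(x = c(0, 1, 2, 10, 11, 12), y = c(0, 1, 2, 10, 11, 12))
labels <- c(1, 1, 1, 2, 2, 2)

# Current centers: two of the data points, as in the question's setup
centers_now <- pts[c(1, 5), , drop = FALSE]

# Column means within each cluster: one row per cluster, ordered by label
group_means <- rowsum(pts, labels) / as.vector(table(labels))

# Euclidean distance between each center and the mean of its group
shift <- sqrt(rowSums((group_means - centers_now)^2))

# Assumed tolerance: treat the centers as settled when none of them moves
tol <- 1e-8
converged <- all(shift < tol)
```

Here the first center is still a distance of `sqrt(2)` away from its group mean, so `converged` is `FALSE`. A check like this could serve as the `condition` in the loop skeleton above: move each center to its group mean, reassign the points, and stop once `converged` becomes `TRUE`.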

If you could show me how or share an idea, that would be very helpful!

Thank you very much!

To clarify:

So, what I want is an algorithm that finds the optimal cluster centers with regard to the distance between each center and the points of its cluster. After reading more about k-means algorithms, I found that there are the Forgy/Lloyd algorithm, the MacQueen algorithm, and the Hartigan & Wong algorithm. Each one tries to find the optimal centers with a different approach.
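For comparison, base R's `stats::kmeans` exposes these variants through its `algorithm` argument (in R, "Lloyd" and "Forgy" are two names for the same algorithm). The sketch below is not part of the original question; the toy data, the choice of three centers, and the names `toy`, `algos`, and `fits` are assumptions for illustration.

```r
set.seed(42)
# Dummy data in the same spirit as the question's example
toy <- cbind(x = rnorm(200, 1.15), y = rnorm(200, 1.65))

algos <- c("Hartigan-Wong", "Lloyd", "MacQueen")
fits <- lapply(algos, function(a) {
  set.seed(42)  # so that every variant starts from the same random centers
  kmeans(toy, centers = 3, iter.max = 100, algorithm = a)
})
names(fits) <- algos

# Total within-cluster sum of squares reached by each variant
sapply(fits, function(f) f$tot.withinss)
```

Comparing `tot.withinss` across the variants gives a rough feel for how much the choice of algorithm matters on a given data set; a hand-written implementation can be sanity-checked against the same number.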

The above code picks random points as centers, then calculates how far each point is from each center, and assigns each point to the cluster of the center it is closest to. cen contains the center of each cluster, and res gives all the points together with the cluster each one is assigned to (that is what the third column is for).

My idea was first to calculate the distance from each point to the center of its cluster, once the points have been grouped into clusters, and save it to a data frame or something similar. The next step would be to do it all again: pick new random centers, assign the points to each center again to form the clusters, and finally calculate the distances between the points and the centers and save them again. In the end there would be a data frame or matrix holding many sets of distances (for example, after 100 iterations), and then we could find the centers that gave the smallest distances between each point and its cluster center. The centers with the minimal distance to their points would be the optimal cluster centers.
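The repeated-random-centers idea described above could be sketched roughly as follows. This is an illustration under assumptions, not the asker's code: the helper `total_cost`, the variables `toy`, `n_tries`, `costs`, `best_cost`, and `best_centers`, and the use of the summed squared distance as the score are all choices made here for the example.

```r
set.seed(7)
toy <- cbind(x = rnorm(100, 1.15), y = rnorm(100, 1.65))
k <- 3
n_tries <- 100

# Sum over all points of the squared distance to the nearest center
total_cost <- function(pts, ctrs) {
  d2 <- sapply(seq_len(nrow(ctrs)), function(j) {
    rowSums(sweep(pts, 2, ctrs[j, ])^2)  # squared distance to center j
  })
  sum(apply(d2, 1, min))
}

best_cost <- Inf
best_centers <- NULL
costs <- numeric(n_tries)
for (i in seq_len(n_tries)) {
  # Draw k random data points as candidate centers
  candidate <- toy[sample(nrow(toy), k), , drop = FALSE]
  costs[i] <- total_cost(toy, candidate)
  # Keep the candidate set with the smallest total cost seen so far
  if (costs[i] < best_cost) {
    best_cost <- costs[i]
    best_centers <- candidate
  }
}
```

Note that this only selects the best among randomly drawn data points. Moving each center to the mean of its assigned points within every try would usually lower the cost further, and `stats::kmeans` combines both ideas through its `nstart` argument.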

Dummy data:

y=rnorm(500,1.65)
x=rnorm(500,1.15)

data=cbind(x,y)

After running the above code, plot the data to see the cluster centers:

plot(data)
points(cen, pch=21,bg=23)

Recommended answer

A function for calculating the Euclidean distance:

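euclid <- function(points1, points2) {
  # One row per point in points1, one column per point in points2
  distanceMatrix <- matrix(NA, nrow=dim(points1)[1], ncol=dim(points2)[1])
  for(i in 1:nrow(points2)) {
    # Distance from every row of points1 to the i-th row of points2
    distanceMatrix[,i] <- sqrt(rowSums(t(t(points1)-points2[i,])^2))
  }
  distanceMatrix
}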

The K-means algorithm, using the Euclidean distance function above:

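K_means <- function(x, centers, distFun, nItter) {
  # One list slot per iteration for the assignments and the centers
  clusterHistory <- vector(nItter, mode="list")
  centerHistory <- vector(nItter, mode="list")

  # Runs a fixed number of iterations; there is no convergence check
  for(i in 1:nItter) {
    distsToCenters <- distFun(x, centers)
    # Assign each point to its nearest center
    clusters <- apply(distsToCenters, 1, which.min)
    # Move each center to the column-wise mean of its assigned points
    centers <- apply(x, 2, tapply, clusters, mean)
    # Saving history
    clusterHistory[[i]] <- clusters
    centerHistory[[i]] <- centers
  }

  list(clusters=clusterHistory, centers=centerHistory)
}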

Preparing the data:

test=data # A data.frame
ktest=as.matrix(test) # Turn into a matrix
centers <- ktest[sample(nrow(ktest), 5),] # Sample some centers, 5 for example

Result:

res <- K_means(ktest, centers, euclid, 10)
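The function above runs for a fixed number of iterations. To address the stopping condition asked about in the question, a variant could stop as soon as the centers no longer move. The block below is a self-contained sketch under assumptions, not the answerer's code: the function name `kmeans_until_stable`, its parameters `max_iter` and `tol`, and the choice to leave the center of an empty cluster where it was are all made up for this example.

```r
# Sketch: Lloyd-style iteration that stops once the centers stop moving
kmeans_until_stable <- function(pts, ctrs, max_iter = 100, tol = 1e-8) {
  for (iter in seq_len(max_iter)) {
    # Squared distance from every point (rows) to every center (columns)
    d2 <- sapply(seq_len(nrow(ctrs)), function(j) {
      rowSums(sweep(pts, 2, ctrs[j, ])^2)
    })
    # Index of the nearest center for each point
    cl <- max.col(-d2, ties.method = "first")
    new_ctrs <- ctrs
    for (j in unique(cl)) {
      # Only non-empty clusters are updated; empty ones keep their old center
      new_ctrs[j, ] <- colMeans(pts[cl == j, , drop = FALSE])
    }
    # Largest distance any center moved in this iteration
    moved <- max(sqrt(rowSums((new_ctrs - ctrs)^2)))
    ctrs <- new_ctrs
    if (moved < tol) break  # the stopping condition
  }
  list(clusters = cl, centers = ctrs, iterations = iter)
}

set.seed(123)
toy <- cbind(x = rnorm(300, 1.15), y = rnorm(300, 1.65))
start <- toy[sample(nrow(toy), 5), , drop = FALSE]
fit <- kmeans_until_stable(toy, start)
fit$iterations
fit$centers
```

Measuring how far the centers moved is one of several possible criteria; checking that no point changed cluster between two iterations would work as well. Either way, the result is a local optimum that depends on the starting centers.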
