Simple approach to assigning clusters for new data after k-means clustering

Question

I'm running k-means clustering on a data frame df1, and I'm looking for a simple way to compute the closest cluster center for each observation in a new data frame df2 (with the same variable names). Think of df1 as the training set and df2 as the testing set; I want to cluster on the training set and then assign each test point to the cluster whose center is closest.

I know how to do this with the apply function and a few simple user-defined functions (previous posts on the topic have usually proposed something similar):

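# Training data (df1) and new data (df2) with the same variable names
df1 <- data.frame(x=runif(100), y=runif(100))
df2 <- data.frame(x=runif(100), y=runif(100))
km <- kmeans(df1, centers=3)
# Return the index of the cluster center nearest to a single observation x
closest.cluster <- function(x) {
  # Euclidean distance from x to every row of km$centers
  cluster.dist <- apply(km$centers, 1, function(y) sqrt(sum((x-y)^2)))
  return(which.min(cluster.dist)[1])
}
# Apply the helper to every row of df2
clusters2 <- apply(df2, 1, closest.cluster)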

However, I'm preparing this clustering example for a course in which students will be unfamiliar with the apply function, so I would much prefer to assign the clusters to df2 with a built-in function. Are there any convenient built-in functions to find the closest cluster?

Recommended answer

You could use the flexclust package, which implements a predict method for k-means:

library("flexclust")
data("Nclus")

set.seed(1)
dat <- as.data.frame(Nclus)
ind <- sample(nrow(dat), 50)

dat[["train"]] <- TRUE
dat[["train"]][ind] <- FALSE

cl1 <- kcca(dat[dat[["train"]]==TRUE, 1:2], k=4, family=kccaFamily("kmeans"))
cl1
#
# call:
# kcca(x = dat[dat[["train"]] == TRUE, 1:2], k = 4)
#
# cluster sizes:
#
#  1   2   3   4 
#130 181  98  91 

pred_train <- predict(cl1)
pred_test <- predict(cl1, newdata=dat[dat[["train"]]==FALSE, 1:2])

image(cl1)
points(dat[dat[["train"]]==TRUE, 1:2], col=pred_train, pch=19, cex=0.3)
points(dat[dat[["train"]]==FALSE, 1:2], col=pred_test, pch=22, bg="orange")

There are also conversion methods to convert the results of cluster functions such as stats::kmeans or cluster::pam to objects of class kcca, and vice versa:

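# cl and x are not defined in this snippet: cl is assumed to be an existing
# clustering result (e.g. from stats::kmeans) and x the data it was fitted on
as.kcca(cl, data=x)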
# kcca object of family ‘kmeans’ 
#
# call:
# as.kcca(object = cl, data = x)
#
# cluster sizes:
#
#  1  2 
#  50 50 
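
Applied to the question's own setup, the conversion route might look like the sketch below. It is only an illustration under a few assumptions: the seed is arbitrary, the names km_kcca and clusters2 are chosen here for readability, and the as.matrix() calls are a defensive choice rather than something the answer states is required.

```r
library(flexclust)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df1 <- data.frame(x = runif(100), y = runif(100))  # training set
df2 <- data.frame(x = runif(100), y = runif(100))  # new data

# Fit k-means with base R, as in the question
km <- kmeans(df1, centers = 3)

# Convert the kmeans fit to a kcca object (km_kcca is an illustrative name);
# as.matrix() is a defensive choice, not a documented requirement
km_kcca <- as.kcca(km, data = as.matrix(df1))

# predict() assigns each row of df2 to its nearest cluster center
clusters2 <- predict(km_kcca, newdata = as.matrix(df2))

table(clusters2)  # number of new observations per cluster
```

This keeps the clustering itself in stats::kmeans, which students may already know, and uses flexclust only for the assignment step.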
