Assign new data point to cluster in kernel k-means (kernlab package in R)?
Question
I have a question about the kkmeans function in the kernlab package of R. I am new to this package, so please forgive me if I'm missing something obvious here.
I would like to assign a new data point to one of a set of clusters that were created using kernel k-means with the function 'kkmeans'. With regular clustering, one would do this by calculating the Euclidean distance between the new data point and the cluster centroids, and choosing the cluster with the closest centroid. In kernel k-means, one must do this in the feature space.
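For reference, the regular (non-kernel) case described above can be sketched in base R as follows. This is only an illustration: it uses stats::kmeans rather than kernlab, and the seed and the names km, d2 and nearest are arbitrary choices made for this sketch.

```r
# Regular k-means on the iris measurements (base R, stats::kmeans).
set.seed(1)  # arbitrary seed so the run is repeatable
X  <- as.matrix(iris[, -5])
km <- kmeans(X, centers = 3)

# New data point to assign.
x <- c(5.0, 3.6, 1.2, 0.4)

# Squared Euclidean distance from x to each centroid (one value per centroid).
d2 <- rowSums(sweep(km$centers, 2, x)^2)

# Index of the closest centroid.
nearest <- unname(which.min(d2))
```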
Take the example used in the kkmeans description:
library(kernlab)
data(iris)
sc <- kkmeans(as.matrix(iris[,-5]), centers=3)
Say that I have a new data point here, which I would like to assign to the closest cluster created above in sc.
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.0 3.6 1.2 0.4
Any tips on how to do this? Your help is much appreciated.
Recommended answer
Kernel k-means uses the kernel function to calculate the similarity of objects. In simple k-means you loop through all centroids and select the one that minimizes the distance (under the metric in use) to the given data point. With the kernel method (the default kernel function in kkmeans is the radial basis function, RBF), you simply loop through the centroids and select the one that maximizes the kernel function value (in the case of RBF) or minimizes the kernel-induced distance (for any kernel). A detailed description of converting a kernel to a distance measure is provided here. In general, the distance induced by a kernel K can be calculated through d^2(a,b) = K(a,a)+K(b,b)-2K(a,b), but since for RBF K(x,x)=1 for all x, you can just maximize K(a,b) instead of minimizing the whole K(a,a)+K(b,b)-2K(a,b).
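The step from the general formula to "just maximize K(a,b)" can be written out explicitly. The feature map \varphi and the width parameter \sigma are notation introduced here for illustration; the RBF form shown is the usual Gaussian kernel.

```latex
\begin{aligned}
d^2(a,b) &= \lVert \varphi(a) - \varphi(b) \rVert^2 = K(a,a) + K(b,b) - 2K(a,b), \\
K_{\mathrm{RBF}}(a,b) &= \exp\!\left(-\sigma \lVert a - b \rVert^2\right)
  \;\Rightarrow\; K_{\mathrm{RBF}}(x,x) = 1, \\
d^2_{\mathrm{RBF}}(a,b) &= 2 - 2\,K_{\mathrm{RBF}}(a,b).
\end{aligned}
```

Because the last line is strictly decreasing in K(a,b), the centroid with the largest kernel value is also the one with the smallest kernel-induced distance.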
To get the kernel function from a kkmeans object, you can use the kernelf function:
> data(iris)
> sc <- kkmeans(as.matrix(iris[,-5]), centers=3)
> K = kernelf(sc)
So, for your example:
> c=centers(sc)
> x=c(5.0, 3.6, 1.2, 0.4)
> K(x,c[1,])
[,1]
[1,] 1.303795e-11
> K(x,c[2,])
[,1]
[1,] 8.038534e-06
> K(x,c[3,])
[,1]
[1,] 0.8132268
> which.max( c( K(x,c[1,]), K(x,c[2,]), K(x,c[3,]) ) )
[1] 3
So the closest centroid is c[3,] = 5.032692 3.401923 1.598077 0.3115385, in the sense of the kernel function used.
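Note that the comparison above evaluates the kernel against the centers returned by centers(sc), which are points in the input space (c[3,] has the same four coordinates as the data). If the distance to each cluster's mean in feature space is wanted instead, it can be computed from kernel values alone: for a cluster with members x_1..x_n, d^2 = K(x,x) - (2/n) sum_i K(x,x_i) + (1/n^2) sum_ij K(x_i,x_j). The following is a minimal base-R sketch of that alternative, not the method used in the answer. The helper names (rbf, kernel_matrix, assign_feature_space) and sigma = 0.1 are hypothetical, and the iris species are used only as stand-in labels so the sketch runs without kernlab. With kernlab, one would presumably pass kernelf(sc) as the kernel and the cluster assignments stored in the kkmeans result as the labels.

```r
# Gaussian RBF kernel, exp(-sigma * ||a - b||^2); sigma = 0.1 is an arbitrary choice.
rbf <- function(a, b, sigma = 0.1) exp(-sigma * sum((a - b)^2))

# Kernel matrix between the rows of X and the rows of Y.
kernel_matrix <- function(X, Y, kernel) {
  outer(seq_len(nrow(X)), seq_len(nrow(Y)),
        Vectorize(function(i, j) kernel(X[i, ], Y[j, ])))
}

# Squared feature-space distance from x to the mean of each cluster,
# computed from kernel evaluations only.
assign_feature_space <- function(x, X, labels, kernel) {
  xm  <- matrix(x, nrow = 1)
  ids <- sort(unique(labels))
  d2  <- sapply(ids, function(cl) {
    M <- X[labels == cl, , drop = FALSE]
    kernel(x, x) -
      2 * mean(kernel_matrix(xm, M, kernel)) +  # (2/n) * sum_i K(x, x_i)
      mean(kernel_matrix(M, M, kernel))         # (1/n^2) * sum_ij K(x_i, x_j)
  })
  list(cluster = ids[which.min(d2)], dist2 = d2)
}

X      <- as.matrix(iris[, -5])
labels <- as.integer(iris$Species)  # stand-in for real cluster assignments
res    <- assign_feature_space(c(5.0, 3.6, 1.2, 0.4), X, labels, rbf)
res$cluster  # index of the cluster whose feature-space mean is closest
```

For kernels other than RBF, K(x,x) is not constant, but it is the same for every cluster, so it can be dropped when only the ranking of clusters matters.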