Outlier detection with k-means algorithm


Question

I am hoping you can help me with my problem. I am trying to detect outliers using the k-means algorithm. First I run the algorithm and flag as possible outliers those objects that lie far from their cluster center. Instead of the absolute distance, I want to use a relative distance, i.e. the ratio of an object's absolute distance to its cluster center to the average distance of all objects in that cluster to the same center. The code for outlier detection based on absolute distance is the following:

# remove species from the data to cluster
iris2 <- iris[,1:4]
kmeans.result <- kmeans(iris2, centers=3)
# cluster centers
kmeans.result$centers
# calculate distances between objects and cluster centers
centers <- kmeans.result$centers[kmeans.result$cluster, ]
distances <- sqrt(rowSums((iris2 - centers)^2))
# pick top 5 largest distances
outliers <- order(distances, decreasing=TRUE)[1:5]
# who are outliers
print(outliers)

But how can I use the relative distance instead of the absolute distance to find outliers?

Recommended answer

You just need to calculate the mean distance of the observations in each cluster from their cluster center. You already have those distances, so you only need to average them per cluster. The rest is simple indexed division:

# calculate mean distances by cluster:
m <- tapply(distances, kmeans.result$cluster,mean)

# divide each distance by the mean for its cluster:
d <- distances/(m[kmeans.result$cluster])

Your outliers:

> d[order(d, decreasing=TRUE)][1:5]
       2        3        3        1        3 
2.706694 2.485078 2.462511 2.388035 2.354807
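
The names printed above (2, 3, 3, 1, 3) appear to be cluster labels rather than row numbers, so they do not say which observations are the outliers. The following is a minimal, self-contained sketch of one way to recover the row indices as well. The `set.seed(42)` call, the `top5` variable and the summary data frame are illustrative additions, not part of the original answer, and because k-means starts from random centers the exact rows and ratios can differ from the output shown above.

```r
# Sketch: rank observations by relative distance and report their row indices.
set.seed(42)  # assumption: fixed seed so the clustering is reproducible
iris2 <- iris[, 1:4]
kmeans.result <- kmeans(iris2, centers = 3)

# absolute distance of every observation to its own cluster center
centers <- kmeans.result$centers[kmeans.result$cluster, ]
distances <- sqrt(rowSums((iris2 - centers)^2))

# mean distance per cluster, then relative distance per observation
m <- tapply(distances, kmeans.result$cluster, mean)
d <- as.vector(distances / m[kmeans.result$cluster])  # drop names to avoid confusion

# row indices of the 5 largest relative distances (hypothetical helper variable)
top5 <- order(d, decreasing = TRUE)[1:5]
print(data.frame(row = top5,
                 cluster = unname(kmeans.result$cluster[top5]),
                 relative = d[top5]))
```

By construction the relative distances average to 1 within each cluster, so a value of about 2.5 means the observation is roughly two and a half times farther from its center than a typical member of the same cluster.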
