Neighborhood calculations for outlier detection


Problem description


I'm using the R programming language, and I'm trying to understand the details of the following function used for outlier detection: https://rdrr.io/cran/dbscan/src/R/LOF.R

This function (from the "dbscan" library) uses the Local Outlier Factor (LOF) algorithm to calculate outliers: https://en.wikipedia.org/wiki/Local_outlier_factor.

The LOF algorithm is an unsupervised, distance-based algorithm that defines outliers in a dataset relative to the "reachability and neighborhood" of an observation. In general, observations that are not "very reachable" with respect to other observations in their neighborhood are considered to be "outliers". Based on these properties (the user specifies them, e.g. the neighborhood size, denoted by "k", could be 3), this algorithm assigns a LOF "score" to each point in the dataset. The bigger the LOF score for a given observation, the more that observation is considered to be an outlier.
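For intuition about what the score actually computes, here is a minimal from-scratch sketch of LOF in Python/numpy. This is not the dbscan implementation: it uses exactly k neighbors (ignoring distance ties), and the toy data is invented for illustration.

```python
import numpy as np

def lof_scores(X, k):
    """Minimal sketch of the Local Outlier Factor (Breunig et al., 2000).
    Unlike dbscan::lof(), this uses exactly k neighbors and ignores ties."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    knn = np.argsort(D, axis=1)[:, 1:k + 1]        # k nearest neighbors (skip self)
    k_dist = D[np.arange(n), knn[:, -1]]           # distance to each point's k-th neighbor
    # reachability distance: reach(a, b) = max(k_dist(b), d(a, b))
    reach = np.maximum(k_dist[knn], D[np.arange(n)[:, None], knn])
    lrd = 1.0 / reach.mean(axis=1)                 # local reachability density
    return lrd[knn].mean(axis=1) / lrd             # LOF = avg neighbor density / own density

# a tight cluster plus one far-away point (made-up data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(30, 2)), [[10.0, 10.0]]])
scores = lof_scores(X, k=3)
# the far-away point should get by far the largest LOF score
```

The last line of the score formula is the whole idea: a point whose own reachability density is much lower than its neighbors' gets a LOF well above 1.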

Now, I am trying to better understand some of the calculations taking place in the dbscan::lof() function.

1) The basic LOF algorithm can be run on some artificially created data like this:

```
# load the dbscan library
library(dbscan)
par(mfrow = c(1,2))
#generate data
n <- 100
x <- cbind(
  x=runif(10, 0, 5) + rnorm(n, sd=0.4),
  y=runif(10, 0, 5) + rnorm(n, sd=0.4)
  )

### calculate LOF score
lof <- lof(x, k=3)

### distribution of outlier factors
summary(lof)
hist(lof, breaks=10)

### point size is proportional to LOF
plot(x, pch = ".", main = "LOF (k=3)")
points(x, cex = (lof-1)*3, pch = 1, col="red")
```

My question is : Do larger values of "k" result in fewer outliers being identified (histogram is left-skewed), but those outliers that are identified are more "extreme" (i.e. bigger LOF scores)?

I observed this general pattern, but I am not sure if this trend is reflected in the LOF algorithms code. E.g.

```
# plot LOF results for different values of k
par(mfrow = c(2,2))

### calculate LOF score
lof <- lof(x, k=3)

### distribution of outlier factors
summary(lof)
hist(lof, main = "k = 3", breaks=10)

### calculate LOF score
lof <- lof(x, k=10)

### distribution of outlier factors
summary(lof)
hist(lof, main = "k = 10", breaks=10)

### calculate LOF score
lof <- lof(x, k=20)

### distribution of outlier factors
summary(lof)
hist(lof, main = "k = 20", breaks=10)

### calculate LOF score
lof <- lof(x, k=40)

### distribution of outlier factors
summary(lof)
hist(lof, main = "k = 40", breaks=10)
```

In the above plots, you can see that as the value of "k" increases, fewer outliers are identified. Is this correct?
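One way to probe this empirically outside of R is the sketch below (it assumes scikit-learn is available; the clustered data just imitates the runif/rnorm generation above, and the 1.5 cutoff is arbitrary, so results will vary with the data):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# imitate the R data generation: points scattered around 10 random centers
rng = np.random.default_rng(1)
centers = rng.uniform(0, 5, size=(10, 2))
X = centers[rng.integers(0, 10, size=100)] + rng.normal(scale=0.4, size=(100, 2))

for k in (3, 10, 20, 40):
    # sklearn stores the negative LOF; negate for scores comparable to dbscan::lof()
    scores = -LocalOutlierFactor(n_neighbors=k).fit(X).negative_outlier_factor_
    # how many points exceed an (arbitrary) cutoff, and how extreme the top score is
    print(k, int((scores > 1.5).sum()), round(float(scores.max()), 2))
```

Comparing the counts and top scores across k values gives a quick check of the "fewer but more extreme" intuition on any given dataset.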

2) Is there an "optimal" way to select a value of "k" for the LOF algorithm? Seeing how the LOF algorithm works, it does not seem to me that there is an "optimal" way to select a value of "k". It seems that you must rely on the logic described in 1):

  1. Bigger values of "k" result in fewer outliers being identified, but the outliers identified are more "extreme"

  2. Smaller values of "k" result in more outliers being identified, but the outliers identified are less "extreme"

Is this correct?

Solution

Regarding 2), I found this stackoverflow post over here: https://stats.stackexchange.com/questions/138675/choosing-a-k-value-for-local-outlier-factor-lof-detection-analysis :

" The authors of the paper recommend choosing a minimum k and a maximum k, and for each point, taking the maximum LOF value over each k in that range. They offer several guidelines for choosing the bounds."

This is my attempt at implementing the above logic in R code:

```
library(dbscan)

#generate data
n <- 100
x <- cbind(
  x=runif(10, 0, 5) + rnorm(n, sd=0.4),
  y=runif(10, 0, 5) + rnorm(n, sd=0.4)
  )

x = data.frame(x)

### calculate LOF scores for a range of different "k" values:
lof_10 <- lof(x, k=10)
lof_15 <- lof(x, k=15)
lof_20 <- lof(x, k=20)

#append these lof calculations to the original data set:
x$lof_10 = lof_10
x$lof_15 = lof_15
x$lof_20 = lof_20

#as the previous stackoverflow post suggests: for each row, choose the highest LOF value
x$max_lof = pmax(x$lof_10, x$lof_15, x$lof_20)

#view results:
head(x)
```

```
         x          y    lof_10    lof_15    lof_20   max_lof
1 2.443382  4.2611753 0.9803894 0.9866732 0.9841705 0.9866732
2 2.397454 -0.3732838 1.0527592 1.4638348 1.6008284 1.6008284
3 2.617348  3.0435179 0.9952212 0.9945580 0.9715819 0.9952212
4 3.731156  4.1668976 1.0339001 1.0802826 1.0921033 1.0921033
5 1.103123  1.6642337 1.1260092 1.0773444 1.0650159 1.1260092
6 2.735938  4.3737450 0.9939896 0.9573139 0.9700123 0.9939896
```

Therefore, the LOF score for each row is the value of the "max_lof" column.
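As a sanity check on the combine step, the same elementwise max can be sketched in Python/numpy (this mirrors R's pmax(); the score arrays below are small invented placeholders, not real LOF output):

```python
import numpy as np

# elementwise max over a range of k values, mirroring R's pmax()
lof_by_k = {
    10: np.array([0.98, 1.05, 0.99, 1.03]),
    15: np.array([0.99, 1.46, 0.99, 1.08]),
    20: np.array([0.98, 1.60, 0.97, 1.09]),
}
max_lof = np.maximum.reduce(list(lof_by_k.values()))
# each entry of max_lof is >= the corresponding entry of every input array
```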

Can someone please tell me if I have interpreted the previous stackoverflow post correctly? Have I also written the R code correctly?

Thanks

Note: I am still not sure about 1) from my initial question, i.e. do larger values of "k" result in fewer outliers being identified (histogram is left-skewed), but those outliers that are identified are more "extreme"?
