Recommended anomaly detection technique for a simple, one-dimensional scenario?

Question

I have a scenario with several thousand instances of data. Each instance is represented as a single integer value. I want to be able to detect when an instance is an extreme outlier.

For example, with the following sample data:

a = 10
b = 14
c = 25
d = 467
e = 12

d is clearly an anomaly, and I would want to perform a specific action based on this.

I was tempted to just use my knowledge of the particular domain to detect anomalies. For instance, I could work out a useful distance from the mean value and check for that, based on heuristics. However, I think it is probably better to investigate more general, robust anomaly detection techniques that have some theory behind them.

Since my working knowledge of mathematics is limited, I'm hoping to find a simple technique, such as one based on standard deviation. Hopefully the one-dimensional nature of the data makes this quite a common problem, but if more information about the scenario is required, please leave a comment and I will provide it.

I thought I'd add more information about the data and what I've tried, in case it makes one answer more correct than another.

The values are all positive and non-zero. I expect that the values will form a normal distribution. This expectation is based on an intuition of the domain rather than on analysis; if this is a bad assumption to make, please let me know. In terms of clustering, unless there are also standard algorithms for choosing a k-value, I would find it hard to provide this value to a k-means algorithm.

The action I want to take for an outlier/anomaly is to present it to the user and recommend that the data point be removed from the data set (I won't get into how they would do that, but it makes sense for my domain), so that it is not used as input to another function.

So far I have tried three-sigma and the IQR outlier test on my limited data set. IQR flags values that are not extreme enough, whereas three-sigma points out instances that better fit my intuition of the domain.

Information on algorithms and techniques, or links to resources for learning about this specific scenario, are valid and welcome answers.

What is a recommended anomaly detection technique for simple, one-dimensional data?

Recommended answer

Take a look at the three-sigma rule:

mu  = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std  THEN  x is outlier
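
The rule above could be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the function name `three_sigma_outliers`, the parameter `k`, and the use of the population standard deviation are assumptions, and the sample pads the question's values with hypothetical ones, because the five values from the question alone are too few for any single point to lie more than three standard deviations from the mean.

```python
from statistics import mean, pstdev


def three_sigma_outliers(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    mu = mean(values)
    std = pstdev(values)  # assumption: population standard deviation
    if std == 0:
        return []  # all values identical, nothing can be an outlier
    return [x for x in values if abs(x - mu) > k * std]


# Hypothetical sample: the question's values plus made-up "normal" readings.
data = [10, 14, 25, 12, 11, 13, 15, 12, 14, 16,
        13, 12, 11, 15, 14, 13, 12, 16, 14, 467]
print(three_sigma_outliers(data))  # [467]
```

Note that the outlier itself inflates both the mean and the standard deviation, so the rule becomes less sensitive as outliers get larger or more numerous.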

An alternative method is the IQR outlier test:

Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25         // inter-quartile range
IF (x < Q25 - 1.5*IQR) OR (Q75 + 1.5*IQR < x) THEN  x is a mild outlier
IF (x < Q25 - 3.0*IQR) OR (Q75 + 3.0*IQR < x) THEN  x is an extreme outlier
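
As a rough Python sketch of the same test, assuming the "inclusive" quartile method of `statistics.quantiles` (other quartile definitions give slightly different fences) and a hypothetical helper name `iqr_outliers`:

```python
from statistics import quantiles


def iqr_outliers(values):
    """Split outliers into (mild, extreme) lists using the IQR fences."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q25, _, q75 = quantiles(values, n=4, method="inclusive")
    iqr = q75 - q25  # inter-quartile range
    mild, extreme = [], []
    for x in values:
        if x < q25 - 3.0 * iqr or x > q75 + 3.0 * iqr:
            extreme.append(x)
        elif x < q25 - 1.5 * iqr or x > q75 + 1.5 * iqr:
            mild.append(x)  # beyond the 1.5*IQR fence only
    return mild, extreme


data = [10, 14, 25, 467, 12]  # the values from the question
mild, extreme = iqr_outliers(data)
print(mild, extreme)  # [] [467]
```

Unlike the pseudocode, this sketch reports each value only in its most severe class, so an extreme outlier is not also listed as mild.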

This test is commonly used by box plots, where the whiskers mark the limits beyond which points are treated as outliers.

For your case (simple 1D univariate data), I think the methods above are well suited. They are not, however, applicable to multivariate data.

@smaclell suggested using k-means to find the outliers. Besides the fact that it is mainly a clustering algorithm (not really an outlier detection technique), the problem with k-means is that it requires knowing in advance a good value for the number of clusters K.

A better-suited technique is DBSCAN, a density-based clustering algorithm. Basically, it grows regions of sufficiently high density into clusters, each of which is a maximal set of density-connected points.

DBSCAN requires two parameters: epsilon and minPoints. It starts with an arbitrary point that has not been visited. It then finds all the neighboring points within distance epsilon of the starting point.

If the number of neighbors is greater than or equal to minPoints, a cluster is formed. The starting point and its neighbors are added to this cluster, and the starting point is marked as visited. The algorithm then repeats the evaluation process recursively for all the neighbors.

If the number of neighbors is less than minPoints, the point is marked as noise.

Once a cluster is fully expanded (all points within reach have been visited), the algorithm proceeds to iterate through the remaining unvisited points until none are left.

Finally, the set of all points marked as noise is considered to be the outliers.
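
The procedure described above can be sketched for 1D data in plain Python. This is only an illustration under stated assumptions: the function name `dbscan_1d`, the brute-force neighbor search, counting a point as its own neighbor, and the parameter values `epsilon=15` and `min_points=3` are all hypothetical choices, not part of the original answer. It uses a queue instead of recursion, and a point first marked as noise may later be absorbed into a cluster as a border point.

```python
from collections import deque

NOISE = -1


def dbscan_1d(values, epsilon, min_points):
    """Label each value with a cluster id (0, 1, ...) or NOISE."""
    n = len(values)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # Brute-force query; the point itself counts as a neighbor (assumption).
        return [j for j in range(n) if abs(values[i] - values[j]) <= epsilon]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue  # already visited
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = NOISE  # may still become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:  # expand the cluster iteratively
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_points:  # j is a core point, keep growing
                queue.extend(more)
    return labels


data = [10, 14, 25, 467, 12]  # the values from the question
labels = dbscan_1d(data, epsilon=15, min_points=3)
outliers = [x for x, label in zip(data, labels) if label == NOISE]
print(labels, outliers)  # [0, 0, 0, -1, 0] [467]
```

The result depends heavily on the two parameters, so they would need to be tuned to the scale of the real data.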
