Recommended anomaly detection technique for simple, one-dimensional scenario?


Question

I have a scenario where I have several thousand instances of data. The data itself is represented as a single integer value. I want to be able to detect when an instance is an extreme outlier.

For example, with the following example data:

a = 10
b = 14
c = 25
d = 467
e = 12

d is clearly an anomaly, and I would want to perform a specific action based on this.

I was tempted to just try and use my knowledge of the particular domain to detect anomalies. For instance, figure out a distance from the mean value that is useful, and check for that, based on heuristics. However, I think it's probably better if I investigate more general, robust anomaly detection techniques, which have some theory behind them.

Since my working knowledge of mathematics is limited, I'm hoping to find a technique which is simple, such as using standard deviation. Hopefully the single-dimensioned nature of the data will make this quite a common problem, but if more information for the scenario is required please leave a comment and I will give more info.


Edit: I thought I'd add more information about the data and what I've tried, in case it makes one answer more correct than another.

The values are all positive and non-zero. I expect that the values will form a normal distribution. This expectation is based on an intuition of the domain rather than on analysis; if this is not a bad thing to assume, please let me know. In terms of clustering, unless there are also standard algorithms for choosing a k-value, I would find it hard to provide this value to a k-Means algorithm.

The action I want to take for an outlier/anomaly is to present it to the user and recommend that the data point basically be removed from the data set (I won't get into how they would do that, but it makes sense for my domain), so that it is not used as input to another function.

So far I have tried the three-sigma and IQR outlier tests on my limited data set. IQR flags values which are not extreme enough; three-sigma points out instances which better fit my intuition of the domain.


Information on algorithms or techniques, or links to resources for learning about this specific scenario, are all valid and welcome answers.

What is a recommended anomaly detection technique for simple, one-dimensional data?

Solution

Check out the three-sigma rule:

mu  = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std  THEN  x is outlier
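
A minimal Python sketch of this rule, using only the standard library (the function name and the choice of the sample standard deviation are assumptions of mine, not part of the original answer):

import statistics

def three_sigma_outliers(values, k=3.0):
    # Flag values lying more than k standard deviations from the mean.
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)   # sample standard deviation
    return [x for x in values if abs(x - mu) > k * sigma]

# Usage, e.g. three_sigma_outliers(data). Note that on a handful of points an
# extreme value inflates sigma itself, so the rule is most reliable on larger
# samples such as the several-thousand-instance set described in the question.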

An alternative method is the IQR outlier test:

Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25         // inter-quartile range
IF (x < Q25 - 1.5*IQR) OR (Q75 + 1.5*IQR < x) THEN  x is a mild outlier
IF (x < Q25 - 3.0*IQR) OR (Q75 + 3.0*IQR < x) THEN  x is an extreme outlier
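
Again as a rough Python sketch (statistics.quantiles needs Python 3.8+, and the exact quartile convention is an implementation detail, so the fences may differ slightly from other tools):

import statistics

def iqr_outliers(values, factor=1.5):
    # factor=1.5 gives "mild" outliers, factor=3.0 gives "extreme" outliers.
    q1, _, q3 = statistics.quantiles(values, n=4)   # 25th, 50th, 75th percentiles
    iqr = q3 - q1
    lo, hi = q1 - factor * iqr, q3 + factor * iqr
    return [x for x in values if x < lo or x > hi]

# Usage: mild = iqr_outliers(data, 1.5); extreme = iqr_outliers(data, 3.0).
# Quartile estimates are unstable on very small samples, so results on the
# five-point example in the question may vary between conventions.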

This test is usually employed by box plots (it is what the whiskers indicate).


EDIT:

For your case (simple, univariate 1D data), I think my first answer is well suited. That, however, isn't applicable to multivariate data.

@smaclell suggested using K-means to find the outliers. Besides the fact that it is mainly a clustering algorithm (not really an outlier detection technique), the problem with k-means is that it requires knowing in advance a good value for the number of clusters K.

A better-suited technique is DBSCAN, a density-based clustering algorithm. Basically, it grows regions of sufficiently high density into clusters, which end up being maximal sets of density-connected points.

DBSCAN requires two parameters: epsilon and minPoints. It starts with an arbitrary point that has not been visited. It then finds all the neighbor points within distance epsilon of the starting point.

If the number of neighbors is greater than or equal to minPoints, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.

If the number of neighbors is less than minPoints, the point is marked as noise.

Once a cluster is fully expanded (all points within reach have been visited), the algorithm proceeds to iterate through the remaining unvisited points until they are exhausted.

Finally, all of the points marked as noise are considered outliers.
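
As a rough sketch of how this could look with scikit-learn's DBSCAN implementation (the eps and min_samples values below are illustrative guesses, not from the answer, and would need tuning to the real data's scale and density):

import numpy as np
from sklearn.cluster import DBSCAN

values = np.array([10, 14, 25, 467, 12], dtype=float)

# DBSCAN expects a 2-D feature matrix, so reshape the 1-D values to (n, 1).
X = values.reshape(-1, 1)

model = DBSCAN(eps=15.0, min_samples=3).fit(X)

# Points labelled -1 were not assigned to any cluster, i.e. noise/outliers.
outliers = values[model.labels_ == -1]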
