Finding outliers in a data set

Question

I have a Python script that creates a list of lists of server uptime and performance data, where each sub-list (or 'row') contains a particular cluster's stats. Nicely formatted, it looks something like this:

-------  -------------  ------------  ----------  -------------------
Cluster  %Availability  Requests/Sec  Errors/Sec  %Memory_Utilization
-------  -------------  ------------  ----------  -------------------
ams-a    98.099          1012         678          91
bos-a    98.099          1111         12           91
bos-b    55.123          1513         576          22
lax-a    99.110          988          10           89
pdx-a    98.123          1121         11           90
ord-b    75.005          1301         123          100
sjc-a    99.020          1000         10           88
...(so on)...

So in list form, it might look like:

[['ams-a', 98.099, 1012, 678, 91], ['bos-a', 98.099, 1111, 12, 91], ...]

My question: What's the best way to determine the outliers in each column? Or are outliers not necessarily the best way to attack the problem of finding 'badness'? In the data above, I'd definitely want to know about bos-b and ord-b, as well as ams-a since its error rate is so high, but the others can be discarded. Whether higher or lower is worse depends on the column, so I'm trying to figure out the most efficient way to do this. It seems like numpy gets mentioned a lot for this sort of thing, but I'm not sure where to even start with it (sadly, I'm more sysadmin than statistician...).

Thanks in advance!

Recommended answer

Your stated goal of "finding badness" implies that it is not the outliers that you are looking for, but observations that fall above or below some threshold, and I would presume that the threshold would remain the same over time.

As an example, if all of your servers were at 98 ± 0.1% availability, a server at 100% availability would be an outlier, as would a server at 97.6% availability. But both may be within your desired limits.
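To make that example concrete, here is a minimal sketch using only the Python standard library. The server names and readings are made up for illustration, and the modified z-score (median and MAD, with the commonly used 0.6745 scaling and 3.5 cutoff) is just one possible outlier test, swapped in here as an assumption rather than anything the question or answer prescribes.

```python
from statistics import median

# Hypothetical availability readings (%), clustered around 98 +/- 0.1,
# plus the two servers from the example: 100% and 97.6%.
availability = {
    "srv-a": 97.9, "srv-b": 98.0, "srv-c": 98.1, "srv-d": 98.0,
    "srv-e": 97.95, "srv-f": 98.05, "srv-g": 100.0, "srv-h": 97.6,
}

def mad_outliers(readings, cutoff=3.5):
    """Return names whose modified z-score exceeds the cutoff."""
    center = median(readings.values())
    mad = median(abs(v - center) for v in readings.values())
    if mad == 0:
        return []  # no spread at all, so this test cannot rank anything
    return [name for name, v in readings.items()
            if 0.6745 * abs(v - center) / mad > cutoff]

outliers = mad_outliers(availability)

# A fixed limit (95% is an assumed requirement) asks a different question.
below_limit = [name for name, v in availability.items() if v < 95.0]

print("statistical outliers:", outliers)
print("below 95% limit:", below_limit)
```

With these made-up numbers, both the 100% server and the 97.6% server are reported as outliers, yet no server falls below the 95% limit.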

On the other hand, there may be good a priori reasons to want to be notified of any server at less than 95% availability, whether there is one server below this threshold or many.
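A fixed-threshold check of this kind could be sketched against the rows from the question as follows. The column names, the RULES table, and the limits for errors and memory are assumptions chosen for illustration; only the 95% availability figure comes from the text above. Each rule carries its own direction, since lower is bad for availability while higher is bad for errors and memory.

```python
import operator

rows = [
    ["ams-a", 98.099, 1012, 678, 91],
    ["bos-a", 98.099, 1111, 12, 91],
    ["bos-b", 55.123, 1513, 576, 22],
    ["lax-a", 99.110, 988, 10, 89],
    ["pdx-a", 98.123, 1121, 11, 90],
    ["ord-b", 75.005, 1301, 123, 100],
    ["sjc-a", 99.020, 1000, 10, 88],
]

# Hypothetical column names for the five fields in each row.
COLUMNS = ["cluster", "availability", "requests_per_sec",
           "errors_per_sec", "memory_util"]

# column -> (comparison that means "bad", limit). Limits are assumed values.
RULES = {
    "availability": (operator.lt, 95.0),
    "errors_per_sec": (operator.gt, 100),
    "memory_util": (operator.gt, 95),
}

def find_violations(rows):
    """Map each offending cluster to the list of columns that broke a rule."""
    flagged = {}
    for row in rows:
        record = dict(zip(COLUMNS, row))
        bad = [col for col, (is_bad, limit) in RULES.items()
               if is_bad(record[col], limit)]
        if bad:
            flagged[record["cluster"]] = bad
    return flagged

flagged = find_violations(rows)
for cluster, columns in flagged.items():
    print(cluster, columns)
```

With these assumed limits, the check reports ams-a, bos-b and ord-b, which are the three clusters the question singles out.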

For this reason, a search for outliers may not provide the information that you are interested in. The thresholds could be determined statistically from historical data, e.g. by modeling the error rate as a Poisson variable or the percent availability as a beta variable. In an applied setting, these thresholds could probably be determined from performance requirements instead.
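As a rough sketch of the statistical route, the snippet below fits a Poisson rate to historical error counts and picks the smallest count that the fitted model would stay at or below 99.9% of the time. The history values, the function names, and the 0.999 coverage level are all assumptions for illustration, and real error counts may well be more variable than a Poisson model allows.

```python
import math
from statistics import mean

def poisson_cdf(k, rate):
    """P(X <= k) for X ~ Poisson(rate), summed term by term."""
    term = math.exp(-rate)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= rate / i    # P(X = i) from P(X = i - 1)
        total += term
    return total

def poisson_upper_limit(rate, coverage=0.999):
    """Smallest count k such that P(X <= k) >= coverage."""
    k = 0
    while poisson_cdf(k, rate) < coverage:
        k += 1
    return k

# Hypothetical errors/sec observed on a healthy cluster in the past.
history = [10, 12, 11, 9, 13, 10, 11, 12]

rate = mean(history)  # the sample mean is the maximum-likelihood Poisson rate
threshold = poisson_upper_limit(rate)

print("estimated rate:", rate)
print("alert when errors/sec exceeds:", threshold)
```

A reading such as the 678 errors/sec shown for ams-a would sit far above a limit derived this way. Availability could be handled analogously with a beta model, which is not shown here.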
