一维观测数据中检测异常值的 Pythonic 方法 [英] Pythonic way of detecting outliers in one dimensional observation data

查看:38
本文介绍了一维观测数据中检测异常值的 Pythonic 方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

对于给定的数据,我想将异常值(由 95% 置信水平或 95% 分位数函数或任何所需的定义)设置为 nan 值.以下是我现在正在使用的数据和代码.如果有人能进一步解释我,我会很高兴.

For the given data, I want to set the outlier values (defined by 95% confidense level or 95% quantile function or anything that is required) as nan values. Following is the my data and code that I am using right now. I would be glad if someone could explain me further.

import numpy as np, matplotlib.pyplot as plt

data = np.random.rand(1000)+5.0

plt.plot(data)
plt.xlabel('observation number')
plt.ylabel('recorded value')
plt.show()

推荐答案

使用 percentile 的问题是被识别为异常值的点是样本大小的函数.

The problem with using percentile is that the points identified as outliers is a function of your sample size.

测试异常值的方法有很多种,您应该考虑如何对它们进行分类.理想情况下,您应该使用先验信息(例如高于/低于此值的任何内容都是不切实际的,因为……")

There are a huge number of ways to test for outliers, and you should give some thought to how you classify them. Ideally, you should use a-priori information (e.g. "anything above/below this value is unrealistic because...")

然而,一个常见的、不太不合理的异常值测试是根据它们的中值绝对偏差"删除点.

However, a common, not-too-unreasonable outlier test is to remove points based on their "median absolute deviation".

这是 N 维情况的实现(来自此处论文的一些代码:https://github.com/joferkington/oost_paper_code/blob/master/utilities.py):

Here's an implementation for the N-dimensional case (from some code for a paper here: https://github.com/joferkington/oost_paper_code/blob/master/utilities.py):

def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False 
    otherwise.

    Parameters:
    -----------
        points : An numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor. 
    """
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh

这与我之前的一个答案非常相似,但我想详细说明样本量效应.

This is very similar to one of my previous answers, but I wanted to illustrate the sample size effect in detail.

让我们针对各种不同的样本量比较基于百分位数的异常值测试(类似于@CTZhu 的答案)和中值绝对偏差 (MAD) 测试:

Let's compare a percentile-based outlier test (similar to @CTZhu's answer) with a median-absolute-deviation (MAD) test for a variety of different sample sizes:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def main():
    for num in [10, 50, 100, 1000]:
        # Generate some data
        x = np.random.normal(0, 0.5, num-3)

        # Add three outliers...
        x = np.r_[x, -3, -10, 12]
        plot(x)

    plt.show()

def mad_based_outlier(points, thresh=3.5):
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh

def percentile_based_outlier(data, threshold=95):
    diff = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

def plot(x):
    fig, axes = plt.subplots(nrows=2)
    for ax, func in zip(axes, [percentile_based_outlier, mad_based_outlier]):
        sns.distplot(x, ax=ax, rug=True, hist=False)
        outliers = x[func(x)]
        ax.plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False)

    kwargs = dict(y=0.95, x=0.05, ha='left', va='top')
    axes[0].set_title('Percentile-based Outliers', **kwargs)
    axes[1].set_title('MAD-based Outliers', **kwargs)
    fig.suptitle('Comparing Outlier Tests with n={}'.format(len(x)), size=14)

main()

<小时>

请注意,无论样本大小如何,基于 MAD 的分类器都能正常工作,而基于百分位的分类器分类的点越多,样本量越大,无论它们是否实际上是异常值.

Notice that the MAD-based classifier works correctly regardless of sample-size, while the percentile based classifier classifies more points the larger the sample size is, regardless of whether or not they are actually outliers.

这篇关于一维观测数据中检测异常值的 Pythonic 方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆