Pythonic way of detecting outliers in one-dimensional observation data

Posted: 2021/12/11 14:16:16. Tags: python, numpy, matplotlib, statistics, statsmodels

Question

For the given data, I want to set the outlier values (defined by a 95% confidence level, a 95% quantile function, or whatever definition is appropriate) to NaN. Below are the data and the code I am using right now. I would be glad if someone could explain this further.

import numpy as np, matplotlib.pyplot as plt

data = np.random.rand(1000)+5.0

plt.plot(data)
plt.xlabel('observation number')
plt.ylabel('recorded value')
plt.show()

Recommended answer

The problem with using percentiles is that the number of points identified as outliers is a function of your sample size.
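
To make the sample-size dependence concrete, here is a minimal sketch. The helper name `count_percentile_flags` and the seed are assumptions made for illustration, not part of the original answer. It draws clean normal data with no planted outliers and counts how many points a two-sided 95% percentile rule flags:

```python
import numpy as np

def count_percentile_flags(data, threshold=95):
    """Count points outside the central `threshold` percent of `data`."""
    tail = (100 - threshold) / 2.0
    lo, hi = np.percentile(data, [tail, 100 - tail])
    return int(np.sum((data < lo) | (data > hi)))

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
flag_counts = {}
for n in [100, 1000, 10000]:
    clean = rng.normal(0.0, 0.5, n)  # no outliers were added
    flag_counts[n] = count_percentile_flags(clean)
    print(n, flag_counts[n])
```

Roughly 5% of the points are flagged at every sample size even though none of them is a genuine outlier, so the count grows in proportion to n.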

There are a huge number of ways to test for outliers, and you should give some thought to how you classify them. Ideally, you should use a priori information (e.g. "anything above/below this value is unrealistic because...").
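
As a rough sketch of the a priori approach, the snippet below masks readings outside fixed limits. The limits `LOWER` and `UPPER` and the two planted bad readings are hypothetical values assumed purely for illustration. In practice they would come from knowledge of the instrument or the process being measured.

```python
import numpy as np

# Hypothetical physical limits, assumed purely for illustration.
LOWER, UPPER = 5.0, 6.0

rng = np.random.default_rng(0)
data = rng.random(1000) + 5.0   # same kind of data as in the question
data[[10, 500]] = [42.0, -7.0]  # two planted, physically impossible readings

# np.where builds a new float array; the original data stays untouched.
cleaned = np.where((data < LOWER) | (data > UPPER), np.nan, data)
print(np.isnan(cleaned).sum())
```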

However, a common, not-too-unreasonable outlier test is to remove points based on their "median absolute deviation".
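
For reference, the modified z-score behind this test can be written out as follows for the 1-D case. The symbols $\tilde{x}$, $\mathrm{MAD}$ and $M_i$ are notation chosen here, not taken from the original answer:

```latex
\tilde{x} = \operatorname{median}_i (x_i), \qquad
\mathrm{MAD} = \operatorname{median}_i \lvert x_i - \tilde{x} \rvert, \qquad
M_i = \frac{0.6745 \, \lvert x_i - \tilde{x} \rvert}{\mathrm{MAD}}
```

A point is flagged when $M_i$ exceeds the chosen threshold. The constant 0.6745 is approximately the 75th percentile of the standard normal distribution, so for normally distributed data MAD/0.6745 estimates the standard deviation and $M_i$ sits on roughly the same scale as an ordinary z-score.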

Here's an implementation for the N-dimensional case (from some code for a paper here: https://github.com/joferkington/oost_paper_code/blob/master/utilities.py):

def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False 
    otherwise.

    Parameters:
    -----------
        points : A numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor. 
    """
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh
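
To answer the original request directly, a boolean mask like this can be used to overwrite flagged values with NaN. The sketch below is an illustration under stated assumptions: `mask_outliers_1d` is a hypothetical helper, it handles only 1-D input, and the fallback for a zero MAD (returning the data unmasked) is a choice made here, not something the referenced code does.

```python
import numpy as np

def mask_outliers_1d(values, thresh=3.5):
    """Return a float copy of 1-D `values` with MAD-based outliers set to NaN."""
    values = np.asarray(values, dtype=float)
    deviation = np.abs(values - np.median(values))
    mad = np.median(deviation)
    if mad == 0:
        # Assumption: with zero MAD the score is undefined, so flag nothing.
        return values.copy()
    modified_z_score = 0.6745 * deviation / mad
    return np.where(modified_z_score > thresh, np.nan, values)

rng = np.random.default_rng(0)
data = rng.random(1000) + 5.0   # data like that in the question
data[100] = 50.0                # one planted outlier
cleaned = mask_outliers_1d(data)
print(np.flatnonzero(np.isnan(cleaned)))
```

Plotting `cleaned` with `plt.plot` then leaves a gap at the masked positions, since matplotlib does not draw line segments through NaN values.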

This is very similar to one of my previous answers, but I wanted to illustrate the sample size effect in detail.

Let's compare a percentile-based outlier test (similar to @CTZhu's answer) with a median-absolute-deviation (MAD) test for a variety of different sample sizes:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def main():
    for num in [10, 50, 100, 1000]:
        # Generate some data
        x = np.random.normal(0, 0.5, num-3)

        # Add three outliers...
        x = np.r_[x, -3, -10, 12]
        plot(x)

    plt.show()

def mad_based_outlier(points, thresh=3.5):
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh

def percentile_based_outlier(data, threshold=95):
    diff = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

def plot(x):
    fig, axes = plt.subplots(nrows=2)
    for ax, func in zip(axes, [percentile_based_outlier, mad_based_outlier]):
        # sns.distplot is deprecated; kdeplot plus rugplot draws the same KDE curve and rug
        sns.kdeplot(x=x, ax=ax)
        sns.rugplot(x=x, ax=ax)
        outliers = x[func(x)]
        ax.plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False)

    kwargs = dict(y=0.95, x=0.05, ha='left', va='top')
    axes[0].set_title('Percentile-based Outliers', **kwargs)
    axes[1].set_title('MAD-based Outliers', **kwargs)
    fig.suptitle('Comparing Outlier Tests with n={}'.format(len(x)), size=14)

main()

Notice that the MAD-based classifier works correctly regardless of sample size, while the percentile-based classifier flags more points the larger the sample size is, regardless of whether or not they are actually outliers.
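
The same comparison can be checked numerically without plotting. This is a sketch only: the MAD function is simplified to the 1-D case, and the seed and the `results` dictionary are assumptions added for illustration.

```python
import numpy as np

def mad_based_outlier(points, thresh=3.5):
    points = np.asarray(points, dtype=float)
    diff = np.abs(points - np.median(points))  # 1-D case only
    return 0.6745 * diff / np.median(diff) > thresh

def percentile_based_outlier(data, threshold=95):
    tail = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [tail, 100 - tail])
    return (data < minval) | (data > maxval)

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
results = {}
for num in [100, 1000]:
    # Normal data plus the same three planted outliers used above.
    x = np.r_[rng.normal(0, 0.5, num - 3), -3, -10, 12]
    mad_mask = mad_based_outlier(x)
    pct_mask = percentile_based_outlier(x)
    results[num] = (mad_mask, pct_mask)
    print(num, int(mad_mask.sum()), int(pct_mask.sum()))
```

The percentile count scales with the sample size, while the MAD count should stay close to the three planted points.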
