Pandas: How to detect the peak points (outliers) in a dataframe?


Question

I have a pandas dataframe containing speed values from a sensor. The values change continuously, but because this is sensor data we often get erroneous readings at some points in the middle of the series. A moving average does not seem to help either. What methods can I use to remove these outliers or peak points from the data?

Example:

data points = {0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9}

In this data, the points 4, 4, 5, 6 are clearly outliers. I have already used a rolling mean with a 5-minute window to smooth the values, but I still get a lot of blips like these, which I want to remove. Can anyone suggest a technique to get rid of these points?

I have an image that gives a clearer view of the data:

As you can see, the data shows some outlier points that I need to remove. Any idea what a possible way to get rid of these points might be?
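To illustrate why the rolling mean mentioned above does not remove the blips, here is a minimal sketch on the sample data. It assumes a trailing window of 5 samples as a stand-in for the 5-minute window, since the actual sampling rate is not given. The mean spreads each spike over the neighbouring points instead of discarding it.

```python
import pandas as pd

# Sample data from the question
speed = pd.Series([0.5, 0.5, 0.7, 0.6, 0.5, 0.7, 0.5, 0.4, 0.6, 4,
                   0.5, 0.5, 4, 5, 6, 0.4, 0.7, 0.8, 0.9])

# Trailing rolling mean over 5 samples (assumed stand-in for the 5-minute window)
smoothed = speed.rolling(window=5, min_periods=1).mean()

# The spikes are lowered but smeared across more points than before
print(smoothed.round(2).tolist())
print("points above 1.0 before:", int((speed > 1).sum()))
print("points above 1.0 after: ", int((smoothed > 1).sum()))
```

The smoothed series still carries the spikes, only flattened and widened, which matches the behaviour described in the question.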

Recommended answer

I really think a z-score using scipy.stats.zscore() is the way to go here. Have a look at the related issue in this post, where the focus is on which method to use before removing potential outliers. As I see it, your challenge is a bit simpler: judging by the data provided, it should be pretty straightforward to identify potential outliers without having to transform the data. Below is a code snippet that does just that. Just remember that what does and does not look like an outlier depends entirely on your dataset. And after removing some outliers, a point that did not look like an outlier before may suddenly look like one. Have a look:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats

# your data (as a list)
data = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]

# initial plot
df1 = pd.DataFrame(data = data)
df1.columns = ['data']
df1.plot(style = 'o')

# Function to identify and remove outliers
def outliers(df, level):

    # 1. temporary dataframe (copy of the dataframe passed in, not the global df1)
    df = df.copy(deep = True)

    # 2. Select a level for a Z-score to identify and remove outliers
    df_Z = df[(np.abs(stats.zscore(df)) < level).all(axis=1)]
    ix_keep = df_Z.index

    # 3. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]

    return(df_keep)

Original data:

Test run 1: Z-score = 4:

As you can see, no data has been removed because the level was set too high.
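To see why a level of 4 removes nothing, it helps to look at the z-scores themselves. The sketch below computes them with plain pandas using the population standard deviation (ddof=0), which is the default in scipy.stats.zscore. The names `abs_z` and `kept_counts` are only illustrative. On this sample the largest absolute z-score is about 2.6, so no point reaches a threshold of 4.

```python
import pandas as pd

data = [0.5, 0.5, 0.7, 0.6, 0.5, 0.7, 0.5, 0.4, 0.6, 4,
        0.5, 0.5, 4, 5, 6, 0.4, 0.7, 0.8, 0.9]
df = pd.DataFrame({"data": data})

# Population z-score (ddof=0), the same default scipy.stats.zscore uses
z = (df["data"] - df["data"].mean()) / df["data"].std(ddof=0)
abs_z = z.abs()

print("largest |z|:", round(abs_z.max(), 2))

# Number of rows that survive a single pass at each level
kept_counts = {level: int((abs_z < level).sum()) for level in (4, 2, 1.2)}
print(kept_counts)
```

Because the spikes themselves inflate the mean and standard deviation, even the value 6 ends up with a modest z-score, which is why the level has to be set fairly low on a sample this small.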

Test run 2: Z-score = 2:

Now we're getting somewhere. Two outliers have been removed, but there is still some dubious data left.

Test run 3: Z-score = 1.2:

This is looking really good. The remaining data now seems to be a bit more evenly distributed than before. But now the highlighted data point is starting to look a bit like a potential outlier. So where to stop? That's going to be entirely up to you!
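The "where to stop" question can be made concrete by recomputing the z-scores on the cleaned data. This is a sketch under the assumption that you would apply the same level a second time. The answer itself only runs a single pass, and the helper `zscores` is a hypothetical name introduced here.

```python
import pandas as pd

def zscores(s):
    # Population z-score (ddof=0), matching the scipy.stats.zscore default
    return (s - s.mean()) / s.std(ddof=0)

data = [0.5, 0.5, 0.7, 0.6, 0.5, 0.7, 0.5, 0.4, 0.6, 4,
        0.5, 0.5, 4, 5, 6, 0.4, 0.7, 0.8, 0.9]
s = pd.Series(data)

# First pass at level 1.2 drops the four obvious spikes
first_pass = s[zscores(s).abs() < 1.2]

# Recompute on what is left: mean and spread are now much smaller,
# so ordinary-looking values get large z-scores
z_again = zscores(first_pass).abs()
print("most extreme remaining value:", first_pass[z_again.idxmax()])
print("its recomputed |z|:", round(z_again.max(), 2))

# A second pass at the same level would start removing plausible readings
second_pass = first_pass[z_again < 1.2]
print(len(s), len(first_pass), len(second_pass))
```

Once the spikes are gone, a value such as 0.9 stands out against the tighter distribution, so repeating the filter keeps shrinking the dataset. Choosing the level and the number of passes remains a judgment call about your data.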

Here's the whole thing for an easy copy and paste:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats

# your data (as a list)
data = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]

# initial plot
df1 = pd.DataFrame(data = data)
df1.columns = ['data']
df1.plot(style = 'o')

# Function to identify and remove outliers
def outliers(df, level):

    # 1. temporary dataframe (copy of the dataframe passed in, not the global df1)
    df = df.copy(deep = True)

    # 2. Select a level for a Z-score to identify and remove outliers
    df_Z = df[(np.abs(stats.zscore(df)) < level).all(axis=1)]
    ix_keep = df_Z.index

    # 3. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]

    return(df_keep)

# remove outliers
level = 1.2
print("df_clean = outliers(df = df1, level = " + str(level)+')')
df_clean = outliers(df = df1, level = level)

# final plot
df_clean.plot(style = 'o')
