How can I replace outliers with the mean of the previous and next neighbours?
Problem description
I have a really large dataset, obtained by beating two laser frequencies against each other and reading out the beat frequency with a frequency counter.
The problem is that I have a lot of outliers in my dataset.
Filtering is not an option, since filtering out or removing the outliers destroys precious information for the Allan deviation I use to analyze my beat frequency.
The problem with removing the outliers is that I want to compare the Allan deviations of three different beat frequencies. If I remove some points, my x-axis becomes shorter than before and the Allan deviation x-axis scales differently. (The adev basically builds a new x-axis, starting at the interval given by my sample rate and running up to my longest measurement time, which is the highest x-axis value of my beat frequency data.)
Sorry if this is confusing, I wanted to give as much information as possible.
Anyway, what I have done so far is get my whole Allan deviation working and remove the outliers successfully, by chopping my list into intervals and comparing all y-values of each interval to the standard deviation of that interval.
What I want to change now is that, instead of removing the outliers, I want to replace them with the mean of their previous and next neighbours.
Below you can find my test code for a list with outliers. It seems to have a problem with using numpy `where`, and I don't really understand why.
The error is given as "'numpy.int32' object has no attribute 'where'". Do I have to convert my dataset to a pandas structure?
What the code is meant to do is search for values above/below my threshold, replace them with NaN, and then replace the NaN with my mean. I'm not really attached to the NaN replacement, so I would be very grateful for any help.
import numpy as np

l = np.array([[0,4],[1,3],[2,25],[3,4],[4,28],[5,4],[6,3],[7,4],[8,4]])
print(*l)
sd = np.std(l[:,1])
print(sd)
for i in l[:,1]:
    if l[i,1] > sd:
        print(l[i,1])
        l[i,1].where(l[i,1].replace(to_replace = l[i,1], value = np.nan),
                     other = (l[i,1].fillna(method='ffill')+l[i,1].fillna(method='bfill'))/2)
So what I want is a list/array in which the outliers are replaced by the mean of their previous and following neighbours.
Error message: 'numpy.int32' object has no attribute 'where'
Recommended answer
One option is indeed to convert the data to a pandas DataFrame (here `data` stands for your array):
import pandas as pd
dataset = pd.DataFrame({'Column1':data[:,0],'Column2':data[:,1]})
That will resolve the error, since pandas DataFrame objects have a `where` method. However, converting is not obligatory, and we can still work with just numpy.
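The error in the question arises because `l[i,1]` is a single NumPy scalar, and `where`, `replace` and `fillna` are methods of pandas objects, not of NumPy scalars. If you do want to go the pandas route, the NaN-and-fill idea from the question could look roughly like the sketch below. The threshold factor `k = 2` and the variable names are assumptions for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd

l = np.array([[0, 4], [1, 3], [2, 25], [3, 4], [4, 28], [5, 4], [6, 3], [7, 4], [8, 4]])

# Work on a float Series so that NaN and half-integer means can be stored.
y = pd.Series(l[:, 1], dtype=float)

# Flag points outside mean +/- k*std (ddof=0 matches the default of np.std).
k = 2  # assumed threshold factor
is_outlier = (y - y.mean()).abs() > k * y.std(ddof=0)

# Blank out the outliers, then average the nearest valid value on each side.
masked = y.mask(is_outlier)
neighbour_mean = (masked.ffill() + masked.bfill()) / 2
cleaned = y.where(~is_outlier, neighbour_mean)

print(cleaned.to_list())
```

The result has the same length as the input, so the time axis is unchanged. Note that an outlier at the very first or last position has no valid value on one side and would stay NaN in this sketch.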
With plain numpy, the simplest way to detect outliers is to check whether a value lies outside the range mean ± 3·std. The code example below uses your data with a tighter band of mean ± 2·std; with this small sample, mean ± 3·std flags no points at all.
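import numpy as np
# Use floats so that a mean such as 3.5 is not truncated when assigned back
l = np.array([[0,4],[1,3],[2,25],[3,4],[4,28],[5,4],[6,3],[7,4],[8,4]], dtype=float)
std = np.std(l[:,1])
mean = np.mean(l[:,1])
for i in range(len(l[:,1])):
    if((l[i,1]<=mean+2*std)&(l[i,1]>=mean-2*std)):
        pass  # inside mean +/- 2*std: keep the value
    else:
        if (i!=len(l[:,1])-1)&(i!=0):
            # interior outlier: mean of the previous and next neighbour
            l[i,1]=(l[i-1,1]+l[i+1,1])/2
        else:
            # first or last element has only one neighbour: use the overall mean
            l[i,1]=mean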
Here we first check whether the value lies inside the accepted range, in which case it is not an outlier and nothing is done:
    if((l[i,1]<=mean+2*std)&(l[i,1]>=mean-2*std)):
        pass
Otherwise, we check that it is not the first or last element:
        if (i!=len(l[:,1])-1)&(i!=0):
If it is the first or last element, we simply assign the overall mean:
        else:
            l[i,1]=mean
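For a really large dataset, the same idea can be written without a Python loop. The sketch below is one possible variant, not the method above: it swaps in linear interpolation via `np.interp`, which for a single isolated outlier on evenly spaced samples equals the mean of the previous and next neighbour. The function name `replace_outliers` and the parameter `k` are hypothetical names introduced here.

```python
import numpy as np

def replace_outliers(y, k=2.0):
    """Return a float copy of y with points outside mean +/- k*std interpolated."""
    y = np.asarray(y, dtype=float)
    is_outlier = np.abs(y - y.mean()) > k * y.std()
    idx = np.arange(y.size)
    cleaned = y.copy()
    # Interpolate over the sample index using only the non-outlier points.
    # For one isolated outlier this is exactly (previous + next) / 2.
    cleaned[is_outlier] = np.interp(idx[is_outlier], idx[~is_outlier], y[~is_outlier])
    return cleaned

l = np.array([[0, 4], [1, 3], [2, 25], [3, 4], [4, 28], [5, 4], [6, 3], [7, 4], [8, 4]])
cleaned = replace_outliers(l[:, 1])
print(cleaned)
```

The output has the same length as the input, so the x-axis used for the Allan deviation is unchanged. Two differences from the loop are worth knowing: runs of consecutive outliers are bridged by a straight line between the nearest valid points, and an outlier at either end takes the nearest valid value rather than the overall mean. With the test data and `k=2`, only the value 28 is flagged; 25 stays inside the band, so a global threshold may need tuning, or the per-interval comparison described in the question may be a better fit.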