How can I remove sharp jumps in data?

Problem description


I have some skin temperature data (collected at 1 Hz) that I intend to analyse.

However, the sensors were not always in contact with the skin. So I have a challenge of removing this non-skin temperature data, whilst preserving the actual skin temperature data. I have about 100 files to analyse, so I need to make this automated.

I'm aware that there is already a similar post, but I have not been able to use it to solve my problem.

My data roughly looks like this:

df =

timeStamp                 Temp
2018-05-04 10:08:00       28.63
         .                  . 
         .                  .
2018-05-04 21:00:00       31.63

The first step I've taken is to simply apply a minimum threshold, which got rid of the majority of the non-skin data. However, I'm left with the sharp jumps where the sensor was either removed or attached.

To remove these jumps, I was thinking about taking an approach where I use the first order differential of the temp and then use another set of thresholds to get rid of the data I'm not interested in.

e.g.

df_diff = df.diff(60) # period of about 60 makes jumps stick out

# when the diff is less than -1 or greater than 0.5, it is most likely a data jump
filter_index = np.nonzero(((df_diff.Temp < -1) | (df_diff.Temp > 0.5)).to_numpy())

However, I find myself stuck here. The main problem is that:

1) I don't know how to use this index list to delete the non-skin data in df. What is the best way to do this?

The more minor problem is that:

2) I think I will still be left with some residual artefacts from the data jumps near the edges (e.g. where a tighter threshold would start to throw away good data). Is there either a better filtering strategy or a way to then get rid of these artefacts?

*Edit: as suggested, I've also calculated the second order diff, but to be honest, I think the first order diff would allow for tighter thresholds (see below):

*Edit 2: Link to sample data

Solution

Try the code below (I used a tangent function to generate data). I used the second order difference idea from Mad Physicist in the comments.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame()
df[0] = np.arange(0,10,0.005)
df[1] = np.tan(df[0])

# Absolute value of half the (negated) second difference,
# x[i] - (x[i-1] + x[i+1])/2: large wherever the curve bends or jumps sharply.
df[2] = 0.5*(df[1].diff()+df[1].diff(periods=-1)).abs()

df.loc[df[2] < .05][1].plot() # keep only rows below the threshold (sharp changes filtered out)
df[1].plot()                  # plot original data

plt.show()

Following is a zoom of the output showing what got filtered. Matplotlib plots a line from beginning to end of the removed data.

Your first question I believe is answered with the .loc selection above.
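As a minimal sketch of how that selection might look on data shaped like the question's, the block below builds a made-up 1 Hz recording with a `Temp` column and applies a boolean mask. The synthetic values, the 27.0 minimum and the 0.5 step threshold are all assumptions for illustration and would need tuning on real recordings.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one recording: 1 Hz samples with a sensor-off gap.
idx = pd.date_range("2018-05-04 10:08:00", periods=600,
                    freq=pd.Timedelta(seconds=1))
temp = np.full(600, 31.0)
temp[200:300] = 24.0                      # sensor lifted off the skin (made-up values)
df = pd.DataFrame({"Temp": temp}, index=idx)

# Absolute sample-to-sample change; the first row has no predecessor (NaN).
step = df["Temp"].diff().abs()

# Rows to keep: above the minimum threshold AND not part of a sharp step.
# Both numbers are assumed thresholds, not values from the original post.
keep = (df["Temp"] > 27.0) & (step.fillna(0) < 0.5)

cleaned = df.loc[keep]                    # drop the flagged rows entirely
masked = df.where(keep)                   # or keep the time grid and blank them as NaN
```

Using a boolean mask avoids the integer index list from `np.nonzero` altogether. `df.where` may be preferable when later analysis relies on evenly spaced timestamps.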

Your second question will take some experimentation with your dataset. The code above only filters out high-derivative data. You'll also need your threshold selection to remove zeroes or the like. You can experiment with where to place the derivative cut-off. You can also plot a histogram of the derivative to get a hint as to where to put it.
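One possible way to read a cut-off from such a histogram is sketched below. The data are synthetic (small noise plus one large offset), and the "first empty bin" rule is only an illustrative heuristic, not something taken from the answer.

```python
import numpy as np
import pandas as pd

# Synthetic series: small sensor noise plus one sensor-off episode (assumed values).
rng = np.random.default_rng(0)
temp = 31.0 + 0.02 * rng.standard_normal(1000)
temp[400:500] -= 6.0
s = pd.Series(temp)

# Absolute first difference, dropping the undefined first element.
step = s.diff().abs().dropna()

# Histogram of the step sizes: ordinary noise piles up near zero,
# while the jumps sit far out in the tail.
counts, edges = np.histogram(step, bins=50)

# Illustrative heuristic: put the threshold at the left edge of the first empty bin.
first_empty = int(np.argmax(counts == 0))
threshold = edges[first_empty]

flagged = step > threshold                # True only for the jump samples
```

On real recordings the gap between noise and jumps may be less clean, so it is worth looking at the histogram (for example with `step.plot.hist(bins=50)`) before trusting any automatic rule.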

Also, higher order difference equations are possible to help with smoothing. This should help remove artifacts without having to trim around the cuts.
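If some trimming around the cuts turns out to be necessary anyway, one option is to widen the flagged region by a few samples on each side. This is a sketch of that alternative, not part of the answer's method. The padding of 5 samples and the 0.5 threshold are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up series with a single downward jump at position 50.
temp = np.full(100, 31.0)
temp[50:] = 24.0
s = pd.Series(temp)

# Flag samples whose step from the previous sample is large (assumed threshold).
flagged = s.diff().abs() > 0.5

# Widen each flagged sample by `pad` samples on both sides using a centred
# rolling maximum over the 0/1 mask.
pad = 5
widened = (
    flagged.astype(float)
    .rolling(2 * pad + 1, center=True, min_periods=1)
    .max()
    .astype(bool)
)

trimmed = s[~widened]                     # data with the jump and its margins removed
```

A larger `pad` removes more of the transition at the cost of discarding more good data, which is the trade-off raised in the question.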

Edit:

A fourth-order finite difference can be applied using this:

df[2] = (df[1].diff(periods=1)-df[1].diff(periods=-1))*8/12 - \
    (df[1].diff(periods=2)-df[1].diff(periods=-2))*1/12
df[2] = df[2].abs()

It's reasonable to think that it may help. The coefficients above can be worked out by hand, and coefficients for higher orders can be obtained from the Finite Difference Coefficients Calculator.

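Note: The expressions above are raw differences, not proper derivatives. The fourth-order central difference must be divided by the interval length (in this case 0.005) to give the actual first derivative. The first snippet, as written with `+`, computes half the negated second difference, so it scales with the square of the interval instead.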
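The scaling can be checked numerically on a function with a known derivative. The sketch below uses `sin` on the same grid spacing as the answer (0.005). The choice of test function and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

h = 0.005
x = np.arange(0, 10, h)
y = pd.Series(np.sin(x))

# Fourth-order central difference, written as in the answer (unscaled).
raw4 = (y.diff(periods=1) - y.diff(periods=-1)) * 8 / 12 \
     - (y.diff(periods=2) - y.diff(periods=-2)) * 1 / 12

# Dividing by the interval length gives the first derivative, here cos(x).
d1 = raw4 / h
err_first = (d1 - np.cos(x)).abs().max()       # NaN edge rows are skipped by max()

# The first snippet's expression, 0.5*(diff() + diff(-1)), equals
# x[i] - (x[i-1] + x[i+1])/2, i.e. about -0.5 * h**2 * f''(x).
raw2 = 0.5 * (y.diff() + y.diff(periods=-1))
d2 = raw2 / h**2                               # for sin this approaches 0.5*sin(x)
err_second = (d2 - 0.5 * np.sin(x)).abs().max()
```

Since only the absolute value is compared against a threshold, the missing scale factor does not change which rows are selected for a fixed sampling rate. It does matter when reusing a threshold across data sampled at different rates.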
