Why is slope not a good measure of trends for data?


Problem description


Following the advice of this post on Analyzing trends in data with pandas, I have used numpy's polyfit on several data sets I have. However, it does not let me tell when there is a trend and when there isn't. I wonder what I am misunderstanding.

First, the code is the following:

import pandas
import matplotlib.pyplot as plt
import numpy as np


file="data.csv"


df= pandas.read_csv(file,delimiter=',',header=0)

selected=df.loc[(df.index>25)&(df.index<613)]
xx=np.arange(25,612)

y= selected[selected.columns[1]].values
    
df.plot()
plt.plot(xx,y)
plt.xlabel("seconds")


coefficients, residuals, _, _, _ = np.polyfit(range(25,25+len(y)),y,1,full=True)

plt.plot(xx,[coefficients[0]*x + coefficients[1] for x in range(25,25+len(y))])


mse = residuals[0]/(len(y))
nrmse = np.sqrt(mse)/(y.max() - y.min())
print('Slope ' + str(coefficients[0]))
print('Degree '+str(np.degrees(np.arctan(coefficients[0]))))
print('NRMSE: ' + str(nrmse))
print('Max-Min '+str((y.max()-y.min())))

I trimmed the first and last 25 points of data. As a result I got the following:

I can clearly see that there is an increasing trend in the data. For the results I got

Slope 397.78399534197837
Degree 89.85596288567513
NRMSE: 0.010041127178789659
Max-Min 257824

and with this data

I got

Slope 349.74410929666203
Degree 89.83617844631047
NRMSE: 0.1482879344688465
Max-Min 430752

However with this data

I got

Slope 29.414468649823373
Degree 88.05287249703134
NRMSE: 0.3752760050624873
Max-Min 673124

As you can see, in this case there is not much of a tendency to increase, so the slope is smaller.

However here

again has a big slope

Slope 228.34551214653814
Degree 89.74908456620851
NRMSE: 0.3094116937517223
Max-Min 581600

I can't understand why the slope does not clearly indicate the trends (much less the degrees).

A second thing that disconcerts me is that the slope depends on how much the data varies along the Y axis. For example, with data that varies little, the slope is close to 0:

Slope 0.00017744046645062043
Degree 0.010166589735754468
NRMSE: 0.07312155589459704
Max-Min 11.349999999999998

What is a good way to detect a trend in data, independent of its magnitude?

Solution

The idea is to check whether the linear fit shows a significant increase compared to the fluctuation of the data around the fit:

In the bottom panel, you see that the trend (the fit minus the constant part) exceeds the residuals (defined as the difference between data and fit). What counts as a good criterion for a 'significant increase' depends on the type of data and on how many values you have along the x axis. I suggest that you take the root mean square (RMS) of the residuals. If the trend in the fit exceeds some threshold (relative to the residuals), you call it a significant trend. A suitable value of the threshold needs to be established by trial and error.
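
As a rough illustration of the criterion just described, the sketch below wraps the comparison in a small helper. The name `trend_to_noise` and the synthetic data are assumptions made for illustration; they are not part of the original answer. Because the fitted rise and the residual RMS both scale with the data, their ratio should stay the same when y is multiplied by a constant, whereas the raw slope does not.

```python
import numpy as np

def trend_to_noise(x, y):
    """Ratio of the fitted rise over the x-range to the RMS of the residuals.

    Hypothetical helper: the name and return convention are illustrative.
    Assumes x is sorted in increasing order.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a1, a0 = np.polyfit(x, y, 1)          # slope and intercept of the linear fit
    resid = y - (a1 * x + a0)             # fluctuation of the data around the fit
    rms = np.sqrt(np.mean(resid ** 2))
    dy_trend = a1 * (x[-1] - x[0])        # total rise of the fit across the x-range
    return dy_trend / rms

rng = np.random.default_rng(0)            # fixed seed, only to make the sketch repeatable
x = np.arange(25, 600)
y = 0.5 * x + rng.normal(scale=30.0, size=x.shape)   # made-up data with a mild trend

ratio_small = trend_to_noise(x, y)
ratio_large = trend_to_noise(x, 1000.0 * y)   # same shape, 1000 times the magnitude
slope_small = np.polyfit(x, y, 1)[0]
slope_large = np.polyfit(x, 1000.0 * y, 1)[0]

print(f'ratio: {ratio_small:.3g} vs {ratio_large:.3g}')
print(f'slope: {slope_small:.3g} vs {slope_large:.3g}')
```

A threshold on this ratio plays the same role as `threshold` in the answer's code, and taking the absolute value of the ratio would also flag decreasing trends.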

Here is the code generating the plots above:

import numpy as np
import matplotlib.pyplot as plt

# example data
x = np.arange(25, 600)
y = 1.76e7 + 3e5/600*x + 1e5*np.sin(x*0.2)
y += np.random.normal(scale=3e4, size=x.shape)

# process
a1, a0 = np.polyfit(x, y, 1)
resid = y - (a1*x + a0) # array
rms = np.sqrt((resid**2).mean())
plt.close('all')

fig, ax = plt.subplots(2, 1)
ax[0].plot(x, y, label='data')
ax[0].plot(x, a1*x+a0, label='fit')
ax[0].legend()
ax[1].plot(x, resid, label='residual')
ax[1].plot(x, a1*(x-x[0]), label='trend')
ax[1].legend()

dy_trend = a1*(x[-1] - x[0])
threshold = 0.3

print(f'dy_trend={dy_trend:.3g}; rms={rms:.3g}')

if dy_trend > threshold*rms:
    print('Significant trend')

Output:

dy_trend=2.87e+05; rms=7.76e+04
Significant trend
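
On the 'Degree' values in the question: `np.degrees(np.arctan(slope))` treats the slope as if both axes had the same units, so the angle depends on the units chosen for x and y. When the slope is far above 1 in data units, the angle sits close to 90 degrees and barely distinguishes one data set from another. A minimal sketch using slopes reported in the question; the rescaling factor of 1000 is an arbitrary assumption for illustration.

```python
import numpy as np

# Slopes reported in the question (y units per second).
slopes = np.array([397.78399534197837, 29.414468649823373, 0.00017744046645062043])

# Angle as computed in the question: nearly 90 degrees once the slope is well above 1.
angles = np.degrees(np.arctan(slopes))

# Expressing y in units 1000 times larger (an arbitrary choice) divides every slope
# by 1000 and changes the angles, although the data themselves are unchanged.
angles_rescaled = np.degrees(np.arctan(slopes / 1000.0))

for s, a, b in zip(slopes, angles, angles_rescaled):
    print(f'slope={s:.3g}: angle={a:.2f} deg, after rescaling y: {b:.2f} deg')
```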
