Why is slope not a good measure of trends for data?

Views: 248 Published: 2020/10/15 21:36:19 Tags: python pandas numpy data-analysis

Question

Following the advice of this post on Analyzing trends in data with pandas, I have used numpy's polyfit on several data sets I have. However, it does not let me tell when there is a trend and when there isn't. I wonder what I am misunderstanding.

First, here is the code:

import pandas
import matplotlib.pyplot as plt
import numpy as np

file = "data.csv"
df = pandas.read_csv(file, delimiter=',', header=0)

# Keep only rows with index 26..612 (trims the start and end of the series)
selected = df.loc[(df.index > 25) & (df.index < 613)]
y = selected[selected.columns[1]].values
xx = np.arange(25, 25 + len(y))  # x values used for both plotting and fitting

df.plot()
plt.plot(xx, y)
plt.xlabel("seconds")

# Degree-1 least-squares fit; full=True also returns the sum of squared residuals
coefficients, residuals, _, _, _ = np.polyfit(xx, y, 1, full=True)
plt.plot(xx, coefficients[0] * xx + coefficients[1])

# RMSE of the fit, normalized by the range of the data
mse = residuals[0] / len(y)
nrmse = np.sqrt(mse) / (y.max() - y.min())
print('Slope ' + str(coefficients[0]))
print('Degree ' + str(np.degrees(np.arctan(coefficients[0]))))
print('NRMSE: ' + str(nrmse))
print('Max-Min ' + str(y.max() - y.min()))

I trimmed the first and last 25 points of data. As a result I got the following:

I can clearly see that there is an increasing trend in the data. For the results I got

Slope 397.78399534197837
Degree 89.85596288567513
NRMSE: 0.010041127178789659
Max-Min 257824

and with this data

I got

Slope 349.74410929666203
Degree 89.83617844631047
NRMSE: 0.1482879344688465
Max-Min 430752

However with this data

I got

Slope 29.414468649823373
Degree 88.05287249703134
NRMSE: 0.3752760050624873
Max-Min 673124

As you can see, in this case there is not much of a tendency to increase, so the slope is smaller.

However, this data

again has a large slope

Slope 228.34551214653814
Degree 89.74908456620851
NRMSE: 0.3094116937517223
Max-Min 581600

I can't understand why the slope does not clearly indicate the tendencies (and the degrees even less so).

A second thing that disconcerts me is that the slope depends on how much the data varies along the Y axis. For example, with data that varies only a little, the slope is close to 0:

Slope 0.00017744046645062043
Degree 0.010166589735754468
NRMSE: 0.07312155589459704
Max-Min 11.349999999999998

What is a good way to detect a trend in data, independent of its magnitude?

Solution

The idea is to check whether the linear fit shows a significant increase compared to the fluctuation of the data around the fit:

In the bottom panel, you can see that the trend (the fit minus its constant part) exceeds the residuals (defined as the difference between data and fit). What counts as a good criterion for a 'significant increase' depends on the type of data and also on how many values you have along the x axis. I suggest taking the root mean square (RMS) of the residuals. If the trend in the fit exceeds some threshold (relative to the residuals), you call it a significant trend. A suitable value for the threshold needs to be established by trial and error.

Here is the code generating the plots above:

import numpy as np
import matplotlib.pyplot as plt

# example data
x = np.arange(25, 600)
y = 1.76e7 + 3e5/600*x + 1e5*np.sin(x*0.2)
y += np.random.normal(scale=3e4, size=x.shape)

# process
a1, a0 = np.polyfit(x, y, 1)
resid = y - (a1*x + a0) # array
rms = np.sqrt((resid**2).mean())
plt.close('all')

fig, ax = plt.subplots(2, 1)
ax[0].plot(x, y, label='data')
ax[0].plot(x, a1*x+a0, label='fit')
ax[0].legend()
ax[1].plot(x, resid, label='residual')
ax[1].plot(x, a1*(x-x[0]), label='trend')
ax[1].legend()

dy_trend = a1*(x[-1] - x[0])
threshold = 0.3

print(f'dy_trend={dy_trend:.3g}; rms={rms:.3g}')

if dy_trend > threshold*rms:
    print('Significant trend')

Output:

dy_trend=2.87e+05; rms=7.76e+04
Significant trend
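
Since the question asks for a measure that is independent of magnitude, the same criterion can be written as a ratio: the net change of the fitted line divided by the RMS of the residuals. The following is only a minimal sketch under some assumptions: `trend_ratio` is a hypothetical helper name that does not appear in the answer, the fixed random seed and the 1e-6 scale factor are chosen purely for illustration, and the data is the same synthetic example as above. It suggests that rescaling y changes the slope by the same factor but leaves the ratio essentially unchanged.

```python
import numpy as np

def trend_ratio(x, y):
    """Net change of the fitted line over the x range, divided by the RMS
    of the residuals. Hypothetical helper, not part of the original answer."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a1, a0 = np.polyfit(x, y, 1)        # slope and intercept of the linear fit
    resid = y - (a1 * x + a0)           # data minus fit
    rms = np.sqrt(np.mean(resid ** 2))  # typical size of the fluctuations
    dy_trend = a1 * (x[-1] - x[0])      # total rise (or fall) of the fit
    return dy_trend / rms               # undefined if the fit is exact (rms == 0)

# Same synthetic data as in the answer, with a fixed seed (an assumption)
rng = np.random.default_rng(0)
x = np.arange(25, 600)
y = 1.76e7 + 3e5 / 600 * x + 1e5 * np.sin(x * 0.2)
y += rng.normal(scale=3e4, size=x.shape)

# Compare the original data with a copy shrunk by a factor of one million
slope_large = np.polyfit(x, y, 1)[0]
slope_small = np.polyfit(x, y * 1e-6, 1)[0]
ratio_large = trend_ratio(x, y)
ratio_small = trend_ratio(x, y * 1e-6)

print(f'slopes: {slope_large:.3g} vs {slope_small:.3g}')
print(f'ratios: {ratio_large:.3g} vs {ratio_small:.3g}')
```

Comparing this ratio against the threshold is the same test as `dy_trend > threshold*rms` above. To treat decreasing trends the same way, one could compare the absolute value of the ratio instead; the threshold itself still has to be tuned by trial and error.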
