Custom-made regression using average absolute deviation
Question
Following this post, I now seriously doubt whether R-squared or the F-test are good indicators of a good linear fit to data with random noise. Hence, I want to develop a custom-made regression function, so that I can both learn how it works and perhaps improve upon the existing tools.
Consider these randomly generated ndarrays x and y:
import numpy as np
np.random.seed(42)
x = np.random.rand(30) * 10
y = 1.5 * x + 0.3 + (np.random.rand(30) - 0.5) * 3.5
Now I can define the average (mean) absolute deviation of any set of data points with:
def aad(X, Y, a, b): # assumes X and Y are of the identical shape/size
    n = X.size # highly unsafe!
    U = (a * X + Y - b) / 2 / a
    V = (a * X + Y + b) / 2
    E = np.sqrt(np.power((X - U), 2) + np.power((Y - V), 2))
    return E.sum() / n
which, in my opinion, is the best way to quantify how well a line y = a * x + b fits the data points. The function finds, for each data point, the closest point on the assumed line and then computes the perpendicular distance between the data point and the line.
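For reference, the perpendicular distance from a point (X, Y) to the line y = a * x + b has the closed form |a * X - Y + b| / sqrt(1 + a^2). Below is a minimal sketch of that formula next to the per-point distance from aad; the names perpendicular_distance and posted_distance are hypothetical and do not appear in the post. Working through the posted expressions, (U, V) comes out as the midpoint between the horizontal and vertical projections of the point onto the line, so the posted distance should equal the perpendicular distance multiplied by (1 + a^2) / (2 * |a|), and the two would coincide only when |a| = 1. The sketch checks this numerically rather than taking it on trust.

```python
import numpy as np

def perpendicular_distance(a, b, X, Y):
    """Distance from each point (X, Y) to the line y = a*x + b (hypothetical helper)."""
    return np.abs(a * X - Y + b) / np.sqrt(1.0 + a * a)

def posted_distance(a, b, X, Y):
    """Per-point distance computed exactly as in the post's aad()."""
    U = (a * X + Y - b) / 2 / a
    V = (a * X + Y + b) / 2
    return np.sqrt((X - U) ** 2 + (Y - V) ** 2)

# Small made-up example, chosen only for illustration.
X = np.array([0.0, 1.0, 4.0])
Y = np.array([1.0, 0.5, 9.0])
a, b = 2.0, 0.0

exact = perpendicular_distance(a, b, X, Y)
posted = posted_distance(a, b, X, Y)
ratio = (1.0 + a * a) / (2.0 * abs(a))

print(exact)
print(posted)
print(np.allclose(posted, exact * ratio))
```

If the check holds, minimizing aad is not the same problem as minimizing the mean perpendicular distance, because the extra factor depends on the slope a.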
Now I need a function, say:
linearFit(X, Y)
which, given identically shaped ndarrays X and Y, finds the a and b that minimize aad(X, Y, a, b). It is important that the result is the global minimum, not just a local one.
Of course, in the spirit of SO's best practices, I have already tried the scipy.optimize functions fmin and brute, as you may see in the above-mentioned post and also here. However, I can't get my head around the right syntax for those functions. I would appreciate it if you could help me find a canonical and performant implementation of the presumed linearFit function. Thanks in advance for your support.
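As a rough illustration of the scipy.optimize.brute call syntax asked about here, the sketch below evaluates aad on a grid of (a, b) values and then polishes the best grid point with fmin. The function name linear_fit, the search box a_range and b_range, and grid_points are assumptions made for this example, not part of the post. A grid search only finds the best point inside the box it is given, so it supports the "global minimum" requirement only to the extent that the box is wide enough.

```python
import numpy as np
from scipy.optimize import brute, fmin

def aad(X, Y, a, b):
    """Objective exactly as posted above (undefined at a == 0)."""
    U = (a * X + Y - b) / 2 / a
    V = (a * X + Y + b) / 2
    return np.sqrt((X - U) ** 2 + (Y - V) ** 2).sum() / X.size

def linear_fit(X, Y, a_range=(0.1, 5.0), b_range=(-5.0, 5.0), grid_points=40):
    """Grid search over (a, b), followed by a local polish with fmin.

    a_range, b_range and grid_points are illustrative assumptions; a_range
    deliberately excludes 0 because aad divides by a.
    """
    def objective(P, X, Y):
        # brute passes the parameter vector first, then the extra args.
        return aad(X, Y, P[0], P[1])

    a, b = brute(objective, ranges=(a_range, b_range), args=(X, Y),
                 Ns=grid_points, finish=fmin)
    return a, b, aad(X, Y, a, b)

# Same synthetic data as in the question.
np.random.seed(42)
x = np.random.rand(30) * 10
y = 1.5 * x + 0.3 + (np.random.rand(30) - 0.5) * 3.5

a_hat, b_hat, score = linear_fit(x, y)
print(a_hat, b_hat, score)
```

The key syntactic point is that the objective takes the parameters as a single vector in its first argument, while the data are passed through args.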
P.S. A temporary workaround, provided here:
from scipy.optimize import minimize
aad_ = lambda P: aad(P[0], P[1], x1, y1)
minimize(aad_, x0=[X0, Y0])
However, the results I'm getting are not that promising. The solver does not succeed, and I get the message:
Desired error not necessarily achieved due to precision loss
Recommended answer
First of all, thanks to this post I realized that this is not an ordinary least squares (OLS) regression, as was discussed in the comments above. It goes by many names, among them Deming regression, orthogonal distance regression (ODR), and total least squares (TLS). There is, of course, also a Python package for it: scipy.odr. Its syntax is a bit odd and the documentation is not much help, but a good tutorial can be found here.
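To make the total least squares idea concrete, here is a minimal sketch of the closed-form line fit through the singular value decomposition of the centered data. The helper name tls_line is hypothetical. Note the assumption behind it: this minimizes the sum of squared perpendicular distances, not the mean absolute distance discussed in the question, so its result need not match a fit based on absolute distances.

```python
import numpy as np

def tls_line(X, Y):
    """Closed-form total least squares line fit (hypothetical helper).

    Minimizes the sum of squared perpendicular distances to y = slope*x + intercept.
    Breaks down for a vertical line, where the slope is undefined.
    """
    x_mean, y_mean = X.mean(), Y.mean()
    centered = np.column_stack((X - x_mean, Y - y_mean))
    # The first right singular vector points along the direction of largest
    # variance, which is the direction of the best-fitting line.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    dx, dy = vt[0]
    slope = dy / dx
    # The fitted line always passes through the centroid of the data.
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Noise-free points on y = 2x + 1, used only as a sanity check.
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = 2.0 * X + 1.0
print(tls_line(X, Y))
```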
Next, I found a small bug in the aad definition, and renamed and fixed it to:
def aaod(a, b, X, Y): # assumes X and Y are of the identical shape/size
    n = X.size # still highly unsafe! don't use it in real production
    U = (a * X + Y - b) / 2 / a
    V = (a * X + Y + b) / 2
    E = np.sqrt(np.power((X - U), 2) + np.power((Y - V), 2))
    return E.sum() / n
where aaod stands for average absolute orthogonal distance. The fitting function can now be defined as:
from scipy.optimize import minimize
from scipy.stats import linregress
def odrFit(X, Y):
    X0 = linregress(X, Y) # wait this is cheating!
    aaod_ = lambda P: aaod(P[0], P[1], X, Y)
    res = minimize(aaod_, x0=X0[:2], method = 'Nelder-Mead')
    res_list = res.x.tolist()
    res_list.append(aaod_(res_list))
    return res_list
This is not necessarily the most performant or canonical implementation. I learned the workaround with the temporary lambda function from here, and the method = 'Nelder-Mead' option from here. The scipy.odr implementation can also be written as:
from scipy.odr import Model, ODR, RealData
def f(B, x):
    return B[0]*x + B[1]
linear = Model(f)
mydata = RealData(x, y)
myodr = ODR(mydata, linear, beta0=[1., 2.])
myoutput = myodr.run()
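Since the scipy.odr syntax is described above as a bit odd, the following self-contained sketch shows one way to read the fit result. It regenerates the question's data so that it runs on its own. The attributes beta, sd_beta and stopreason belong to the Output object returned by run(); which of them to print is just a choice made for this example.

```python
import numpy as np
from scipy.odr import Model, ODR, RealData

def f(B, x):
    # B holds the parameters: B[0] is the slope, B[1] the intercept.
    return B[0] * x + B[1]

# Same synthetic data as in the question.
np.random.seed(42)
x = np.random.rand(30) * 10
y = 1.5 * x + 0.3 + (np.random.rand(30) - 0.5) * 3.5

output = ODR(RealData(x, y), Model(f), beta0=[1.0, 2.0]).run()

slope, intercept = output.beta
print("slope, intercept:", slope, intercept)
print("standard errors:", output.sd_beta)
print("stop reason:", output.stopreason)
```

Calling output.pprint() prints a longer summary of the same fit.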
Now comparing the results of our custom-made odrFit() function, scipy.stats.linregress(), and scipy.odr:
slope, intercept, r_value, p_value, std_err = linregress(x,y)
print(*odrFit(x, y))
# --> 1.4804181575739097, 0.47304584702448255, 0.6008218016339527
print(slope, intercept, aaod(slope, intercept, x, y))
# --> 1.434483032725671 0.5747705643012724 0.608021569291401
print(*myoutput.beta, aaod(*myoutput.beta, x, y))
# --> 1.5118079563432785 0.23562547897245803 0.6055838996104704
This shows that odrFit() reaches a lower aaod value than both scipy.stats.linregress(), which performs ordinary least squares, and scipy.odr, which minimizes squared orthogonal distances. That is not a reliable conclusion yet: it may be pure luck, and more tests are needed. The complete code can be found here.
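One possible shape for such further tests is sketched below: it repeats the comparison over several random data sets and records how much lower the Nelder-Mead result scores on aaod than the OLS fit it starts from. The helper compare, the number of seeds, and the use of np.random.default_rng are assumptions for this example. One caveat applies to the design itself: Nelder-Mead returns the best point it has evaluated, and the starting point is among those, so it cannot score worse on aaod than the OLS start. A lower aaod is therefore guaranteed by construction, and judging accuracy would need a separate criterion, such as how closely the true slope and intercept are recovered.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import linregress

def aaod(a, b, X, Y):
    """Objective exactly as defined in the answer above."""
    U = (a * X + Y - b) / 2 / a
    V = (a * X + Y + b) / 2
    return np.sqrt((X - U) ** 2 + (Y - V) ** 2).sum() / X.size

def compare(seed, n=30):
    """Return (aaod of the OLS fit, aaod after Nelder-Mead) for one random data set."""
    rng = np.random.default_rng(seed)
    x = rng.random(n) * 10
    y = 1.5 * x + 0.3 + (rng.random(n) - 0.5) * 3.5
    ols = linregress(x, y)
    start = [ols.slope, ols.intercept]
    res = minimize(lambda P: aaod(P[0], P[1], x, y), x0=start, method="Nelder-Mead")
    return aaod(ols.slope, ols.intercept, x, y), res.fun

results = [compare(seed) for seed in range(20)]
gaps = np.array([ols_score - nm_score for ols_score, nm_score in results])
print("mean aaod improvement over OLS:", gaps.mean())
print("smallest improvement:", gaps.min())
```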