Custom-made regression using average absolute deviation

Problem Description

Following this post, I now have serious doubts about whether R-squared or the F-test is a good indication of a good linear fit to some data with random noise. Hence, I want to develop a custom-made regression function, both to learn how it works and maybe to improve upon the existing tools.

Consider these randomly generated ndarrays x and y:

import numpy as np

np.random.seed(42)

x = np.random.rand(30) * 10                             # 30 points uniformly in [0, 10)
y = 1.5 * x + 0.3 + (np.random.rand(30) - 0.5) * 3.5    # linear trend plus uniform noise in [-1.75, 1.75)
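
Before fitting anything, a quick visual check of the data never hurts (my addition, assuming matplotlib is available):

import matplotlib.pyplot as plt

plt.scatter(x, y, label='noisy samples')
grid = np.linspace(0, 10, 100)
plt.plot(grid, 1.5 * grid + 0.3, color='k', label='generating line y = 1.5x + 0.3')
plt.legend()
plt.show()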

Now I can define the average/mean absolute deviation of any set of data points as:

def aad(X, Y, a, b): # assumes X and Y are of the identical shape/size
    n = X.size # highly unsafe!
    U = (a * X + Y - b) / 2 / a  # x-coordinates of the chosen points on the line
    V = (a * X + Y + b) / 2      # y-coordinates of the chosen points on the line
    E = np.sqrt(np.power((X - U), 2) + np.power((Y - V), 2))  # distances to those points
    return E.sum() / n

which, in my opinion, is the best way to quantify the fitness of a line y = a * x + b to a pair of data arrays. The function simply finds the closest point on the assumed line to each data point and then calculates the perpendicular distance between that point and the line.
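
As a quick sanity check (my own example, not from the original question), the true generating parameters should score noticeably better than an arbitrary bad guess:

print(aad(x, y, 1.5, 0.3))   # near the generating line -> small average distance
print(aad(x, y, -2.0, 7.0))  # a deliberately poor line -> much larger value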

Now I need to have a function, let's say:

linearFit(X, Y)

which, given identically shaped ndarrays X and Y, finds the a and b that make aad(X, Y, a, b) minimal. It is important that the result be an absolute (global) minimum, not just a local one.

Of course, in the spirit of SO's best practices, I have already tried the scipy.optimize functions fmin and brute, as you may see in the above-mentioned post and also here. However, I can't seem to get my head around the right syntax for those functions. I would appreciate it if you could help me find a canonical and performant implementation for the presumed linearFit function. Thanks in advance for your support.
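
For illustration, here is a minimal sketch (my own, with assumed parameter ranges) of how scipy.optimize.brute could be pointed at aad; brute scans a coarse grid for a global candidate and then, by default, polishes it with fmin:

from scipy.optimize import brute

obj = lambda P: aad(x, y, P[0], P[1])  # pack (a, b) into a single vector
# search slopes in (0.1, 5.0) and intercepts in (-5.0, 5.0); keeping a away
# from 0 avoids the division by a inside aad -- these ranges are assumptions
result = brute(obj, ranges=((0.1, 5.0), (-5.0, 5.0)), Ns=50)
print(result)  # estimated (slope, intercept)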

P.S. A temporary workaround, offered here:

from scipy.optimize import minimize

aad_ = lambda P: aad(P[0], P[1], x, y)  # pack the two parameters into one vector
minimize(aad_, x0=[X0, Y0])             # X0, Y0: some initial guess for the parameters

However, the results I'm getting are not that promising! The solver does not succeed, and I get the message:

Desired error not necessarily achieved due to precision loss

Recommended Answer

First of all, thanks to this post, I realized that this is not an ordinary least squares (OLS) regression, as was discussed in the comments above. It actually goes by many names, among them Deming regression, orthogonal distance regression (ODR), and total least squares (TLS). And, of course, there is a Python package, scipy.odr, for it as well! Its syntax is a bit weird and the documentation is not much help, but a good tutorial can be found here.
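
As a quick reference (my own addition, not from the original answer), the orthogonal distance that ODR minimizes has a closed form: the perpendicular distance from a point (x0, y0) to the line y = a*x + b is |a*x0 - y0 + b| / sqrt(a^2 + 1). A minimal helper based on it, handy as a cross-check later:

def perp_aad(a, b, X, Y):
    # exact perpendicular distances from (X, Y) to the line y = a*X + b,
    # via the standard point-to-line formula; averaged over all points
    return (np.abs(a * X - Y + b) / np.sqrt(a**2 + 1)).mean()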

Next, I found a small bug in the aad definition, so I renamed it and fixed it as:

def aaod(a, b, X, Y): # assumes X and Y are of the identical shape/size
    n = X.size # still highly unsafe! don't use it in real production
    U = (a * X + Y - b) / 2 / a  # x-coordinates of the chosen points on the line
    V = (a * X + Y + b) / 2      # y-coordinates of the chosen points on the line
    E = np.sqrt(np.power((X - U), 2) + np.power((Y - V), 2))  # distances to those points
    return E.sum() / n

standing for average absolute orthogonal distance. Now, defining our fitting function as:

from scipy.optimize import minimize
from scipy.stats import linregress

def odrFit(X, Y):
    X0 = linregress(X, Y) # use the OLS fit as the initial guess -- wait, this is cheating!
    aaod_ = lambda P: aaod(P[0], P[1], X, Y)  # pack (a, b) into one parameter vector
    res = minimize(aaod_, x0=X0[:2], method = 'Nelder-Mead')  # gradient-free simplex search
    res_list = res.x.tolist()
    res_list.append(aaod_(res_list))  # append the attained aaod value to the result
    return res_list
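
As a variation (my own sketch, not part of the answer), the linregress warm start can be dropped in favor of a neutral initial guess; Nelder-Mead may then need more iterations, and it is worth verifying that it does not settle into a local minimum:

res = minimize(lambda P: aaod(P[0], P[1], x, y), x0=[1.0, 0.0], method='Nelder-Mead')
print(res.x)  # estimated (slope, intercept)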

This odrFit is not necessarily the most performant or canonical implementation. I learned the workaround with the temporary lambda function from here, and the method = 'Nelder-Mead' choice from here. The scipy.odr implementation can also be done as:

from scipy.odr import Model, ODR, RealData

def f(B, x):
    return B[0]*x + B[1]  # linear model: B[0] is the slope, B[1] the intercept

linear = Model(f)
mydata = RealData(x, y)
myodr = ODR(mydata, linear, beta0=[1., 2.])  # beta0: initial guess for (slope, intercept)
myoutput = myodr.run()
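
To inspect the result, the standard attributes of scipy.odr's Output object can be used:

print(myoutput.beta)     # fitted (slope, intercept)
print(myoutput.sd_beta)  # standard errors of the parameters
myoutput.pprint()        # full summary report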

Now, comparing the results between our custom-made odrFit() function, scipy.stats.linregress(), and the scipy.odr fit:

slope, intercept, r_value, p_value, std_err = linregress(x,y)

print(*odrFit(x, y)) 
# --> 1.4804181575739097, 0.47304584702448255, 0.6008218016339527

print(slope, intercept, aaod(slope, intercept, x, y))
# --> 1.434483032725671 0.5747705643012724 0.608021569291401

print(*myoutput.beta, aaod(*myoutput.beta, x, y))
# --> 1.5118079563432785 0.23562547897245803 0.6055838996104704

which shows that our function attains a smaller average absolute orthogonal distance than both scipy.stats.linregress() and the scipy.odr fit. This is not entirely surprising, since odrFit minimizes that very metric directly, while scipy.odr minimizes squared orthogonal distances; it could also just be luck, and more tests need to be done before drawing a reliable conclusion. The complete code can be found here.
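
As one way to run such tests (a sketch of my own, not from the original post), the experiment can be repeated over several seeds and the attained aaod values compared:

for seed in range(5):
    np.random.seed(seed)
    xs = np.random.rand(30) * 10
    ys = 1.5 * xs + 0.3 + (np.random.rand(30) - 0.5) * 3.5
    a_c, b_c, e_c = odrFit(xs, ys)          # custom fit and its attained aaod
    s, i, *_ = linregress(xs, ys)           # OLS fit for comparison
    print(seed, e_c, aaod(s, i, xs, ys))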
