pymc3中的多元线性回归 [英] Multivariate linear regression in pymc3

查看:39
本文介绍了pymc3中的多元线性回归的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在专门使用 emcee 多年后,我最近开始学习 pymc3,但我遇到了一些概念上的问题.

I've recently started learning pymc3 after exclusively using emcee for ages and I'm running into some conceptual problems.

我正在练习 Hogg 将模型拟合到数据的第 7 章.这涉及到具有任意二维不确定性的直线的 mcmc 拟合.我在 emcee 中很容易地完成了这件事,但是 pymc 给了我一些问题.

I'm practising with Chapter 7 of Hogg's Fitting a model to data. This involves a mcmc fit to a straight line with arbitrary 2d uncertainties. I've accomplished this quite easily in emcee, but pymc is giving me some problems.

它本质上归结为使用多元高斯似然.

It essentially boils down to using a multivariate gaussian likelihood.

这是我目前所拥有的.

from pymc3 import  *

import numpy as np
import matplotlib.pyplot as plt

size = 200
true_intercept = 1
true_slope = 2

true_x = np.linspace(0, 1, size)
# y = a + b*x
true_regression_line = true_intercept + true_slope * true_x
# add noise

# here the errors are all the same but the real world they are usually not!
std_y, std_x = 0.1, 0.1 
y = true_regression_line + np.random.normal(scale=std_y, size=size)
x = true_x + np.random.normal(scale=std_x, size=size)

y_err = np.ones_like(y) * std_y
x_err = np.ones_like(x) * std_x

data = dict(x=x, y=y)

with Model() as model: # model specifications in PyMC3 are wrapped in a with-statement
    # Define priors
    intercept = Normal('Intercept', 0, sd=20)
    gradient = Normal('gradient', 0, sd=20)


    # Define likelihood
    likelihood = MvNormal('y', mu=intercept + gradient * x,
                        tau=1./(np.stack((y_err, x_err))**2.), observed=y)

    # start the mcmc!
    start = find_MAP() # Find starting value by optimization
    step = NUTS(scaling=start) # Instantiate MCMC sampling algorithm
    trace = sample(2000, step, start=start, progressbar=False) # draw 2000 posterior samples using NUTS sampling

这会引发错误:LinAlgError:数组的最后 2 个维度必须是正方形

所以我试图通过 MvNormal x 和 y 的测量值(mus)及其相关的测量不确定度(y_errx_err).但它似乎不喜欢 2d tau 参数.

So I'm trying to pass MvNormal the measured values for x and y (mus) and their associated measurement uncertainties (y_err and x_err). But it appears that it is not liking the 2d tau argument.

有什么想法吗?这一定是可能的

Any ideas? This must be possible

谢谢

推荐答案

您可以尝试调整以下模型.是常规"线性回归.但是 xy 已经被高斯分布取代了.在这里,我不仅假设输入和输出变量的测量值,而且还假设其误差的可靠估计(例如由测量设备提供).如果您不相信这些误差值,您可以尝试从数据中估计它们.

You may try by adapting the following model. Is a "regular" linear regression. But x and y have been replaced by Gaussian distributions. Here I am assuming not only the measured values of the input and output variables but also a reliable estimation of the their error (for example as provided by a measurement device). If you do not trust those error values you may instead try to estimate them from the data.

with pm.Model() as model:
    intercept = pm.Normal('intercept', 0, sd=20)
    gradient = pm.Normal('gradient', 0, sd=20)
    epsilon = pm.HalfCauchy('epsilon', 5)
    obs_x = pm.Normal('obs_x', mu=x, sd=x_err, shape=len(x))
    obs_y = pm.Normal('obs_y', mu=y, sd=y_err, shape=len(y))

    likelihood = pm.Normal('y', mu=intercept + gradient * obs_x,
                    sd=epsilon, observed=obs_y)

    trace = pm.sample(2000)

如果您从数据中估计误差,可以合理地假设它们是相关的,因此,您可以使用多元高斯,而不是使用两个单独的高斯.在这种情况下,您将得到如下所示的模型:

If you are estimating the error from the data it could be reasonable to assume they could be correlated and hence, instead of using two separate Gaussian you can use a Multivariate Gaussian. In such a case you will end up with a model like the following:

df_data = pd.DataFrame(data)
cov = df_data.cov()

with pm.Model() as model:
    intercept = pm.Normal('intercept', 0, sd=20)
    gradient = pm.Normal('gradient', 0, sd=20)
    epsilon = pm.HalfCauchy('epsilon', 5)

    obs_xy = pm.MvNormal('obs_xy', mu=df_data, tau=pm.matrix_inverse(cov), shape=df_data.shape)

    yl = pm.Normal('yl', mu=intercept + gradient * obs_xy[:,0],
                    sd=epsilon, observed=obs_xy[:,1])

mu, sds, elbo = pm.variational.advi(n=20000)
step =  pm.NUTS(scaling=model.dict_to_array(sds), is_cov=True)
trace = pm.sample(1000, step=step, start=mu)

请注意,在之前的模型中,协方差矩阵是根据数据计算出来的.如果您打算这样做,那么我认为使用第一个模型会更好,但是如果您要估计协方差矩阵,那么第二个模型可能是一种明智的方法.

Notice that in the previous model the covariance matrix was computed from the data. If you are going to do that then I think is better to go with the first model, but if instead you are going to estimate the covariance matrix then the second model could be a sensible approach.

对于第二个模型,我使用 ADVI 对其进行初始化.ADVI 是初始化模型的好方法,通常比 find_MAP() 效果更好.

For the second model I use ADVI to initialize it. ADVI can be a good way to initialize models, often it works much better than find_MAP().

您可能还想查看 David Hogg 的这个存储库.还有Statistical Rethinking 这本书,其中 McElreath 讨论了进行线性回归的问题,包括输入中的错误和输出变量.

You may also want to check this repository by David Hogg. And the book Statistical Rethinking where McElreath discuss the problem of doing linear regression including the errors in the input and output variables.

这篇关于pymc3中的多元线性回归的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆