pymc3中的多元线性回归 [英] Multivariate linear regression in pymc3
问题描述
在专门使用 emcee
多年后,我最近开始学习 pymc3
,但我遇到了一些概念上的问题.
I've recently started learning pymc3
after exclusively using emcee
for ages and I'm running into some conceptual problems.
我正在练习 Hogg 将模型拟合到数据的第 7 章.这涉及到具有任意二维不确定性的直线的 mcmc 拟合.我在 emcee
中很容易地完成了这件事,但是 pymc
给了我一些问题.
I'm practising with Chapter 7 of Hogg's Fitting a model to data. This involves a mcmc fit to a straight line with arbitrary 2d uncertainties. I've accomplished this quite easily in emcee
, but pymc
is giving me some problems.
它本质上归结为使用多元高斯似然.
It essentially boils down to using a multivariate gaussian likelihood.
这是我目前所拥有的.
from pymc3 import *
import numpy as np
import matplotlib.pyplot as plt
size = 200
true_intercept = 1
true_slope = 2
true_x = np.linspace(0, 1, size)
# y = a + b*x
true_regression_line = true_intercept + true_slope * true_x
# add noise
# here the errors are all the same but the real world they are usually not!
std_y, std_x = 0.1, 0.1
y = true_regression_line + np.random.normal(scale=std_y, size=size)
x = true_x + np.random.normal(scale=std_x, size=size)
y_err = np.ones_like(y) * std_y
x_err = np.ones_like(x) * std_x
data = dict(x=x, y=y)
with Model() as model: # model specifications in PyMC3 are wrapped in a with-statement
# Define priors
intercept = Normal('Intercept', 0, sd=20)
gradient = Normal('gradient', 0, sd=20)
# Define likelihood
likelihood = MvNormal('y', mu=intercept + gradient * x,
tau=1./(np.stack((y_err, x_err))**2.), observed=y)
# start the mcmc!
start = find_MAP() # Find starting value by optimization
step = NUTS(scaling=start) # Instantiate MCMC sampling algorithm
trace = sample(2000, step, start=start, progressbar=False) # draw 2000 posterior samples using NUTS sampling
这会引发错误:LinAlgError:数组的最后 2 个维度必须是正方形
所以我试图通过 MvNormal
x 和 y 的测量值(mu
s)及其相关的测量不确定度(y_err
和 x_err
).但它似乎不喜欢 2d tau
参数.
So I'm trying to pass MvNormal
the measured values for x and y (mu
s) and their associated measurement uncertainties (y_err
and x_err
). But it appears that it is not liking the 2d tau
argument.
有什么想法吗?这一定是可能的
Any ideas? This must be possible
谢谢
推荐答案
您可以尝试调整以下模型.是常规"线性回归.但是 x
和 y
已经被高斯分布取代了.在这里,我不仅假设输入和输出变量的测量值,而且还假设其误差的可靠估计(例如由测量设备提供).如果您不相信这些误差值,您可以尝试从数据中估计它们.
You may try by adapting the following model. Is a "regular" linear regression. But x
and y
have been replaced by Gaussian distributions. Here I am assuming not only the measured values of the input and output variables but also a reliable estimation of the their error (for example as provided by a measurement device). If you do not trust those error values you may instead try to estimate them from the data.
with pm.Model() as model:
intercept = pm.Normal('intercept', 0, sd=20)
gradient = pm.Normal('gradient', 0, sd=20)
epsilon = pm.HalfCauchy('epsilon', 5)
obs_x = pm.Normal('obs_x', mu=x, sd=x_err, shape=len(x))
obs_y = pm.Normal('obs_y', mu=y, sd=y_err, shape=len(y))
likelihood = pm.Normal('y', mu=intercept + gradient * obs_x,
sd=epsilon, observed=obs_y)
trace = pm.sample(2000)
如果您从数据中估计误差,可以合理地假设它们是相关的,因此,您可以使用多元高斯,而不是使用两个单独的高斯.在这种情况下,您将得到如下所示的模型:
If you are estimating the error from the data it could be reasonable to assume they could be correlated and hence, instead of using two separate Gaussian you can use a Multivariate Gaussian. In such a case you will end up with a model like the following:
df_data = pd.DataFrame(data)
cov = df_data.cov()
with pm.Model() as model:
intercept = pm.Normal('intercept', 0, sd=20)
gradient = pm.Normal('gradient', 0, sd=20)
epsilon = pm.HalfCauchy('epsilon', 5)
obs_xy = pm.MvNormal('obs_xy', mu=df_data, tau=pm.matrix_inverse(cov), shape=df_data.shape)
yl = pm.Normal('yl', mu=intercept + gradient * obs_xy[:,0],
sd=epsilon, observed=obs_xy[:,1])
mu, sds, elbo = pm.variational.advi(n=20000)
step = pm.NUTS(scaling=model.dict_to_array(sds), is_cov=True)
trace = pm.sample(1000, step=step, start=mu)
请注意,在之前的模型中,协方差矩阵是根据数据计算出来的.如果您打算这样做,那么我认为使用第一个模型会更好,但是如果您要估计协方差矩阵,那么第二个模型可能是一种明智的方法.
Notice that in the previous model the covariance matrix was computed from the data. If you are going to do that then I think is better to go with the first model, but if instead you are going to estimate the covariance matrix then the second model could be a sensible approach.
对于第二个模型,我使用 ADVI 对其进行初始化.ADVI 是初始化模型的好方法,通常比 find_MAP() 效果更好.
For the second model I use ADVI to initialize it. ADVI can be a good way to initialize models, often it works much better than find_MAP().
您可能还想查看 David Hogg 的这个存储库.还有Statistical Rethinking 这本书,其中 McElreath 讨论了进行线性回归的问题,包括输入中的错误和输出变量.
You may also want to check this repository by David Hogg. And the book Statistical Rethinking where McElreath discuss the problem of doing linear regression including the errors in the input and output variables.
这篇关于pymc3中的多元线性回归的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!