Bayesian Linear Regression with PyMC3 and a large dataset - bracket nesting level exceeded maximum and slow performance

Problem description

I would like to use a Bayesian multivariate linear regression to estimate the strength of players in team sports (e.g. ice hockey, basketball or soccer). For that purpose, I create a matrix, X, containing the players as columns and the matches as rows. For each match the player entry is either 1 (player plays in the home team), -1 (player plays in the away team) or 0 (player does not take part in this game). The dependent variable Y is defined as the scoring differences for both teams in each match (Score_home_team - Score_away_team).
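As a rough illustration of the encoding described above, the following sketch builds X and Y with NumPy. The match records and player names are invented for the example (they are not from the question), and the column order is simply alphabetical.

```python
import numpy as np

# Hypothetical match records, invented for illustration:
# (home players, away players, home score, away score)
matches = [
    (["anna", "ben"], ["carl", "dave"], 3, 1),
    (["anna", "carl"], ["ben", "erin"], 2, 2),
    (["dave", "erin"], ["anna", "ben"], 0, 4),
]

# One column per player, in a fixed (alphabetical) order.
players = sorted({p for home, away, _, _ in matches for p in home + away})
col = {name: j for j, name in enumerate(players)}

X = np.zeros((len(matches), len(players)), dtype=int)
Y = np.zeros(len(matches), dtype=int)
for i, (home, away, home_score, away_score) in enumerate(matches):
    for p in home:
        X[i, col[p]] = 1    # player on the home team
    for p in away:
        X[i, col[p]] = -1   # player on the away team
    # players who did not take part keep the default 0
    Y[i] = home_score - away_score  # Score_home_team - Score_away_team
```

Each row of X then has as many non-zero entries as there were players in that match, so for a full season the matrix is mostly zeros.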

Thus, the number of parameters will be quite large for one season (e.g. X is defined by 300 rows x 450 columns; i.e. 450 player coefficients + y-intercept). When running the fit I came across a compilation error:

('Compilation failed (return status=1): /Users/me/.theano/compiledir_Darwin-17.7.0-x86_64-i386-64bit-i386-3.6.5-64/tmpdxxc2379/mod.cpp:27598:32: fatal error: bracket nesting level exceeded maximum of 256.

I tried to handle this error by setting:

theano.config.gcc.cxxflags = "-fbracket-depth=1024"

Now, the sampling is running. However, it is so slow that even if I take only 35 of 300 rows the sampling is not completed within 20 minutes.

Here is my basic code:

import pymc3 as pm
basic_model = pm.Model()

with basic_model:

    # Priors for beta coefficients - these are the coefficients of the players
    dict_betas = {}
    for col in X.columns:
        dict_betas[col] = pm.Normal(col, mu=0, sd=10)

    # Priors for unknown model parameters
    alpha = pm.Normal('alpha', mu=0, sd=10) # alpha is the y-intercept
    sigma = pm.HalfNormal('sigma', sd=1) # standard deviation of the observations

    # Expected value of outcome
    mu = alpha
    for col in X.columns:
        mu = mu + dict_betas[col] * X[col] # mu = alpha + beta_1 * Player_1 + beta_2 * Player_2 + ...

    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
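
For reference, the model that the code above specifies can be written out as follows. The indices i (match) and j (player) and the symbol p (number of players) are notation introduced here, not taken from the question; the prior scales are read off the `sd` arguments in the code.

```latex
\begin{aligned}
Y_i &\sim \mathcal{N}(\mu_i, \sigma^2), \qquad \mu_i = \alpha + \sum_{j=1}^{p} \beta_j X_{ij}, \\
\alpha &\sim \mathcal{N}(0, 10^2), \qquad \beta_j \sim \mathcal{N}(0, 10^2), \qquad \sigma \sim \text{HalfNormal}(1).
\end{aligned}
```

In matrix form the mean is \mu = \alpha \mathbf{1} + X\beta, a single matrix-vector product rather than p separate additions.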

The instantiation of the model runs within one minute for the large dataset. I do the sampling using:

with basic_model:

    # draw 500 posterior samples
    trace = pm.sample(500)

The sampling is completed for small sample sizes (e.g. 9 rows, 80 columns) within 7 minutes. However, the time is increasing substantially with increasing sample size.

Any suggestions on how I can get this Bayesian linear regression to run in a feasible amount of time? Is this kind of problem doable using PyMC3 (remember I came across a bracket nesting error)? I saw in a recent publication that this kind of analysis is doable in R (https://arxiv.org/pdf/1810.08032.pdf). Therefore, I guess it should also somehow work with Python 3.

Thanks for your help!

Recommended answer

Eliminating the for loops should improve performance and might also take care of the nesting issue you are reporting. Theano TensorVariables and the PyMC3 random variables that derive from them are already multidimensional and support linear algebra operations. Try changing your code to something along the lines of

beta = pm.Normal('beta', mu=0, sd=10, shape=X.shape[1])
...
mu = alpha + pm.math.dot(X, beta)
...

If you need to specify different prior values for mu and/or sd, those arguments accept anything that theano.tensor.as_tensor_variable() accepts, so you can pass a list or numpy array.

I highly recommend getting familiar with the theano.tensor and pymc3.math operations since sometimes you must use these to properly manipulate random variables, and in general it should lead to more efficient code.
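
To see that the dot-product form computes the same mean as the column-by-column loop, here is a small sketch using plain NumPy as a stand-in for Theano tensors. The random X, alpha and beta are placeholders at the sizes quoted in the question, not real data. In the PyMC3 model the corresponding call would be the `pm.math.dot(X, beta)` shown above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_matches, n_players = 300, 450  # sizes quoted in the question

# Placeholder data: entries in {-1, 0, 1}, as in the player matrix.
X = rng.integers(-1, 2, size=(n_matches, n_players))
alpha = 0.5                                    # placeholder intercept
beta = rng.normal(0.0, 10.0, size=n_players)   # placeholder coefficients

# Loop form, mirroring the question: one addition per player column.
mu_loop = np.full(n_matches, alpha)
for j in range(n_players):
    mu_loop = mu_loop + beta[j] * X[:, j]

# Vectorised form, mirroring the answer: a single matrix-vector product.
mu_dot = alpha + X @ beta

same = np.allclose(mu_loop, mu_dot)
print(same)
```

With symbolic tensors, each pass through the loop presumably adds another node to the expression graph, so 450 players yield a chain of 450 nested additions in the generated C++ code. The single dot product avoids building that chain, which is consistent with the answer's suggestion that it may also resolve the bracket nesting error.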
