在PyMC3中运行多元有序logit [英] Running a multivariate ordered logit in PyMC3

查看:92
本文介绍了在PyMC3中运行多元有序logit的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用PyMC3建立贝叶斯多元有序logit模型.我已经基于这本书我还基于此内容底部的示例运行了有序逻辑回归模型页.

I'm trying to build a Bayesian multivariate ordered logit model using PyMC3. I have gotten a toy multivariate logit model working based on the examples in this book. I've also gotten an ordered logistic regression model running based on the example at the bottom of this page.

但是,我无法运行有序的多元逻辑回归.我认为问题可能是指定切点的方式,特别是shape参数,但是我不确定为什么有多个自变量而不是只有一个自变量,因为响应类别的数量还没有改变了.

However, I cannot get an ordered, multivariate logistic regression to run. I think the issue could be the way the cutpoints are specified, specifically the shape parameter, but I'm not sure why it would be different if there are multiple independent variables than if there were just one, since the number of response categories has not changed.

这是我的代码:

MWE的数据准备:

import pymc3 as pm
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(return_X_y=False)
iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                     columns=iris['feature_names'] + ['target'])

iris = iris.rename(index=str, columns={'sepal length (cm)': 'sepal_length', 'sepal width (cm)': 'sepal_width', 'target': 'species'})

这是一个有效的多元(二进制)logit:

Here is a working multivariate (binary) logit:

df = iris.loc[iris['species'].isin([0, 1])]
y = pd.Categorical(df['species']).codes
x = df[['sepal_length', 'sepal_width']].values

with pm.Model() as model_1:
      alpha = pm.Normal('alpha', mu=0, sd=10)
      beta = pm.Normal('beta', mu=0, sd=2, shape=x.shape[1])
      mu = alpha + pm.math.dot(x, beta)
      theta = 1 / (1 + pm.math.exp(-mu))
      y_ = pm.Bernoulli('yl', p=theta, observed=y)
      trace_1 = pm.sample(5000)

这是一个工作有序的logit(具有一个自变量):

Here is a working ordered logit (with one independent variable):

x = iris['sepal_length'].values
y = pd.Categorical(iris['species']).codes

with pm.Model() as model:
    cutpoints = pm.Normal("cutpoints", mu=[-2,2], sd=10, shape=2,
                          transform=pm.distributions.transforms.ordered)

    y_ = pm.OrderedLogistic("y", cutpoints=cutpoints, eta=x, observed=y)
    tr = pm.sample(1000)

这是我尝试进行多元有序logit的尝试,但失败了:

Here is my attempt at a multivariate ordered logit, which breaks:

x = iris[['sepal_length', 'sepal_width']].values
y = pd.Categorical(iris['species']).codes

with pm.Model() as model:
    cutpoints = pm.Normal("cutpoints", mu=[-2,2], sd=10, shape=2,
                          transform=pm.distributions.transforms.ordered)

    y_ = pm.OrderedLogistic("y", cutpoints=cutpoints, eta=x, observed=y)
    tr = pm.sample(1000)

我得到的错误是:"ValueError:除串联轴外,所有输入数组维都必须完全匹配."

The error I get is: "ValueError: all the input array dimensions except for the concatenation axis must match exactly."

这表明这是一个数据问题(x,y),但是数据看起来与多变量logit的数据相同,这是可行的.

This suggests it's a data problem (x, y), but the data looks the same as it does for the multivariate logit, which works.

如何修复有序的多变量logit,使其能够运行?

How can I fix the ordered multivariate logit so it will run?

推荐答案

我以前从未做过多元序数回归,但是似乎必须以两种方式解决建模问题:

I've never done multivariate ordinal regression before, but it seems like one must approach the modeling problem in either two ways:

  1. 在预测变量空间中进行分区,在这种情况下,您将需要剪切线/曲线而不是点.
  2. 在转换后的空间中进行分区,在该空间中,您已将预测变量空间投影到标量值,并且可以再次使用切点.

如果要使用pm.OrderedLogistic,则似乎必须使用后者,因为它似乎不支持开箱即用的多变量eta情况.

If you want to use the pm.OrderedLogistic it seems like you have to do the latter, since it doesn't appear to support a multivariate eta case out of the box.

这是我的目标,但是再次,我不确定这是标准方法.

Here's my stab at it, but again, I'm not sure this is a standard approach.

import numpy as np
import pymc3 as pm
import pandas as pd
import theano.tensor as tt
from sklearn.datasets import load_iris

# Load data
iris = load_iris(return_X_y=False)
iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                     columns=iris['feature_names'] + ['target'])
iris = iris.rename(index=str, columns={
    'sepal length (cm)': 'sepal_length', 
    'sepal width (cm)': 'sepal_width', 
    'target': 'species'})

# Prep input data
Y = pd.Categorical(iris['species']).codes
X = iris[['sepal_length', 'sepal_width']].values

# augment X for simpler regression expression
X_aug = tt.concatenate((np.ones((X.shape[0], 1)), X), axis=1)

# Model with sampling
with pm.Model() as ordered_mvlogit:
    # regression coefficients
    beta = pm.Normal('beta', mu=0, sd=2, shape=X.shape[1] + 1)

    # transformed space (univariate real)
    eta = X_aug.dot(beta)

    # points for separating categories
    cutpoints = pm.Normal("cutpoints", mu=np.array([-1,1]), sd=1, shape=2,
                          transform=pm.distributions.transforms.ordered)

    y_ = pm.OrderedLogistic("y", cutpoints=cutpoints, eta=eta, observed=Y)

    trace_mvordlogit = pm.sample(5000)

这似乎收敛得很好,并产生了不错的时间间隔

This seems to converge fine and yields decent intervals

如果您随后将betacutpoint均值运回到预测变量空间,则会得到以下划分,这似乎是合理的.但是,隔片的长度和宽度并不是分割的最佳选择.

If you then work the beta and cutpoint mean values back to predictor space, you get the following partitioning, which appears reasonable. However, sepal length and width aren't really the best for partitioning.

# Extract mean parameter values
b0, b1, b2 = trace_mvordlogit.get_values(varname='beta').mean(axis=0)
cut1, cut2 = trace_mvordlogit.get_values(varname='cutpoints').mean(axis=0)

# plotting parameters
x_min, y_min = X.min(axis=0)
x_max, y_max = X.max(axis=0)

buffer = 0.2
num_points = 37

# compute grid values
x = np.linspace(x_min - buffer, x_max + buffer, num_points)
y = np.linspace(y_min - buffer, y_max + buffer, num_points)

X_plt, Y_plt = np.meshgrid(x, y)
Z_plt = b0 + b1*X_plt + b2*Y_plt

# contour + scatter plots
plt.figure(figsize=(8,8))
plt.contourf(X_plt,Y_plt,Z_plt, levels=[-80, cut1, cut2, 50])
plt.scatter(iris.sepal_length, iris.sepal_width, c=iris.species)
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.show()

您可以轻松地在模型中扩展eta,以包括交互作用和高阶项,以便最终的分类器切口可以是曲线而不是简单的直线.例如,这是二阶模型.

You could easily extend eta in the model to include interactions and higher-order terms, so that the final classifier cuts can be curves instead of simple lines. For example, here is the second order model.

from sklearn.preprocessing import scale

Y = pd.Categorical(iris['species']).codes

# scale X for better sampling
X = scale(iris[['sepal_length', 'sepal_width']].values)

# augment with intercept and second-order terms
X_aug = tt.concatenate((
    np.ones((X.shape[0], 1)), 
    X,
    (X[:,0]*X[:,0]).reshape((-1,1)),
    (X[:,1]*X[:,1]).reshape((-1,1)),
    (X[:,0]*X[:,1]).reshape((-1,1))), axis=1)

with pm.Model() as ordered_mvlogit_second:
    beta = pm.Normal('beta', mu=0, sd=2, shape=6)

    eta = X_aug.dot(beta)

    cutpoints = pm.Normal("cutpoints", mu=np.array([-1,1]), sd=1, shape=2,
                          transform=pm.distributions.transforms.ordered)

    y_ = pm.OrderedLogistic("y", cutpoints=cutpoints, eta=eta, observed=Y)

    trace_mvordlogit_second = pm.sample(tune=1000, draws=5000, chains=4, cores=4)

这很好地进行了采样,所有系数的HPD均为非零

This samples nicely and all the coefficients have non-zero HPDs

并且如上所述,您可以生成分类区域的图

And as above you can generate a plot of the classification regions

这篇关于在PyMC3中运行多元有序logit的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆