How to use scale and shape parameters of gamma GLM in statsmodels


Question


    The task

    I have data that looks like this:

    I want to fit a generalized linear model (glm) to this from a gamma family using statsmodels. Using this model, for each of my observations I want to calculate the probability of observing a value that is smaller than (or equal to) that value. In other words I want to calculate:

    P(y <= y_i | x_i)

    My questions

    • How do I get the shape and scale parameters from the fitted glm in statsmodels? According to this question the scale parameter in statsmodels is not parameterized in the normal way. Can I use it directly as input to a gamma distribution in scipy? Or do I need a transformation first?

    • How do I use these parameters (shape and scale) to get the probabilities? Currently I'm using scipy to generate a distribution for each x_i and get the probability from that. See implementation below.

    My current implementation

    import numpy as np
    import scipy.stats as stat
    import patsy
    import statsmodels.api as sm
    
    # Generate data in correct form
    y, X = patsy.dmatrices('y ~ x', data=myData, return_type='dataframe')
    
    # Fit model with gamma family and log link
    mod = sm.GLM(y, X, family=sm.families.Gamma(sm.families.links.log())).fit()
    
    # Predict mean
    myData['mu'] = mod.predict(exog=X) 
    
    # Predict probabilities (note that for a gamma distribution mean = shape * scale)
    probabilities = np.array(
        [stat.gamma(m_i/mod.scale, scale=mod.scale).cdf(y_i) for m_i, y_i in zip(myData['mu'], myData['y'])]
    )
    
    

    However, when I perform this procedure I get the following result:

    Currently the predicted probabilities all seem really high. The red line in the graph is the predicted mean. But even for points below this line the predicted cumulative probability is around 80%. This makes me wonder whether the scale parameter I used is indeed the correct one.

    Solution

    In R, you can obtain an estimate of the shape as 1/dispersion (check this post). Unfortunately, statsmodels names its dispersion estimate scale, so you need to take the reciprocal of mod.scale to get the shape estimate. I show it with an example below:

    values = gamma.rvs(2,scale=5,size=500)
    fit = sm.GLM(values, np.repeat(1,500), family=sm.families.Gamma(sm.families.links.log())).fit()
    

    This is an intercept only model, and we check the intercept and dispersion (named scale):

    [fit.params,fit.scale]
    [array([2.27875973]), 0.563667465203953]
    

    So the mean is exp(2.27876) ≈ 9.76, close to the simulated mean of shape * scale = 2 * 5 = 10, and if we use shape as 1/dispersion, shape = 1/0.563667465203953 = 1.774096, which is close to the shape of 2 that we simulated.
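
The reciprocal relationship follows from matching the moments of the gamma distribution to the variance of a gamma GLM. As a sketch, write k for the shape, θ_i for the per-observation scale, μ_i for the fitted mean and φ for the dispersion (the value statsmodels reports as scale). These symbols are introduced here for illustration, and the derivation assumes the standard gamma GLM variance Var[Y_i] = φ μ_i²:

```latex
\begin{aligned}
\mathbb{E}[Y_i] &= k\,\theta_i = \mu_i, \\
\operatorname{Var}[Y_i] &= k\,\theta_i^{2} = \frac{\mu_i^{2}}{k} = \phi\,\mu_i^{2}
  \;\Longrightarrow\; k = \frac{1}{\phi}, \quad \theta_i = \frac{\mu_i}{k} = \phi\,\mu_i, \\
P(y \le y_i \mid x_i) &= F_{\mathrm{Gamma}}\!\left(y_i;\; k = \tfrac{1}{\phi},\; \theta_i = \phi\,\mu_i\right).
\end{aligned}
```

By contrast, the code in the question passes mod.scale as the gamma scale and mu/mod.scale as the shape. That keeps the mean at μ_i but implies a variance of φ μ_i rather than φ μ_i², so the resulting probabilities come from a different distribution than the one the model fitted.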

    If I use a simulated dataset, it works perfectly fine. This is what it looks like, with a shape of 10:

    from scipy.stats import gamma
    import numpy as np
    import matplotlib.pyplot as plt
    import patsy
    import statsmodels.api as sm
    import pandas as pd
    
    _shape = 10
    myData = pd.DataFrame({'x':np.random.uniform(0,10,size=500)})
    myData['y'] = gamma.rvs(_shape,scale=np.exp(-myData['x']/3 + 0.5)/_shape,size=500)
    
    myData.plot("x","y",kind="scatter")
    

    Then we fit the model like you did:

    y, X = patsy.dmatrices('y ~ x', data=myData, return_type='dataframe')
    mod = sm.GLM(y, X, family=sm.families.Gamma(sm.families.links.log())).fit()
    mu = mod.predict(exog=X) 
    
    shape_from_model = 1/mod.scale
    
    probabilities = [gamma(shape_from_model, scale=m_i/shape_from_model).cdf(y_i) for m_i, y_i in zip(mu,myData['y'])]
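
The list comprehension above builds one frozen distribution per observation. Since scipy.stats.gamma.cdf broadcasts over array arguments, the same probabilities can be computed in one call. The following is a minimal sketch, not part of the original answer: the helper name gamma_glm_cdf and the example values are made up for illustration, and in practice mu would come from mod.predict and dispersion from mod.scale.

```python
import numpy as np
from scipy.stats import gamma


def gamma_glm_cdf(y, mu, dispersion):
    """P(Y <= y) for a gamma response with mean mu and GLM dispersion.

    Hypothetical helper: shape = 1 / dispersion, scale = mu / shape.
    """
    shape = 1.0 / dispersion
    scale = np.asarray(mu, dtype=float) / shape  # so that shape * scale == mu
    return gamma.cdf(np.asarray(y, dtype=float), a=shape, scale=scale)


# Illustrative values only (not taken from the data above)
mu = np.array([1.0, 2.0, 4.0])
y = np.array([0.5, 2.0, 8.0])
probs = gamma_glm_cdf(y, mu, dispersion=0.1)
print(probs)
```

With the fitted model this would be called as gamma_glm_cdf(myData['y'], mu, mod.scale), which should match the loop-based probabilities.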
    

    And plot:

    fig, ax = plt.subplots()
    im = ax.scatter(myData["x"],myData["y"],c=probabilities)
    im = ax.scatter(myData['x'],mu,c="r",s=1)
    fig.colorbar(im, ax=ax)
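
One way to check the probabilities beyond eyeballing the plot: if the model and parameterization are right, the values P(y <= y_i | x_i) should be roughly uniform on [0, 1]. The sketch below is an addition, not part of the original answer. To stay self-contained it uses a known mean and dispersion as stand-ins for mod.predict(exog=X) and mod.scale, and it compares the parameterization from the answer with the one from the question using a Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy.stats import gamma, kstest

rng = np.random.default_rng(0)

# Simulated stand-ins for the fitted quantities (assumptions for illustration):
# mu plays the role of mod.predict(exog=X), dispersion the role of mod.scale.
x = rng.uniform(0, 10, size=2000)
mu = np.exp(-x / 3 + 0.5)
dispersion = 0.1  # corresponds to shape = 10
shape = 1 / dispersion
y = gamma.rvs(shape, scale=mu / shape, size=x.size, random_state=rng)

# Parameterization from the answer: shape = 1/dispersion, scale = mu/shape
p_answer = gamma.cdf(y, a=shape, scale=mu / shape)

# Parameterization from the question: dispersion used directly as the gamma scale
p_question = gamma.cdf(y, a=mu / dispersion, scale=dispersion)

# Compare each set of CDF values against the uniform distribution on [0, 1]
ks_answer = kstest(p_answer, "uniform")
ks_question = kstest(p_question, "uniform")
print(f"answer:   mean={p_answer.mean():.3f}, KS statistic={ks_answer.statistic:.3f}")
print(f"question: mean={p_question.mean():.3f}, KS statistic={ks_question.statistic:.3f}")
```

A small KS statistic means the probabilities are close to uniform. On real data the same check can be run on the probabilities list computed from the fitted model, bearing in mind that a poor fit can also come from the model itself and not only from the parameterization.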
    
