Evaluate slope and error for a specific category in a statsmodels OLS fit

Question

I have a dataframe df with the following fields: weight, length, and animal. The first 2 are continuous variables, while animal is a categorical variable with the values cat, dog, and snake.

I'd like to estimate a relationship between weight and length, but this needs to be conditioned on the type of animal, so I interact the length variable with the animal categorical variable.

model = ols(formula='weight ~ length * animal', data=df)
results = model.fit()

How can I programmatically extract the slope of the relationship between weight and length for e.g. snakes? I understand how to do this manually: add the coefficient for length to the coefficient for animal[T.snake]:length. But this is somewhat cumbersome and manual, and requires me to handle the base case specially, so I'd like to extract this information automatically.

Furthermore, I'd like to estimate the error on this slope. I believe I understand how to calculate this by combining the standard errors and covariances (more precisely, performing the calculation here). But this is even more cumbersome than the above, and I'm similarly wondering if there's a shortcut to extract this information.
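
Written out, the manual calculation looks as follows. The symbols are shorthand introduced here for illustration: b_L stands for the coefficient on length and b_SL for the coefficient on animal[T.snake]:length, assuming treatment coding with cat as the base level.

```latex
\begin{aligned}
\hat{s}_{\text{snake}} &= \hat{b}_{L} + \hat{b}_{SL} \\
\operatorname{Var}(\hat{s}_{\text{snake}}) &= \operatorname{Var}(\hat{b}_{L}) + \operatorname{Var}(\hat{b}_{SL}) + 2\,\operatorname{Cov}(\hat{b}_{L}, \hat{b}_{SL}) \\
\operatorname{se}(\hat{s}_{\text{snake}}) &= \sqrt{\operatorname{Var}(\hat{s}_{\text{snake}})}
\end{aligned}
```

For the base level (cat under this assumption) the slope is just the length coefficient, so its standard error is the standard error of that single parameter.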

My manual method to calculate these follows.

EDIT (06/22/2015): there seems to be an error in my original code below for calculating errors. The standard errors as calculated in user333700's answer are different from the ones I calculate, but I haven't invested any time in figuring out why.

def get_contained_animal(animals, p):
    # This relies on parameters of the form animal[T.snake]:length.
    for a in animals:
        if a in p:
            return a
    return None

animals = ['cat', 'dog', 'snake']
slopes = {}
errors = {}
for animal in animals:
    slope = 0.
    params = []
    # If this param is related to the length variable and
    # the animal in question, add it to the slope.
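    for param, val in results.params.items():
        ac = get_contained_animal(animals, param)
        # Keep 'length' itself plus interactions of length with this animal;
        # the parentheses keep animal-only terms out of the slope.
        if 'length' in param and (ac is None or ac == animal):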
            params.append(param)
            slope += val

    # Variance of a sum of parameters: the sum of their variances plus
    # twice the covariance of each distinct pair.
    cov = results.cov_params()
    tot_err = 0.
    for i, p1 in enumerate(params):
        tot_err += results.bse[p1]**2
        for p2 in params[i + 1:]:
            # i + 1 skips p1 itself, whose variance is already counted
            tot_err += 2*cov.loc[p1, p2]

    slopes[animal] = slope
    errors[animal] = tot_err**0.5

This code might seem like overkill, but in my real-world use case I have a continuous variable interacting with two separate categorical variables, each with a large number of categories (along with other terms in the model that I need to ignore for these purposes).

Recommended answer

Very brief background:

The general question here is how the prediction changes if we change one of the explanatory variables, holding the other explanatory variables fixed or averaging over them.

In the nonlinear discrete models, there is a special Margins method that calculates this, although it is not implemented for changes in categorical variables.

In the linear model, the prediction and change in prediction is just a linear function of the estimated parameters, and we can (mis)use t_test to calculate the effect, its standard error and confidence interval for us.
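
To make the linear-combination idea concrete, here is a minimal NumPy-only sketch on made-up data with two groups instead of three animals. The data, the seed, and all variable names are illustrative assumptions and not part of the example in this answer. It fits the interaction model by least squares, then computes the slope for the non-base group as c'β and its standard error as sqrt(c'Σc), which is the quantity a t-test on a linear combination of parameters reports.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
group = rng.integers(0, 2, size=n)   # 0 = base level, 1 = second level
x = rng.normal(size=n)
y = 1.0 + 0.5 * group + (1.0 + 0.7 * group) * x + rng.normal(scale=0.3, size=n)

# Design matrix columns: intercept, group dummy, x, group:x interaction.
X = np.column_stack([np.ones(n), group, x, group * x])

# OLS estimates and their covariance matrix, sigma^2 * (X'X)^-1.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)

# Slope for the second level = coefficient on x + coefficient on group:x.
c = np.array([0.0, 0.0, 1.0, 1.0])
slope = c @ beta
se = np.sqrt(c @ cov @ c)

# The same standard error written as Var(a) + Var(b) + 2 Cov(a, b).
se_expanded = np.sqrt(cov[2, 2] + cov[3, 3] + 2 * cov[2, 3])
print(slope, se, se_expanded)
```

With a fitted statsmodels result, the same kind of contrast row can be passed to t_test as an array instead of doing the matrix algebra by hand.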

(Aside: More helper methods are in the works for statsmodels to make prediction and margin calculations like this easier; they will most likely be available later in the year.)

As a brief explanation of the following code:

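  • I make up a similar example.
  • I define the explanatory variables with length equal to 1 or 2 for each animal type.
  • Then I calculate the difference between these explanatory variables.
  • This defines a linear combination, or contrast, of the parameters that can be used in t_test.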

Finally, I compare with the result from predict to check that I didn't make any obvious mistakes. (I assume this is correct but I had written it pretty fast.)

import numpy as np
import pandas as pd

from statsmodels.regression.linear_model import OLS

np.random.seed(2)
nobs = 20
animal_names = np.array(['cat', 'dog', 'snake'])
animal_idx = np.random.randint(0, 3, size=nobs)  # integers 0, 1, 2 (upper bound is exclusive)
animal = animal_names[animal_idx]
length = np.random.randn(nobs) + animal_idx
weight = np.random.randn(nobs) + animal_idx + length

data = pd.DataFrame(dict(length=length, weight=weight, animal=animal))

res = OLS.from_formula('weight ~ length * animal', data=data).fit()
print(res.summary())


# One row per animal with length = 1, and the same rows with length = 2.
data_predict1 = pd.DataFrame(dict(length=np.ones(3), weight=np.ones(3),
                                  animal=animal_names))

data_predict2 = pd.DataFrame(dict(length=2*np.ones(3), weight=np.ones(3),
                                  animal=animal_names))

import patsy
x1 = patsy.dmatrix('length * animal', data_predict1)
x2 = patsy.dmatrix('length * animal', data_predict2)

tt = res.t_test(x2 - x1)
print(tt.summary(xname=animal_names.tolist()))

The result of the last print is

                             Test for Constraints                             
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
cat            1.0980      0.280      3.926      0.002         0.498     1.698
dog            0.9664      0.860      1.124      0.280        -0.878     2.811
snake          1.5930      0.428      3.720      0.002         0.675     2.511

We can verify the results by using predict and compare the difference in predicted weight if the length for a given animal type increases from 1 to 2:

>>> [res.predict({'length': 2, 'animal':[an]}) - res.predict({'length': 1, 'animal':[an]}) for an in animal_names]
[array([ 1.09801656]), array([ 0.96641455]), array([ 1.59301594])]
>>> tt.effect
array([ 1.09801656,  0.96641455,  1.59301594])

Note: I forgot to add a seed for the random numbers and the numbers cannot be replicated.
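
For models with many categories, the contrast rows can also be built from the parameter names rather than from prediction data. The helper below, slope_contrast, is hypothetical (it is not a statsmodels function) and assumes patsy's treatment-coding names such as animal[T.snake]. The list of names is typed by hand for illustration; in practice it would come from something like res.params.index.

```python
import numpy as np

def slope_contrast(param_names, continuous, categorical, level):
    """Return a 0/1 row selecting the terms that make up the slope of
    `continuous` for one `level` of `categorical` (hypothetical helper)."""
    level_term = f"{categorical}[T.{level}]"
    row = np.zeros(len(param_names))
    for k, name in enumerate(param_names):
        factors = name.split(":")
        if continuous not in factors:
            continue  # term does not involve the continuous variable
        others = [f for f in factors if f != continuous]
        # Keep the main effect, or its interaction with this level only.
        if others == [] or others == [level_term]:
            row[k] = 1.0
    return row

# Hand-typed parameter names, assuming cat is the base level.
names = ["Intercept", "animal[T.dog]", "animal[T.snake]",
         "length", "animal[T.dog]:length", "animal[T.snake]:length"]

snake_row = slope_contrast(names, "length", "animal", "snake")
cat_row = slope_contrast(names, "length", "animal", "cat")
print(snake_row)
print(cat_row)
```

Because the base level has no dummy term, it needs no special handling: only the main effect of length is selected. Rows like these, stacked with np.vstack, have the same shape as the x2 - x1 matrix passed to t_test above.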
