Using categorical variables in statsmodels OLS class


Problem description

I want to use the statsmodels OLS class to create a multiple regression model. Consider the following dataset:

import statsmodels.api as sm
import pandas as pd
import numpy as np

dict = {'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
  'debt_ratio':np.random.randn(5), 'cash_flow':np.random.randn(5) + 90} 

df = pd.DataFrame.from_dict(dict)

x = df[['debt_ratio', 'industry']]
y = df['cash_flow']

def reg_sm(x, y):
    x = np.array(x).T
    x = sm.add_constant(x)
    results = sm.OLS(endog = y, exog = x).fit()
    return results

When I run the following code:

reg_sm(x, y)

I get the following error:

TypeError: '>=' not supported between instances of 'float' and 'str'

I've tried converting the industry variable to categorical, but I still get an error. I'm out of options.

Answer

You're on the right path with converting to a Categorical dtype. However, once you convert the DataFrame to a NumPy array, you get an object dtype (a NumPy array is one uniform type as a whole). This means that the individual values are still str underneath, which a regression is definitely not going to like.
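A minimal sketch of that failure mode (the toy frame here is illustrative, not the question's data): converting a DataFrame with mixed string and float columns to a NumPy array collapses everything to `object` dtype, and the strings survive inside it.

```python
import numpy as np
import pandas as pd

# Mixed str/float columns, as in the question's DataFrame
toy = pd.DataFrame({'industry': ['mining', 'finance'],
                    'debt_ratio': [0.5, 1.2]})

arr = np.array(toy)      # whole array collapses to a single dtype
print(arr.dtype)         # object
print(type(arr[0, 0]))   # <class 'str'> -- still a string under the hood
```

Passing such an array as `exog` is what triggers the `'>=' not supported between instances of 'float' and 'str'` comparison error inside the fit.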

What you might want to do is dummify this feature. Instead of factorizing it, which would effectively treat the variable as continuous, you want to maintain some semblance of categorization:

>>> import statsmodels.api as sm
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(444)
>>> data = {
...     'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
...    'debt_ratio':np.random.randn(5),
...    'cash_flow':np.random.randn(5) + 90
... }
>>> data = pd.DataFrame.from_dict(data)
>>> data = pd.concat((
...     data,
...     pd.get_dummies(data['industry'], drop_first=True)), axis=1)
>>> # You could also use data.drop('industry', axis=1)
>>> # in the call to pd.concat()
>>> data
         industry  debt_ratio  cash_flow  finance  hospitality  mining  transportation
0          mining    0.357440  88.856850        0            0       1               0
1  transportation    0.377538  89.457560        0            0       0               1
2     hospitality    1.382338  89.451292        0            1       0               0
3         finance    1.175549  90.208520        1            0       0               0
4   entertainment   -0.939276  90.212690        0            0       0               0

Now you have dtypes that statsmodels can work with more easily. The purpose of drop_first is to avoid the dummy trap:
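The dummy trap is perfect collinearity: if all k dummy columns are kept, they sum to 1 in every row and therefore duplicate an intercept column exactly. A minimal sketch on a toy series (not the question's data):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'])

full = pd.get_dummies(s)                       # 3 columns: a, b, c
reduced = pd.get_dummies(s, drop_first=True)   # 2 columns: b, c

print(full.sum(axis=1).tolist())  # each row sums to 1 -> collinear with a constant
print(list(reduced.columns))      # ['b', 'c'] -- 'a' becomes the reference level
```

With `drop_first=True`, the dropped category becomes the baseline, and each remaining dummy's coefficient is interpreted relative to it.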

>>> y = data['cash_flow']
>>> x = data.drop(['cash_flow', 'industry'], axis=1)
>>> sm.OLS(y, x).fit()
<statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x115b87cf8>
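As an aside, statsmodels also ships a formula API that performs this dummy encoding for you; wrapping a column in C() marks it as categorical and a reference level is dropped automatically. A sketch with the question's data (the formula string is an alternative to manual get_dummies, not what the answer above uses):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(444)
data = pd.DataFrame({
    'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
    'debt_ratio': np.random.randn(5),
    'cash_flow': np.random.randn(5) + 90,
})

# C(industry) expands to dummies with one reference category;
# an intercept is included by default.
results = smf.ols('cash_flow ~ debt_ratio + C(industry)', data=data).fit()
print(results.params.index.tolist())
```

Note that with only 5 rows and 6 parameters this toy fit is rank-deficient; on a real dataset the formula approach and the manual get_dummies approach give the same design matrix.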

Lastly, just a small pointer: it helps to avoid naming references with names that shadow built-in object types, such as dict.

