[Statsmodels]: How can I get statsmodel to return the pvalue of an OLS object?

Question

I'm quite new to programming and I'm jumping on python to get some familiarity with data analysis and machine learning.

I am following a tutorial on backward elimination for a multiple linear regression. Here is the code right now:

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

#Taking care of missing data
#np.set_printoptions(threshold=100)
# Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

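#Encoding categorical data
# OneHotEncoder's categorical_features argument is no longer available in
# current scikit-learn; ColumnTransformer selects the column to encode instead.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [3])], remainder = 'passthrough')
X = np.array(columnTransformer.fit_transform(X), dtype = float)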

#Avoid the Dummy Variables Trap
X = X[:, 1:]

#Splitting data in train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

#Fitting multiple Linear Regression to Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

#Predict Test set
y_pred = regressor.predict(X_test)

#Building the optimal model using Backward Elimination
import statsmodels.api as sm  # OLS is exposed here; formula.api provides the formula-based ols
a, b = X.shape
X = np.append(arr = np.ones((a, 1)).astype(int), values = X, axis = 1)
print (X.shape)

X_optimal = X[:,[0,1,2,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()
X_optimal = X[:,[0,1,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()
X_optimal = X[:,[0,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()
X_optimal = X[:,[0,3,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()
X_optimal = X[:,[0,3]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()

Now, the way the elimination is performed seems really manual to me, and I'd like to automate it. To do so, I'd like to know if there is a way to get the p-values of the fitted regressor returned somehow (e.g., if there is a method or attribute in statsmodels that does that). That way, I think I should be able to loop over the features of the X_optimal array, check whether each p-value is greater than my significance level (SL), and eliminate the feature if it is.

Thanks!

Recommended answer

I ran into the same problem.

You can access the p-values via

regressor_OLS.pvalues 

The p-values are returned as an array of float64 values (printed in scientific notation). I'm a bit new to Python and I'm sure there are cleaner, more elegant solutions, but this was mine:

import numpy as np
import statsmodels.api as sm

sigLevel = 0.05

X_opt = X[:, [0, 1, 2, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
pVals = regressor_OLS.pvalues

# Keep dropping the column with the largest p-value until every
# remaining p-value is at or below the significance level.
# Compare the largest p-value (np.max), not its index (np.argmax).
while np.max(pVals) > sigLevel:
    droppedDimIndex = np.argmax(pVals)
    keptDims = list(range(X_opt.shape[1]))
    keptDims.pop(droppedDimIndex)
    print("pval of dim removed: " + str(pVals[droppedDimIndex]))
    X_opt = X_opt[:, keptDims]
    regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
    pVals = regressor_OLS.pvalues
    print(str(len(pVals) - 1) + " dimensions remaining...")
    print(pVals)

regressor_OLS.summary()
