[Statsmodels]: How can I get statsmodels to return the p-value of an OLS object?
Question
I'm quite new to programming, and I'm picking up Python to get familiar with data analysis and machine learning.
I am following a tutorial on backward elimination for multiple linear regression. Here is my code so far:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
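#Taking care of missing data
#np.set_printoptions(threshold=100)
# Imputer was removed from sklearn.preprocessing in newer scikit-learn releases;
# SimpleImputer is its replacement and always imputes column-wise.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')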
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
#Encoding categorical data
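from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# OneHotEncoder's categorical_features argument was removed in newer
# scikit-learn releases; ColumnTransformer selects the column instead.
# The encoded dummy columns come first, followed by the remaining columns.
columnTransformer = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder = 'passthrough')
X = np.array(columnTransformer.fit_transform(X), dtype = float)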
#Avoid the Dummy Variables Trap
X = X[:, 1:]
#Splitting data in train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
#Fitting multiple Linear Regression to Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
#Predict Test set
y_pred = regressor.predict(X_test)
#Building the optimal model using Backward Elimination
import statsmodels.api as sm
a = 0
b = 0
a, b = X.shape
X = np.append(arr = np.ones((a, 1)).astype(int), values = X, axis = 1)
print (X.shape)
X_optimal = X[:,[0,1,2,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()
X_optimal = X[:,[0,1,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()
X_optimal = X[:,[0,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()
X_optimal = X[:,[0,3,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()
X_optimal = X[:,[0,3]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()
Now, performing the elimination this way seems very manual to me, and I'd like to automate it. To do so, I'd like to know whether there is a way to get the p-values of the fitted regressor returned (e.g. a method or attribute in statsmodels that does that). That way I think I could loop over the features of the X_optimal array, check whether a p-value is greater than my significance level (SL), and eliminate that feature.
Thanks!
Recommended answer
I ran into the same problem.
You can access the p-values via
regressor_OLS.pvalues
They're stored as an array of float64 values (printed in scientific notation). I'm a bit new to Python and I'm sure there are cleaner, more elegant solutions, but this was mine:
sigLevel = 0.05
X_opt = X[:,[0,1,2,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
pVals = regressor_OLS.pvalues
# Keep dropping the column with the largest p-value until all are <= sigLevel.
# Note: the constant column (index 0) is treated like any other column here.
while np.max(pVals) > sigLevel:
    droppedDimIndex = np.argmax(pVals)
    keptDims = list(range(X_opt.shape[1]))
    keptDims.pop(droppedDimIndex)
    print("pval of dim removed: " + str(np.max(pVals)))
    X_opt = X_opt[:, keptDims]
    regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
    pVals = regressor_OLS.pvalues
    print(str(len(pVals) - 1) + " dimensions remaining...")
print(pVals)
regressor_OLS.summary()