Different results from lm in R vs. statsmodels OLS in Python

Problem description

I'm new to Python and have been an R user. I am getting VERY different results from a simple regression model when I build it in R vs. when I execute the same thing in iPython.

The R-squared, the p-value, the significance of the coefficients - nothing matches. Am I reading the output wrong, or making some other fundamental error?

Below is my code for both, with the results:

R code

str(df_nv)
Classes 'tbl_df', 'tbl' and 'data.frame':   81 obs. of  2 variables:
 $ Dependent Variable  : num  733 627 405 353 434 556 381 558 612 901 ...
 $ Independent Variable: num  0.193 0.167 0.169 0.14 0.145 ...


summary(lm(`Dependent Variable` ~ `Independent Variable`, data = df_nv))

Call:
    lm(formula = `Dependent Variable` ~ `Independent Variable`, data = df_nv)


Residuals:
    Min      1Q  Median      3Q     Max 
-501.18 -139.20  -82.61  -15.82 2136.74 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)   
(Intercept)               478.2      148.2   3.226  0.00183 **
`Independent Variable`   -196.1     1076.9  -0.182  0.85601   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 381.5 on 79 degrees of freedom
Multiple R-squared:  0.0004194, Adjusted R-squared:  -0.01223 
F-statistic: 0.03314 on 1 and 79 DF,  p-value: 0.856

iPython Notebook code

df_nv.dtypes

Dependent Variable           float64
Independent Variable         float64
dtype: object

model = sm.OLS(df_nv['Dependent Variable'], df_nv['Independent Variable'])

results = model.fit()
results.summary()

                             OLS Regression Results
==============================================================================
Dep. Variable:     Dependent Variable   R-squared:                       0.537
Model:                            OLS   Adj. R-squared:                  0.531
Method:                 Least Squares   F-statistic:                     92.63
Date:                Fri, 20 Jan 2017   Prob (F-statistic):           5.23e-15
Time:                        09:08:54   Log-Likelihood:                -600.40
No. Observations:                  81   AIC:                             1203.
Df Residuals:                      80   BIC:                             1205.
Df Model:                           1
Covariance Type:            nonrobust
========================================================================================
                           coef    std err          t      P>|t|  [95.0% Conf. Int.]
----------------------------------------------------------------------------------------
Independent Variable   3133.1825    325.537      9.625      0.000   2485.342  3781.023
==============================================================================
Omnibus:                       89.595   Durbin-Watson:                   1.940
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              980.289
Skew:                           3.489   Prob(JB):                    1.36e-213
Kurtosis:                      18.549   Cond. No.                         1.00
==============================================================================
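(Editor's note: the jump in R-squared from 0.0004 in R to 0.537 in Python is itself a clue. `sm.OLS` was given only the predictor column, so it fit a line through the origin, and for a no-intercept model statsmodels reports *uncentered* R-squared, computed against Σy² rather than Σ(y − ȳ)². A sketch on made-up data, not the asker's real `df_nv`, showing how the two definitions diverge:)

```python
import numpy as np

# Toy data: y has a large mean (~500) and essentially no dependence on x,
# mimicking the shape of the question's data.
rng = np.random.default_rng(1)
x = rng.uniform(0.1, 0.2, 40)
y = rng.normal(500.0, 100.0, 40)

# OLS slope for a through-the-origin fit: b = (x'y) / (x'x)
b = (x @ y) / (x @ x)
resid = y - b * x

# Uncentered R-squared (what statsmodels reports when there is no constant)
r2_uncentered = 1 - (resid @ resid) / (y @ y)
# Centered R-squared (what R's lm reports, intercept included in the model)
r2_centered = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

print(r2_uncentered)  # large, because y's mean is far from 0
print(r2_centered)    # small or negative: x explains almost nothing
```

The through-origin fit soaks up the mean of `y` in the slope, so the uncentered ratio looks impressive even when the predictor is useless.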

For reference, the head of the dataframe in both R and Python:

R:

head(df_nv)
  Dependent Variable Independent Variable
          <dbl>                <dbl>
1           733            0.1932367
2           627            0.1666667
3           405            0.1686183
4           353            0.1398601
5           434            0.1449275
6           556            0.1475410

Python:

df_nv.head()

    Dependent Variable  Independent Variable
5292    733.0   0.193237
5320    627.0   0.166667
5348    405.0   0.168618
5404    353.0   0.139860
5460    434.0   0.144928

Recommended answer

The following is the result of running a linear regression on the gapminder dataset with Python pandas (using statsmodels.formula.api) and with R; the results are exactly the same:

df <- read.csv('gapminder.csv')
df <- df[c('internetuserate', 'urbanrate')]
df <- df[complete.cases(df),]
dim(df)
# [1] 190   2
m <- lm(internetuserate~urbanrate, df)
summary(m)
#Call:
#lm(formula = internetuserate ~ urbanrate, data = df)

#Residuals:
#    Min      1Q  Median      3Q     Max 
#-51.474 -15.857  -3.954  14.305  74.590 

#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept) -4.90375    4.11485  -1.192    0.235    
#urbanrate    0.72022    0.06753  10.665   <2e-16 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
#Residual standard error: 22.03 on 188 degrees of freedom
#Multiple R-squared:  0.3769,   Adjusted R-squared:  0.3736 
#F-statistic: 113.7 on 1 and 188 DF,  p-value: < 2.2e-16

Python code

import pandas
import statsmodels.formula.api as smf 
data = pandas.read_csv('gapminder.csv')
data = data[['internetuserate', 'urbanrate']]
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data = data.dropna(axis=0, how='any')
print(data.shape)
# (190, 2)
reg1 = smf.ols('internetuserate ~  urbanrate', data=data).fit()
print (reg1.summary())
#                           OLS Regression Results
#==============================================================================
#Dep. Variable:        internetuserate   R-squared:                       0.377
#Model:                            OLS   Adj. R-squared:                  0.374
#Method:                 Least Squares   F-statistic:                     113.7
#Date:                Fri, 20 Jan 2017   Prob (F-statistic):           4.56e-21
#Time:                        23:27:50   Log-Likelihood:                -856.14
#No. Observations:                 190   AIC:                             1716.
#Df Residuals:                     188   BIC:                             1723.
#Df Model:                           1
#Covariance Type:            nonrobust
#================================================================================
#                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
#    ------------------------------------------------------------------------------
#    Intercept     -4.9037      4.115     -1.192      0.235       -13.021     3.213
#    urbanrate      0.7202      0.068     10.665      0.000         0.587     0.853
#================================================================================
#    Omnibus:                       10.750   Durbin-Watson:                   2.097
#    Prob(Omnibus):                  0.005   Jarque-Bera (JB):               10.990
#    Skew:                           0.574   Prob(JB):                      0.00411
#    Kurtosis:                       3.262   Cond. No.                         157.
#==============================================================================
