Differences in Linear Regression in R and Python

Problem Description

I was trying to match the linear regression R results with that of python

matching the coefficient for each of the independent variables,

and below is the code:

The data has been uploaded:

https://www.dropbox.com/s/oowe4irm9332s78/X.csv?dl=0

https://www.dropbox.com/s/79scp54unzlbwyk/Y.csv?dl=0

R code:

#define pathname = " "

X <- read.csv(file.path(pathname,"X.csv"),stringsAsFactors = F)
Y <- read.csv(file.path(pathname,"Y.csv"),stringsAsFactors = F)

d1 = Y$d1

reg_data <- cbind(d1, X)
head(reg_data)

reg_model <- lm(d1 ~ .,data=reg_data)

names(which(is.na(reg_model$coefficients)))

reg_model$coefficients

R results:

> summary(reg_model)

Call:
lm(formula = d1 ~ ., data = reg_data)

Residuals:
ALL 60 residuals are 0: no residual degrees of freedom!

Coefficients: (14 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -67.37752         NA      NA       NA
v1            1.30214         NA      NA       NA
v2           -2.93118         NA      NA       NA
v3            7.58902         NA      NA       NA
v4           11.88570         NA      NA       NA
v5            1.60622         NA      NA       NA
v6            3.71528         NA      NA       NA
v7           -9.34627         NA      NA       NA
v8           -3.84694         NA      NA       NA
v9           -2.51332         NA      NA       NA
v10           4.22403         NA      NA       NA
v11          -9.70126         NA      NA       NA
v12                NA         NA      NA       NA
v13           4.67276         NA      NA       NA
v14          -6.57924         NA      NA       NA
v15          -3.68065         NA      NA       NA
v16           5.25168         NA      NA       NA
v17          14.60444         NA      NA       NA
v18          16.00679         NA      NA       NA
v19          24.79622         NA      NA       NA
v20          13.85774         NA      NA       NA
v21           2.16022         NA      NA       NA
v22         -36.65361         NA      NA       NA
v23           2.26554         NA      NA       NA
v24                NA         NA      NA       NA
v25                NA         NA      NA       NA
v26           7.00981         NA      NA       NA
v27           0.88904         NA      NA       NA
v28           0.34400         NA      NA       NA
v29          -5.27597         NA      NA       NA
v30           5.21034         NA      NA       NA
v31           6.79640         NA      NA       NA
v32           2.96346         NA      NA       NA
v33          -1.52702         NA      NA       NA
v34          -2.74632         NA      NA       NA
v35          -2.36952         NA      NA       NA
v36          -7.76547         NA      NA       NA
v37           2.19630         NA      NA       NA
v38           1.63336         NA      NA       NA
v39           0.69485         NA      NA       NA
v40           0.37379         NA      NA       NA
v41          -0.09107         NA      NA       NA
v42           2.06569         NA      NA       NA
v43           1.57505         NA      NA       NA
v44           2.70535         NA      NA       NA
v45           1.17634         NA      NA       NA
v46         -10.51141         NA      NA       NA
v47          -1.15060         NA      NA       NA
v48           2.87353         NA      NA       NA
v49           3.37740         NA      NA       NA
v50          -5.89816         NA      NA       NA
v51           0.85851         NA      NA       NA
v52           3.73929         NA      NA       NA
v53           4.93265         NA      NA       NA
v54           3.45650         NA      NA       NA
v55           0.12382         NA      NA       NA
v56          -0.21171         NA      NA       NA
v57           4.37199         NA      NA       NA
v58           3.21456         NA      NA       NA
v59           0.09012         NA      NA       NA
v60          -0.85414         NA      NA       NA
v61          -3.29856         NA      NA       NA
v62           4.38842         NA      NA       NA
v63                NA         NA      NA       NA
v64                NA         NA      NA       NA
v65                NA         NA      NA       NA
v66                NA         NA      NA       NA
v67                NA         NA      NA       NA
v68                NA         NA      NA       NA
v69                NA         NA      NA       NA
v70                NA         NA      NA       NA
v71                NA         NA      NA       NA
v72                NA         NA      NA       NA
v73                NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 59 and 0 DF,  p-value: NA

Python code:

import pandas as pd
from sklearn.linear_model import LinearRegression

Y = pd.read_csv(pathname+"Y.csv")
X = pd.read_csv(pathname+"X.csv")

# note: the normalize argument was removed from LinearRegression in recent scikit-learn releases
lr = LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
lr.fit(X, Y['d1'])

(list(zip(lr.coef_, X)))

lr.intercept_

Python results:

intercept = 29.396033164254106

[(-2.4463986167806304, 'v1'),
 (-1.6293010275307021, 'v2'),
 (0.89089949009506508, 'v3'),
 (-3.1021251646895251, 'v4'),
 (-1.7707078771936109, 'v5'),
 (-2.0474705122225636, 'v6'),
 (-1.5537181337496202, 'v7'),
 (-1.6391241229716156, 'v8'),
 (-1.2981646048517046, 'v9'),
 (0.89221826294889328, 'v10'),
 (-0.56694104645951571, 'v11'),
 (2.042810365310288e-14, 'v12'),
 (-2.0312478672439052, 'v13'),
 (-1.5617121392788413, 'v14'),
 (0.4583365939498274, 'v15'),
 (0.8840538748922967, 'v16'),
 (-5.5952681002058871, 'v17'),
 (2.4937042448512892, 'v18'),
 (0.45806845189176543, 'v19'),
 (-1.1648810657830406, 'v20'),
 (-1.7800004329275585, 'v21'),
 (-5.0132817522704816, 'v22'),
 (3.6862778096189266, 'v23'),
 (2.7533531010703882e-14, 'v24'),
 (1.2150003225741557e-14, 'v25'),
 (0.94669823515018103, 'v26'),
 (-0.3082823207975679, 'v27'),
 (0.53619247380957358, 'v28'),
 (-1.1339902793546781, 'v29'),
 (1.9657159583080186, 'v30'),
 (-0.63200501460653324, 'v31'),
 (1.4741013580918978, 'v32'),
 (-2.4448418291953313, 'v33'),
 (-2.0787115960875036, 'v34'),
 (0.22492914212063603, 'v35'),
 (-0.75136276693004922, 'v36'),
 (1.2838658951186761, 'v37'),
 (0.5816277993227944, 'v38'),
 (-0.11270569554555088, 'v39'),
 (-0.13430982360936233, 'v40'),
 (-3.3189296496897662, 'v41'),
 (-0.452575588270415, 'v42'),
 (6.1329755709937519, 'v43'),
 (0.61559185634634817, 'v44'),
 (-1.206315459828555, 'v45'),
 (-3.7452010299772009, 'v46'),
 (-1.1472174665136678, 'v47'),
 (2.8960489381172652, 'v48'),
 (0.0090220136972478659, 'v49'),
 (-5.264918363314754, 'v50'),
 (1.2194758337662015, 'v51'),
 (2.78655271320092, 'v52'),
 (3.106513852668896, 'v53'),
 (3.5181252502607929, 'v54'),
 (-0.34426523770507278, 'v55'),
 (-0.48792823932479878, 'v56'),
 (0.12284460490031779, 'v57'),
 (1.6860388628044991, 'v58'),
 (1.2823067194737174, 'v59'),
 (2.8352263554153665, 'v60'),
 (-1.304336378501032, 'v61'),
 (0.55226132316435139, 'v62'),
 (1.5416988124754771, 'v63'),
 (-0.2605804175310813, 'v64'),
 (1.2489066081702334, 'v65'),
 (-0.44469553013696161, 'v66'),
 (-1.4102990055550157, 'v67'),
 (3.8150423259022639, 'v68'),
 (0.12039684410168072, 'v69'),
 (-1.340699466779357, 'v70'),
 (1.7066389124439212, 'v71'),
 (0.50470752944860442, 'v72'),
 (1.0024872633969766, 'v73')]

But they do not match. Please help.

Note: it matched for the example below:

http://davidcoallier.com/blog/linear-regression-from-r-to-python/

Answer

tl;dr if you want to replicate R's strategy in Python you're probably going to have to implement it yourself, as R does some clever stuff that's not widely available elsewhere.

For reference (since it has so far been mentioned only in comments), this is an ill-posed/rank-deficient fit, which will always happen when there are more predictors than observations (p > n: in this case p = 73, n = 61), and often when there are many categorical predictors and/or the experimental design is constrained in some way. Dealing with these situations so as to get an answer that means anything at all typically requires careful thought and/or advanced techniques (e.g. penalized regression: see the references to lasso and ridge regression in the Wikipedia article on linear regression).
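
As an illustrative aside (not part of the original answer): a minimal sketch of one such remedy, ridge regression via scikit-learn, which yields a well-defined (shrunken) fit even when p > n. The alpha value is purely illustrative and would normally be chosen by cross-validation; X and Y are the objects loaded in the question's Python code.

from sklearn.linear_model import Ridge

# Penalized (ridge) fit: well-defined even when p > n, unlike plain OLS.
# alpha=1.0 is an arbitrary illustrative value, not a recommendation.
ridge = Ridge(alpha=1.0)
ridge.fit(X, Y['d1'])
print(ridge.intercept_)
print(ridge.coef_[:5])    # first few of the 73 shrunken coefficients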

The most naive way to handle this situation is to throw it all in to the standard linear algebra and hope that nothing breaks too badly, which is apparently what python's statsmodels package does: from the pitfalls document:

  • Rank deficient matrices will not raise an error.
  • Cases of almost perfect multicollinearity or ill-conditioned design matrices might produce numerically unstable results. Users need to manually check the rank or condition number of the matrix if this is not the desired behavior (a minimal sketch of such a check follows below).
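
A minimal sketch of that manual check, assuming X is the pandas DataFrame loaded in the question's code:

import numpy as np

X_mat = X.values
print(X_mat.shape)                    # here the column count exceeds the row count
print(np.linalg.matrix_rank(X_mat))   # rank below the column count signals a deficient design
print(np.linalg.cond(X_mat))          # a very large condition number signals ill-conditioning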

The next best thing (reasonable when there is a small degree of collinearity) is to pivot sensibly when doing the linear algebra, that is, rearrange the computational problem so that the collinear parts can be left out. That's what R does; in order to do so, the authors of the R code had to modify the standard LINPACK routines.

The underlying code carries this comment/explanation:

c dqrdc2 uses householder transformations to compute the qr
c factorization of an n by p matrix x. a limited column
c pivoting strategy based on the 2-norms of the reduced columns
c moves columns with near-zero norm to the right-hand edge of
c the x matrix. this strategy means that sequential one
c degree-of-freedom effects can be computed in a natural way.

c i am very nervous about modifying linpack code in this way.
c if you are a computational linear algebra guru and you really
c understand how to solve this problem please feel free to
c suggest improvements to this code.

This code (and comment) have been in R's code base since 1998; I'd love to know who originally wrote it (based on comments further down in the code it seems to have been Ross Ihaka?), but am having trouble following the code's history back beyond a code reorganization in 1998. (A little more digging suggests that this code has been in R's code base essentially since the beginning of its recorded history, i.e. the file was added in SVN revision 2 1997-09-18 and not modified until much later.)

Martin Mächler recently (2016 Oct 25, here) added more information about this to ?qr, so that this information will actually be available in the documentation in the next release of R ...

If you know how to link compiled FORTRAN code with Python code (I don't), it would be pretty easy to compile src/appl/dqrdc2.f and translate the guts of lm.fit into Python: this is the core of lm.fit, minus error-checking and other processing ...

z <- .Call(C_Cdqrls, x, y, tol, FALSE)  # pivoted QR fit via R's modified LINPACK routine
coef <- z$coefficients
pivot <- z$pivot
r1 <- seq_len(z$rank)                   # positions of the estimable coefficients
dn <- colnames(x)
nmeffects <- c(dn[pivot[r1]], rep.int("", n - z$rank))  # n, p are set earlier in lm.fit
r2 <- if (z$rank < p)                   # columns beyond the rank are aliased
    (z$rank + 1L):p
else integer()
if (is.matrix(y)) { ## ... 
}  else {
    coef[r2] <- NA                      # aliased coefficients are reported as NA
    if (z$pivoted) 
        coef[pivot] <- coef             # undo the pivoting: restore original column order
    names(coef) <- dn
    names(z$effects) <- nmeffects
}
z$coefficients <- coef
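
For comparison, here is a rough Python sketch of the same pivoting idea, built on SciPy's column-pivoted QR. This is an approximation rather than a port of R's algorithm: LAPACK's dgeqp3 (which SciPy wraps) pivots greedily over all columns, while R's dqrdc2 uses the limited strategy described in the Fortran comment above, so the two can disagree about which columns get dropped.

import numpy as np
from scipy import linalg

def lm_fit_pivoted(X, y, tol=1e-7):
    """Least squares via column-pivoted QR; aliased columns come back as
    NaN, loosely imitating R's lm.fit. X should already contain an
    intercept column of ones."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    Q, R, piv = linalg.qr(X, mode='economic', pivoting=True)
    # estimate the rank from the diagonal of R, analogous to how dqrdc2
    # watches the 2-norms of the reduced columns
    d = np.abs(np.diag(R))
    rank = int(np.sum(d > tol * d[0]))
    # solve the rank-sized triangular system for the pivoted columns
    coef_piv = linalg.solve_triangular(R[:rank, :rank], Q[:, :rank].T @ y)
    coef = np.full(p, np.nan)       # aliased columns stay NaN (R prints NA)
    coef[piv[:rank]] = coef_piv     # undo the pivoting, as lm.fit does above
    return coef

On a rank-deficient design this returns finite coefficients for rank-many columns and NaN for the rest, mirroring the NA rows in the R summary, although not necessarily for the same columns as R, because of the different pivoting strategy.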

