Differences in Linear Regression in R and Python
Problem description
I was trying to match the linear regression results from R with those from Python, specifically the coefficient of each independent variable. The code is below. The data has been uploaded here:
https://www.dropbox.com/s/oowe4irm9332s78/X.csv?dl=0
https://www.dropbox.com/s/79scp54unzlbwyk/Y.csv?dl=0
R code:
#define pathname = " "
X <- read.csv(file.path(pathname,"X.csv"),stringsAsFactors = F)
Y <- read.csv(file.path(pathname,"Y.csv"),stringsAsFactors = F)
d1 = Y$d1
reg_data <- cbind(d1, X)
head(reg_data)
reg_model <- lm(d1 ~ .,data=reg_data)
names(which(is.na(reg_model$coefficients)))
reg_model$coefficients
R results:
> summary(reg_model)
Call:
lm(formula = d1 ~ ., data = reg_data)
Residuals:
ALL 60 residuals are 0: no residual degrees of freedom!
Coefficients: (14 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -67.37752 NA NA NA
v1 1.30214 NA NA NA
v2 -2.93118 NA NA NA
v3 7.58902 NA NA NA
v4 11.88570 NA NA NA
v5 1.60622 NA NA NA
v6 3.71528 NA NA NA
v7 -9.34627 NA NA NA
v8 -3.84694 NA NA NA
v9 -2.51332 NA NA NA
v10 4.22403 NA NA NA
v11 -9.70126 NA NA NA
v12 NA NA NA NA
v13 4.67276 NA NA NA
v14 -6.57924 NA NA NA
v15 -3.68065 NA NA NA
v16 5.25168 NA NA NA
v17 14.60444 NA NA NA
v18 16.00679 NA NA NA
v19 24.79622 NA NA NA
v20 13.85774 NA NA NA
v21 2.16022 NA NA NA
v22 -36.65361 NA NA NA
v23 2.26554 NA NA NA
v24 NA NA NA NA
v25 NA NA NA NA
v26 7.00981 NA NA NA
v27 0.88904 NA NA NA
v28 0.34400 NA NA NA
v29 -5.27597 NA NA NA
v30 5.21034 NA NA NA
v31 6.79640 NA NA NA
v32 2.96346 NA NA NA
v33 -1.52702 NA NA NA
v34 -2.74632 NA NA NA
v35 -2.36952 NA NA NA
v36 -7.76547 NA NA NA
v37 2.19630 NA NA NA
v38 1.63336 NA NA NA
v39 0.69485 NA NA NA
v40 0.37379 NA NA NA
v41 -0.09107 NA NA NA
v42 2.06569 NA NA NA
v43 1.57505 NA NA NA
v44 2.70535 NA NA NA
v45 1.17634 NA NA NA
v46 -10.51141 NA NA NA
v47 -1.15060 NA NA NA
v48 2.87353 NA NA NA
v49 3.37740 NA NA NA
v50 -5.89816 NA NA NA
v51 0.85851 NA NA NA
v52 3.73929 NA NA NA
v53 4.93265 NA NA NA
v54 3.45650 NA NA NA
v55 0.12382 NA NA NA
v56 -0.21171 NA NA NA
v57 4.37199 NA NA NA
v58 3.21456 NA NA NA
v59 0.09012 NA NA NA
v60 -0.85414 NA NA NA
v61 -3.29856 NA NA NA
v62 4.38842 NA NA NA
v63 NA NA NA NA
v64 NA NA NA NA
v65 NA NA NA NA
v66 NA NA NA NA
v67 NA NA NA NA
v68 NA NA NA NA
v69 NA NA NA NA
v70 NA NA NA NA
v71 NA NA NA NA
v72 NA NA NA NA
v73 NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 59 and 0 DF, p-value: NA
Python code:
import pandas as pd
from sklearn.linear_model import LinearRegression

Y = pd.read_csv(pathname+"Y.csv")
X = pd.read_csv(pathname+"X.csv")
lr = LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
lr.fit(X, Y['d1'])
(list(zip(lr.coef_, X)))
lr.intercept_
Python results:
intercept = 29.396033164254106
[(-2.4463986167806304, 'v1'),
(-1.6293010275307021, 'v2'),
(0.89089949009506508, 'v3'),
(-3.1021251646895251, 'v4'),
(-1.7707078771936109, 'v5'),
(-2.0474705122225636, 'v6'),
(-1.5537181337496202, 'v7'),
(-1.6391241229716156, 'v8'),
(-1.2981646048517046, 'v9'),
(0.89221826294889328, 'v10'),
(-0.56694104645951571, 'v11'),
(2.042810365310288e-14, 'v12'),
(-2.0312478672439052, 'v13'),
(-1.5617121392788413, 'v14'),
(0.4583365939498274, 'v15'),
(0.8840538748922967, 'v16'),
(-5.5952681002058871, 'v17'),
(2.4937042448512892, 'v18'),
(0.45806845189176543, 'v19'),
(-1.1648810657830406, 'v20'),
(-1.7800004329275585, 'v21'),
(-5.0132817522704816, 'v22'),
(3.6862778096189266, 'v23'),
(2.7533531010703882e-14, 'v24'),
(1.2150003225741557e-14, 'v25'),
(0.94669823515018103, 'v26'),
(-0.3082823207975679, 'v27'),
(0.53619247380957358, 'v28'),
(-1.1339902793546781, 'v29'),
(1.9657159583080186, 'v30'),
(-0.63200501460653324, 'v31'),
(1.4741013580918978, 'v32'),
(-2.4448418291953313, 'v33'),
(-2.0787115960875036, 'v34'),
(0.22492914212063603, 'v35'),
(-0.75136276693004922, 'v36'),
(1.2838658951186761, 'v37'),
(0.5816277993227944, 'v38'),
(-0.11270569554555088, 'v39'),
(-0.13430982360936233, 'v40'),
(-3.3189296496897662, 'v41'),
(-0.452575588270415, 'v42'),
(6.1329755709937519, 'v43'),
(0.61559185634634817, 'v44'),
(-1.206315459828555, 'v45'),
(-3.7452010299772009, 'v46'),
(-1.1472174665136678, 'v47'),
(2.8960489381172652, 'v48'),
(0.0090220136972478659, 'v49'),
(-5.264918363314754, 'v50'),
(1.2194758337662015, 'v51'),
(2.78655271320092, 'v52'),
(3.106513852668896, 'v53'),
(3.5181252502607929, 'v54'),
(-0.34426523770507278, 'v55'),
(-0.48792823932479878, 'v56'),
(0.12284460490031779, 'v57'),
(1.6860388628044991, 'v58'),
(1.2823067194737174, 'v59'),
(2.8352263554153665, 'v60'),
(-1.304336378501032, 'v61'),
(0.55226132316435139, 'v62'),
(1.5416988124754771, 'v63'),
(-0.2605804175310813, 'v64'),
(1.2489066081702334, 'v65'),
(-0.44469553013696161, 'v66'),
(-1.4102990055550157, 'v67'),
(3.8150423259022639, 'v68'),
(0.12039684410168072, 'v69'),
(-1.340699466779357, 'v70'),
(1.7066389124439212, 'v71'),
(0.50470752944860442, 'v72'),
(1.0024872633969766, 'v73')]
But the coefficients do not match. Please help.

Note: it matched for the example below:
http://davidcoallier.com/blog/linear-regression-from-r-to-python/
Recommended answer
tl;dr: if you want to replicate R's strategy in Python you're probably going to have to implement it yourself, as R does some clever stuff that isn't widely available elsewhere.
For reference (since it has so far been mentioned only in comments): this is an ill-posed/rank-deficient fit, which will always happen when there are more predictor variables than observations (p > n: in this case p = 73, n = 61), and often when there are many categorical predictors and/or the experimental design is constrained in some way. Dealing with these situations so as to get an answer that means anything at all typically requires careful thought and/or advanced techniques (e.g. penalized regression: see the references to lasso and ridge regression in the Wikipedia article on linear regression).
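To make the penalized-regression suggestion concrete, here is a minimal numpy sketch of ridge regression on a random stand-in for the question's 61×73 design (random data, not the uploaded CSVs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in for the question's design: 61 rows, 73 predictors.
n, p = 61, 73
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Ridge regression minimizes ||y - X b||^2 + lam * ||b||^2.
# The penalty makes X'X + lam*I invertible even though X'X alone
# is singular when p > n, so a unique finite answer exists.
lam = 1.0
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(b_ridge.shape)  # (73,): one finite coefficient per predictor
```

Unlike the unpenalized fit, this produces a single well-defined coefficient vector, at the cost of introducing the tuning parameter `lam`.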
The most naive way to handle this situation is to throw it all into the standard linear algebra and hope that nothing breaks too badly, which is apparently what Python's statsmodels package does; from its pitfalls document:
- Rank deficient matrices will not raise an error.
- Cases of almost perfect multicollinearity or ill-conditioned design matrices might produce numerically unstable results. Users need to manually check the rank or condition number of the matrix if this is not the desired behavior.
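That "don't raise an error" behavior can be seen with numpy's generic least-squares solver, which (like a pseudoinverse-based fit) quietly returns the minimum-norm solution instead of flagging aliased columns — a small sketch, used here as a stand-in for what a naive solver does:

```python
import numpy as np

rng = np.random.default_rng(1)

# Rank-deficient design: the third column is the sum of the first two.
X = rng.normal(size=(10, 2))
X = np.column_stack([X, X[:, 0] + X[:, 1]])
y = rng.normal(size=10)

# lstsq does not raise on rank deficiency; it just reports the rank
# and returns the minimum-norm (pseudoinverse) solution.
b, res, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(rank)  # 2, not 3: the deficiency is detected but not treated as an error
print(b)     # all three coefficients are finite; none is flagged as aliased
```

Contrast this with R's output above, where the aliased columns are reported as `NA` rather than being silently absorbed into the other coefficients.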
The next best thing (reasonable when there is only a small degree of collinearity) is to pivot sensibly when doing the linear algebra, that is, to rearrange the computational problem so that the collinear parts can be left out. That's what R does; to do so, the authors of the R code had to modify the standard LINPACK routines.
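The pivoting idea can be sketched in pure numpy. Note this implements the classic greedy column pivoting of LINPACK's dqrdc / LAPACK's dgeqp3, not R's "limited" dqrdc2 variant (which only cycles near-zero columns to the right edge); the rank-revealing effect is the same:

```python
import numpy as np

def pivoted_qr_rank(A, tol=1e-7):
    """Householder QR with greedy column pivoting: at each step the
    remaining column with the largest residual norm is moved into
    position.  Columns whose residual norm collapses to ~zero are
    (numerically) linear combinations of the columns already chosen,
    so the number of completed steps is the numerical rank."""
    R = A.astype(float).copy()
    n, p = R.shape
    piv = list(range(p))
    scale = max(np.linalg.norm(R), 1.0)
    rank = 0
    for k in range(min(n, p)):
        norms = np.linalg.norm(R[k:, k:], axis=0)
        jmax = int(np.argmax(norms))
        if norms[jmax] <= tol * scale:
            break  # every remaining column is (numerically) dependent
        j = k + jmax
        R[:, [k, j]] = R[:, [j, k]]
        piv[k], piv[j] = piv[j], piv[k]
        # Householder reflector that zeroes R[k+1:, k]
        x = R[k:, k].copy()
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])
        v /= np.linalg.norm(v)
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])
        rank += 1
    return rank, piv

# Hypothetical design: column 2 is exactly column 0 plus column 1.
rng = np.random.default_rng(2)
A = rng.normal(size=(10, 2))
A = np.column_stack([A, A[:, 0] + A[:, 1], rng.normal(size=10)])

rank, piv = pivoted_qr_rank(A)
print(rank)  # 3: the collinear column does not add to the rank
```

The `piv` permutation is what lets a solver fit only the first `rank` pivoted columns and mark the rest as aliased, which is essentially what R's `lm.fit` does below.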
The underlying code carries this comment/explanation:
c dqrdc2 uses householder transformations to compute the qr
c factorization of an n by p matrix x. a limited column
c pivoting strategy based on the 2-norms of the reduced columns
c moves columns with near-zero norm to the right-hand edge of
c the x matrix. this strategy means that sequential one
c degree-of-freedom effects can be computed in a natural way.
c i am very nervous about modifying linpack code in this way.
c if you are a computational linear algebra guru and you really
c understand how to solve this problem please feel free to
c suggest improvements to this code.
This code (and comment) have been in R's code base since 1998; I'd love to know who originally wrote it (based on comments further down in the code it seems to have been Ross Ihaka?), but am having trouble following the code's history back beyond a code reorganization in 1998. (A little more digging suggests that this code has been in R's code base essentially since the beginning of its recorded history, i.e. the file was added in SVN revision 2 1997-09-18 and not modified until much later.)
Martin Mächler recently (25 Oct 2016, here) added more information about this to ?qr, so this information will actually be available in the documentation of the next release of R ...
If you know how to link compiled FORTRAN code with Python code (I don't), it would be pretty easy to compile src/appl/dqrdc2.f and translate the guts of lm.fit into Python: this is the core of lm.fit, minus error-checking and other processing ...
z <- .Call(C_Cdqrls, x, y, tol, FALSE)  ## pivoted QR fit (dqrdc2 under the hood)
coef <- z$coefficients
pivot <- z$pivot
## columns that made it into the fit, in pivoted order
r1 <- seq_len(z$rank)
dn <- colnames(x)
nmeffects <- c(dn[pivot[r1]], rep.int("", n - z$rank))
## columns beyond the computed rank are aliased
r2 <- if (z$rank < p)
    (z$rank + 1L):p
else integer()
if (is.matrix(y)) { ## ...
} else {
    coef[r2] <- NA           ## aliased coefficients become NA
    if (z$pivoted)
        coef[pivot] <- coef  ## undo the column pivoting
    names(coef) <- dn
    names(z$effects) <- nmeffects
}
z$coefficients <- coef
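Pending such a FORTRAN-to-Python bridge, the observable effect of that R logic can be sketched in plain numpy. This emulates the behavior (aliased columns get NA, earlier columns win), not the dqrdc2 algorithm itself; `lm_fit_like` and the toy design are made up for illustration:

```python
import numpy as np

def lm_fit_like(X, y):
    """Rough emulation of R's lm.fit coefficient handling: scan the
    columns left to right, mark a column as aliased when it already
    lies in the span of the columns kept so far, fit least squares on
    the kept columns only, and return NaN (R's NA) for the rest."""
    n, p = X.shape
    keep = []
    for j in range(p):
        trial = keep + [j]
        # keep column j only if it actually raises the rank of the design
        if np.linalg.matrix_rank(X[:, trial]) == len(trial):
            keep.append(j)
    sol, _, _, _ = np.linalg.lstsq(X[:, keep], y, rcond=None)
    coef = np.full(p, np.nan)
    coef[keep] = sol  # scatter back into original column order
    return coef

# Hypothetical toy design: column 3 equals column 1 plus column 2
# (column 0 is the intercept).
rng = np.random.default_rng(3)
Z = rng.normal(size=(8, 2))
X = np.column_stack([np.ones(8), Z, Z[:, 0] + Z[:, 1]])
y = rng.normal(size=8)

coef = lm_fit_like(X, y)
print(coef)  # the last coefficient is nan, like R's NA for aliased terms
```

As in R, which of the collinear columns ends up `NA` depends on column order: earlier columns are kept and later dependent ones are dropped, which is why scikit-learn's pseudoinverse-based coefficients cannot be matched to R's column by column.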