当我对一个观测值进行回归时,为什么`fastLm()`返回结果? [英] Why does `fastLm()` return results when I run a regression with one observation?

查看:150
本文介绍了当我对一个观测值进行回归时,为什么`fastLm()`返回结果?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

当我用一个观察值进行回归分析时,为什么 fastLm()返回结果吗?

Why does fastLm() return results when I run regressions with one observation?

在下面,为什么 lm() fastLm()结果为什么不相等?

In the following, why aren't the lm() and fastLm() results equal?

library(Rcpp)
library(RcppArmadillo)
library(data.table)
set.seed(1)
DT <- data.table(y = rnorm(5), x1 = rnorm(5), x2 = rnorm(5), my.key = 1:5)
#             y         x1         x2 my.key
# 1: -0.6264538 -0.8204684  1.5117812      1
# 2:  0.1836433  0.4874291  0.3898432      2
# 3: -0.8356286  0.7383247 -0.6212406      3
# 4:  1.5952808  0.5757814 -2.2146999      4
# 5:  0.3295078 -0.3053884  1.1249309      5

lm(y ~ 1 + x1 + x2, data = DT[my.key == 1])
# Coefficients:
# (Intercept)           x1           x2  
#     -0.6265           NA           NA

fastLm(X = model.matrix(y ~ 1 + x1 + x2, data = DT[my.key == 1]), y = DT[my.key == 1]$y)
# Coefficients:
# (Intercept)          x1          x2 
#    -0.15825     0.12984    -0.23924 

model.matrix(y ~ 1 + x1 + x2, data = DT[my.key == 1])
#   (Intercept)        x1       x2
#             1 -0.8204684 1.511781
# attr(,"assign")
# [1] 0 1 2

DT[my.key == 1]$y
# [1] -0.6264538

当我使用整个 DT 时,结果是相等的:

When I use the whole DT the results are equal:

 all.equal(fastLm(X = model.matrix(y ~ 1 + x1 + x2, data = DT), y = DT$y)$coef, 
           lm.fit(x = model.matrix(y ~ 1 + x1 + x2, data = DT), y = DT$y)$coef)
# [1] TRUE

使用修改后的

From the RcppArmadillo library with a modified fLmTwoCasts I also get the same behavior:

src <- '
Rcpp::List fLmTwoCastsOnlyCoefficients(Rcpp::NumericMatrix Xr, Rcpp::NumericVector yr) {
    int n = Xr.nrow(), k = Xr.ncol();
    arma::mat X(Xr.begin(), n, k, false);
    arma::colvec y(yr.begin(), yr.size(), false);
    arma::colvec coef = arma::solve(X, y);
    return Rcpp::List::create(Rcpp::Named("coefficients")=trans(coef));
}
'
cppFunction(code=src, depends="RcppArmadillo")

XX <- model.matrix(y ~ 1 + x1 + x2, data = DT[my.key == 1])
YY <- DT[my.key == 1]$y
fLmTwoCastsOnlyCoefficients(XX, YY)$coef
#            [,1]      [,2]       [,3]
# [1,] -0.1582493 0.1298386 -0.2392384

使用整个 DT ,系数应该相同:

Using the whole DT the coefficients are identical as they should:

lm(y ~ 1 + x1 + x2, data = DT)$coef
# (Intercept)          x1          x2 
#   0.2587933  -0.7709158  -0.6648270

XX.whole <- model.matrix(y ~ 1 + x1 + x2, data = DT)
YY.whole <- DT$y
fLmTwoCastsOnlyCoefficients(XX.whole, YY.whole)
#           [,1]       [,2]      [,3]
# [1,] 0.2587933 -0.7709158 -0.664827

推荐答案

因为 fastLm 无需担心排名不足;这是您为速度付出的代价的一部分.

Because fastLm doesn't worry about rank-deficiency; this is part of the price you pay for speed.

来自?fastLm :

... Armadillo可以比stats包中的功能更快地执行lm.fit之类的原因是因为Armadillo使用了QR分解的Lapack版本,而stats包使用了经过修改的Linpack版本.因此,犰狳使用3级BLAS代码,而stats软件包使用1级BLAS代码.但是,Armadillo可能会失败,或者更糟的是,在秩不足的模型矩阵上会产生完全错误的答案,而由于修改了Linpack代码,因此stats包中的函数将正确处理它们.

... The reason that Armadillo can do something like lm.fit faster than the functions in the stats package is because Armadillo uses the Lapack version of the QR decomposition while the stats package uses a modified Linpack version. Hence Armadillo uses level-3 BLAS code whereas the stats package uses level-1 BLAS. However, Armadillo will either fail or, worse, produce completely incorrect answers on rank-deficient model matrices whereas the functions from the stats package will handle them properly due to the modified Linpack code.

查看代码此处,代码的胆量是

Looking at the code here, the guts of the code are

 arma::colvec coef = arma::solve(X, y);

这将执行QR分解.我们可以将 lmFast 结果与基本R中的 qr()相匹配(在这里,我不仅仅使用基本R结构,还依赖于 data.table ):

This does a QR decomposition. We can match the lmFast results with qr() from base R (here I am not using only base R constructs rather than relying on data.table):

set.seed(1)
dd <- data.frame(y = rnorm(5), 
      x1 = rnorm(5), x2 = rnorm(5), my.key = 1:5)

X <- model.matrix(~1+x1+x2, data=subset(dd,my.key==1))
qr(X,dd$y)
## $qr
##   (Intercept)         x1       x2
## 1           1 -0.8204684 1.511781

您可以查看 lm.fit()的代码,以了解R在拟合线性模型时对秩不足的影响;它所调用的基本BLAS算法可进行枢轴旋转QR ...

You can look at the code of lm.fit() to see what R does about rank deficiency when fitting linear models; the underlying BLAS algorithm it calls does QR with pivoting ...

如果您想标记这些情况,我认为 Matrix :: rankMatrix()可以解决问题:

If you want to flag these situations, I think that Matrix::rankMatrix() will do the trick:

library(Matrix)
rankMatrix(X) < ncol(X)  ## TRUE
X1 <- model.matrix(~1+x1+x2, data=dd)
rankMatrix(X1) < ncol(X1)  ## FALSE

这篇关于当我对一个观测值进行回归时,为什么`fastLm()`返回结果?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆