R car :: Anova的更快替代品,用于预测变量子集的平方叉积矩阵求和 [英] Faster alternative to R car::Anova for sum of square crossproduct matrix calculation for subsets of predictors
问题描述
我需要在Y(n x q)和X(n x p)的多元线性模型中计算叉积平方和(实际上是该矩阵的迹线).为此的标准R代码是:
I need to compute the sum of squares crossproduct matrix (indeed the trace of this matrix) in a multivariate linear model, with Y (n x q) and X (n x p). Standard R code for doing that is:
require(MASS)
require(car)
# Example data
q <- 10
n <- 1000
p <- 10
Y <- mvrnorm(n, mu = rep(0, q), Sigma = diag(q))
X <- as.data.frame(mvrnorm(n, mu = rnorm(p), Sigma = diag(p)))
# Fit lm
fit <- lm( Y ~ ., data = X )
# Type I sums of squares
summary(manova(fit))$SS
# Type III sums of squares
type = 3 # could be also 2 (II)
car::Anova(fit, type = type)$SSP
必须执行数千次,不幸的是,当预测变量的数量相对较大时,它会变慢.通常,我只对 s
个预测变量的子集感兴趣,因此我尝试重新实现此计算.尽管我的实现直接为 s
= 1(以下)翻译线性代数对于较小的样本量(n)更快,
This has to be done thousands of times, unfortunately, it gets slow when the number of predictors is relatively large. As often I am interested only in a subset of s
predictors, I tried to re-implement this calculation. Although my implementation directly translating linear algebra for s
= 1 (below) is faster for small sample sizes (n),
# Hat matrix (X here stands for the actual design matrix)
H <- tcrossprod(tcrossprod(X, solve(crossprod(X))), X)
# Remove predictor of interest (e.g. 2)
X.r <- X[, -2]
H1 <- tcrossprod(tcrossprod(X.r, solve(crossprod(X.r))), X.r)
# Compute e.g. type III sum of squares
SS <- crossprod(Y, H - H1) %*% Y
car
对于较大的n来说,运行速度仍然更快:
car
still goes faster for large n:
我已经尝试了 Rcpp
的实现,该实现非常成功,因为R中的这些矩阵产品已经使用了非常有效的代码.
I already tried Rcpp
implementation which much success, as these matrix products in R already use a very efficient code.
关于如何更快地执行此操作的任何提示?
Any hint on how to do this faster?
更新
After reading the answers, I tried the solution proposed in this post which relies on QR/SVD/Cholesky factorization for hat matrix calculation. However it seems that car::Anova
is still faster to compute all p = 30 matrices than me computing just one (s = 1)!! for e.g. n = 5000, q = 10:
Unit: milliseconds
expr min lq mean median uq max neval
ME 1137.5692 1202.9888 1257.8979 1251.6834 1318.9282 1398.9343 10
QR 1005.9082 1031.9911 1084.5594 1037.5659 1095.7449 1364.9508 10
SVD 1026.8815 1065.4629 1152.6631 1087.9585 1241.4977 1446.8318 10
Chol 969.9089 1056.3093 1115.9608 1102.1169 1210.7782 1267.1274 10
CAR 205.1665 211.8523 218.6195 214.6761 222.0973 242.4617 10
更新2
The best solution for now was to go over the car::Anova
code (i.e. functions car:::Anova.III.mlm
and subsequently car:::linearHypothesis.mlm
) and re-implement them to account for a subset of predictors, instead of all of them.
car
的相关代码如下(我跳过了检查,并简化了一点):
The relevant code by car
is as follows (I skipped checks, and simplified a bit):
B <- coef(fit) # Model coefficients
M <- model.matrix(fit) # Model matrix M
V <- solve(crossprod(M)) # M'M
p <- ncol(M) # Number of predictors in M
I.p <- diag(p) # Identity (p x p)
terms <- labels(terms(fit)) # terms (add intercept)
terms <- c("(Intercept)", terms)
n.terms <- length(terms)
assign <- fit$assign # assignation terms <-> p variables
SSP <- as.list(rep(0, n.terms)) # Initialize empty list for sums of squares cross-product matrices
names(SSP) <- terms
for (term in 1:n.terms){
subs <- which(assign == term - 1)
L <- I.p[subs, , drop = FALSE]
SSP[[term]] <- t(L %*% B) %*% solve(L %*% V %*% t(L)) %*% (L %*% B)
}
然后,只需选择条款的子集即可.
Then it is just a matter of selecting the subset of terms.
推荐答案
此行以及下面与 H1
相似的行可能会得到改进:
This line and the similar one below it for H1
could probably be improved:
H <- tcrossprod(tcrossprod(X, solve(crossprod(X))), X)
通常的想法是您应该很少使用 solve(Y)%*%Z
,因为它与 solve(Y,Z)
相同,但速度较慢.我还没有完全扩展您的 tcrossprod
调用,以查看 H
和 H1
的最佳等效表达式是什么.
The general idea is that you should rarely use solve(Y) %*% Z
, because it is the same as solve(Y, Z)
but slower. I haven't fully expanded your tcrossprod
calls to see what the best equivalent formulation of the expressions for H
and H1
would be.
You could also look at this question https://stats.stackexchange.com/questions/139969/speeding-up-hat-matrices-like-xxx-1x-projection-matrices-and-other-as for a description of doing it via QR decomposition.
这篇关于R car :: Anova的更快替代品,用于预测变量子集的平方叉积矩阵求和的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!