R car :: Anova的更快替代品,用于预测变量子集的平方叉积矩阵求和 [英] Faster alternative to R car::Anova for sum of square crossproduct matrix calculation for subsets of predictors

查看:40
本文介绍了R car :: Anova的更快替代品,用于预测变量子集的平方叉积矩阵求和的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要在Y(n x q)和X(n x p)的多元线性模型中计算叉积平方和(实际上是该矩阵的迹线).为此的标准R代码是:

I need to compute the sum of squares crossproduct matrix (indeed the trace of this matrix) in a multivariate linear model, with Y (n x q) and X (n x p). Standard R code for doing that is:

require(MASS)
require(car)

# Example data 
q <- 10
n  <- 1000
p <- 10
Y <- mvrnorm(n, mu = rep(0, q), Sigma = diag(q))
X <- as.data.frame(mvrnorm(n, mu = rnorm(p), Sigma = diag(p)))

# Fit lm
fit <- lm( Y ~ ., data = X )

# Type I sums of squares
summary(manova(fit))$SS    

# Type III sums of squares
type = 3 # could be also 2 (II)
car::Anova(fit, type = type)$SSP

必须执行数千次,不幸的是,当预测变量的数量相对较大时,它会变慢.通常,我只对 s 个预测变量的子集感兴趣,因此我尝试重新实现此计算.尽管我的实现直接为 s = 1(以下)翻译线性代数对于较小的样本量(n)更快,

This has to be done thousands of times, unfortunately, it gets slow when the number of predictors is relatively large. As often I am interested only in a subset of s predictors, I tried to re-implement this calculation. Although my implementation directly translating linear algebra for s = 1 (below) is faster for small sample sizes (n),

# Hat matrix (X here stands for the actual design matrix)
H <- tcrossprod(tcrossprod(X, solve(crossprod(X))), X)

# Remove predictor of interest (e.g. 2)
X.r <- X[, -2]  
H1 <- tcrossprod(tcrossprod(X.r, solve(crossprod(X.r))), X.r) 

# Compute e.g. type III sum of squares
SS <- crossprod(Y, H - H1) %*% Y

car 对于较大的n来说,运行速度仍然更快:

car still goes faster for large n:

我已经尝试了 Rcpp 的实现,该实现非常成功,因为R中的这些矩阵产品已经使用了非常有效的代码.

I already tried Rcpp implementation which much success, as these matrix products in R already use a very efficient code.

关于如何更快地执行此操作的任何提示?

Any hint on how to do this faster?

更新

阅读答案后,我尝试了此

After reading the answers, I tried the solution proposed in this post which relies on QR/SVD/Cholesky factorization for hat matrix calculation. However it seems that car::Anova is still faster to compute all p = 30 matrices than me computing just one (s = 1)!! for e.g. n = 5000, q = 10:

Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
   ME 1137.5692 1202.9888 1257.8979 1251.6834 1318.9282 1398.9343    10
   QR 1005.9082 1031.9911 1084.5594 1037.5659 1095.7449 1364.9508    10
  SVD 1026.8815 1065.4629 1152.6631 1087.9585 1241.4977 1446.8318    10
 Chol  969.9089 1056.3093 1115.9608 1102.1169 1210.7782 1267.1274    10
  CAR  205.1665  211.8523  218.6195  214.6761  222.0973  242.4617    10

更新2

目前最好的解决方案是查看 car::Anova

The best solution for now was to go over the car::Anova code (i.e. functions car:::Anova.III.mlm and subsequently car:::linearHypothesis.mlm) and re-implement them to account for a subset of predictors, instead of all of them.

car 的相关代码如下(我跳过了检查,并简化了一点):

The relevant code by car is as follows (I skipped checks, and simplified a bit):

B <- coef(fit)                    # Model coefficients
M <- model.matrix(fit)            # Model matrix M
V <- solve(crossprod(M))          # M'M
p <- ncol(M)                      # Number of predictors in M
I.p <- diag(p)                    # Identity (p x p)
terms <- labels(terms(fit))       # terms (add intercept)       
terms <- c("(Intercept)", terms)   
n.terms <- length(terms)
assign <- fit$assign              # assignation terms <-> p variables
  
SSP <- as.list(rep(0, n.terms))   # Initialize empty list for sums of squares cross-product matrices
names(SSP) <- terms
  
for (term in 1:n.terms){
    subs <- which(assign == term - 1)
    L <- I.p[subs, , drop = FALSE]
    SSP[[term]] <- t(L %*% B) %*% solve(L %*% V %*% t(L)) %*% (L %*% B)
}

然后,只需选择条款的子集即可.

Then it is just a matter of selecting the subset of terms.

推荐答案

此行以及下面与 H1 相似的行可能会得到改进:

This line and the similar one below it for H1 could probably be improved:

H <- tcrossprod(tcrossprod(X, solve(crossprod(X))), X)

通常的想法是您应该很少使用 solve(Y)%*%Z ,因为它与 solve(Y,Z)相同,但速度较慢.我还没有完全扩展您的 tcrossprod 调用,以查看 H H1 的最佳等效表达式是什么.

The general idea is that you should rarely use solve(Y) %*% Z, because it is the same as solve(Y, Z) but slower. I haven't fully expanded your tcrossprod calls to see what the best equivalent formulation of the expressions for H and H1 would be.

您还可以查看以下问题

You could also look at this question https://stats.stackexchange.com/questions/139969/speeding-up-hat-matrices-like-xxx-1x-projection-matrices-and-other-as for a description of doing it via QR decomposition.

这篇关于R car :: Anova的更快替代品,用于预测变量子集的平方叉积矩阵求和的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆