How to make group_by and lm fast?


Question

Here is an example.

library(dplyr)  # provides tibble(), %>%, group_by(), summarise()

df <- tibble(
      subject = rep(letters[1:7], c(5, 6, 7, 5, 2, 5, 2)),
      day = c(3:7, 2:7, 1:7, 3:7, 6:7, 3:7, 6:7),
      x1 = runif(32), x2 = rpois(32, 3), x3 = rnorm(32), x4 = rnorm(32, 1, 5))

df %>%
  group_by(subject) %>%
  summarise(
    coef_x1 = lm(x1 ~ day)$coefficients[2],
    coef_x2 = lm(x2 ~ day)$coefficients[2],
    coef_x3 = lm(x3 ~ day)$coefficients[2],
    coef_x4 = lm(x4 ~ day)$coefficients[2])

This data is small, so performance is not a problem.

But my data is very large, with approximately 1,000,000 rows and 200,000 subjects, and this code is very slow.

I think the cause is not lm's speed, but the subsetting over so many subjects.

Answer

In theory

Firstly, you can fit a linear model with multiple LHS.

Secondly, explicit data splitting is not the only way (or a recommended way) to do group-by regression. See R regression analysis: analyzing data for a certain ethnicity and R: build separate models for each category. So build your model as cbind(x1, x2, x3, x4) ~ day * subject, where subject is a factor variable.
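
As a quick sketch of my own (not from the original answer), this is what the combined multiple-LHS, group-by fit looks like on the question's small df; on the full data lm() would be infeasible, as discussed next:

# Sketch only: multiple-LHS lm on the question's small df.
df$subject <- factor(df$subject)                 # subject must be a factor
fit_all <- lm(cbind(x1, x2, x3, x4) ~ day * subject, data = df)
head(coef(fit_all))                              # one column of coefficients per response x1..x4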

Finally, since you have many factor levels and work with a large dataset, lm is infeasible. Consider using speedglm::speedlm with sparse = TRUE, or MatrixModels::glm4 with sparse = TRUE.

Neither speedlm nor glm4 is under active development. Their functionality is (in my opinion) primitive.

Neither speedlm nor glm4 supports multiple LHS as lm does. So you need to fit 4 separate models, x1 ~ day * subject through x4 ~ day * subject, instead.
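
A minimal sketch of that loop (my own illustration on the question's small df; the formula itself is refined further below):

library(MatrixModels)

# Fit one sparse model per response, since glm4 has no multiple-LHS support.
# (Assumes subject has already been converted to a factor, as above.)
responses <- c("x1", "x2", "x3", "x4")
fits <- lapply(responses, function(v)
  glm4(reformulate("day * subject", response = v), data = df, sparse = TRUE))
names(fits) <- responses
lapply(fits, coef)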

These two packages have different logic behind sparse = TRUE.

  • speedlm first constructs a dense design matrix using the standard model.matrix.default, then uses is.sparse to check whether it is sparse. If TRUE, subsequent computations can use sparse methods.
  • glm4 uses model.Matrix to construct the design matrix, so a sparse matrix can be built directly (see the sketch after this list).
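
To make the difference concrete, here is a small sketch of my own (not from the original answer), using a toy factor g with many levels; the class and size checks are illustrative only:

library(MatrixModels)

d <- data.frame(g = gl(1000, 5), t = 1:5, y = rnorm(5000))

# glm4-style: model.Matrix() can build the sparse design matrix directly
X_sparse <- model.Matrix(~ 0 + g + t:g, data = d, sparse = TRUE)
class(X_sparse)      # a sparse model-matrix class from the Matrix family

# speedlm-style: a dense matrix is built first, which is what exhausts memory
X_dense <- model.matrix(~ 0 + g + t:g, data = d)
print(object.size(X_dense), units = "MB")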

So it is not surprising that speedlm is as bad as lm on this sparsity issue, and glm4 is the one we really want to use.

glm4 does not have a full, useful set of generic functions for analyzing fitted models. You can extract coefficients, fitted values and residuals via coef, fitted and residuals, but you have to compute all statistics (standard errors, t-statistics, F-statistics, etc.) yourself. This is not a big deal for people who know regression theory well, but it is still rather inconvenient.

glm4 still expects you to use the best model formula so that the sparsest matrix can be constructed. The conventional ~ day * subject is really not a good one; I should probably set up a Q & A on this issue later. Basically, if your formula has an intercept and factors are contrasted, you lose sparsity. This is the formula we should use: ~ 0 + subject + day:subject.

## use chinsoon12's data in his answer
set.seed(0L)
nSubj <- 200e3
nr <- 1e6
DF <- data.frame(subject = gl(nSubj, 5),
                 day = 3:7,
                 y1 = runif(nr), 
                 y2 = rpois(nr, 3), 
                 y3 = rnorm(nr), 
                 y4 = rnorm(nr, 1, 5))

library(MatrixModels)
fit <- glm4(y1 ~ 0 + subject + day:subject, data = DF, sparse = TRUE)

It takes about 6 ~ 7 seconds on my 1.1 GHz Sandy Bridge laptop. Let's extract the coefficients:

b <- coef(fit)

head(b)
#  subject1   subject2   subject3   subject4   subject5   subject6 
# 0.4378952  0.3582956 -0.2597528  0.8141229  1.3337102 -0.2168463 

tail(b)
#subject199995:day subject199996:day subject199997:day subject199998:day 
#      -0.09916175       -0.15653402       -0.05435883       -0.02553316 
#subject199999:day subject200000:day 
#       0.02322640       -0.09451542 

You can do B <- matrix(b, ncol = 2), so that the 1st column holds the intercepts and the 2nd the slopes.
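
A quick sketch of that reshape (the column and row labels are my own, added for readability; it relies on the coefficients being ordered as all intercepts first, then all slopes, as in the output above):

B <- matrix(b, ncol = 2)              # column 1: intercepts, column 2: slopes
colnames(B) <- c("intercept", "slope")
rownames(B) <- levels(DF$subject)
head(B)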

Using glm4 here does not give an attractive advantage over chinsoon12's data.table solution, since that also basically just gives you the regression coefficients. It is also a bit slower than the data.table method, because it computes fitted values and residuals.

Analysis of simple regression does not require a proper model-fitting routine. I have a few answers on how to do fancy stuff with this kind of regression, like Fast pairwise simple linear regression between all variables in a data frame, where details on how to compute all statistics are also given. But as I wrote this answer, I was thinking about the large group-by regression problem in general. We might need better packages; otherwise there is no end to case-by-case coding.

speedglm::speedlm(x1 ~ 0 + subject + day:subject, data = df, sparse = TRUE) gives Error: cannot allocate vector of size 74.5 Gb

Yeah, that's because its sparse logic is poor.

MatrixModels::glm4(x1 ~ day * subject, data = df, sparse = TRUE) gives Error in Cholesky(crossprod(from), LDL = FALSE): internal_chm_factor: Cholesky factorization failed

This is because you have only one datum for some subjects. You need at least two data points to fit a line. Here is an example (in a dense setting):

dat <- data.frame(t = c(1:5, 1:9, 1),
                  f = rep(gl(3,1,labels = letters[1:3]), c(5, 9, 1)),
                  y = rnorm(15))

Level "c" of f has only one datum / row.

X <- model.matrix(~ 0 + f + t:f, dat)
XtX <- crossprod(X)
chol(XtX)
#Error in chol.default(XtX) : 
#  the leading minor of order 6 is not positive definite

Cholesky factorization is unable to handle a rank-deficient model. If we use lm's QR factorization, we see an NA coefficient.

lm(y ~ 0 + f + t:f, dat)
#Coefficients:
#      fa        fb        fc      fa:t      fb:t      fc:t  
# 0.49893   0.52066  -1.90779  -0.09415  -0.03512        NA  

We can only estimate an intercept for level "c", not a slope.

Note that if you use the data.table solution, you end up with 0 / 0 when computing the slope for this level, and the final result is NaN.
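
For illustration only (assuming the data.table solution computes the per-group slope with the closed-form least-squares formula, as in the referenced answer), here is how that NaN arises on the small dat example above:

library(data.table)

dt <- as.data.table(dat)
# closed-form slope of y on t within each level of f
dt[, .(slope = sum((t - mean(t)) * (y - mean(y))) / sum((t - mean(t))^2)), by = f]
# level "c" has a single row, so both sums are 0 and the slope is 0/0 = NaN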

See Fast group-by simple linear regression and general paired simple linear regression.
