Why does lm run out of memory while matrix multiplication works fine for coefficients?


Question

I am trying to do fixed effects linear regression with R. My data looks like

dte   yr   id   v1   v2
  .    .    .    .    .
  .    .    .    .    .
  .    .    .    .    .

I then decided to simply do this by making yr a factor and using lm:

lm(v1 ~ factor(yr) + v2 - 1, data = df)

However, this seems to run out of memory. I have 20 levels in my factor and df is 14 million rows, which takes about 2 GB to store. I am running this on a machine with 22 GB dedicated to this process.

I then decided to try things the old-fashioned way: create dummy variables for each of my years t1 to t20 by doing:

df$t1 <- 1*(df$yr==1)
df$t2 <- 1*(df$yr==2)
df$t3 <- 1*(df$yr==3)
...

and simply compute:

solve(crossprod(x), crossprod(x,y))

This runs without a problem and produces the answer almost right away.
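(For reference, x and y in the call above are presumably built from the dummy columns and the response, along these lines; the exact construction is not shown in the question:)

x <- as.matrix(df[, c(paste0("t", 1:20), "v2")])  # the 20 dummy columns plus v2
y <- df$v1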

I am specifically curious: what is it about lm that makes it run out of memory when I can compute the coefficients just fine? Thanks.

Answer

None of the answers so far has pointed in the right direction.

The accepted answer by @idr confuses lm and summary.lm. lm computes no diagnostic statistics at all; summary.lm does. So he is talking about summary.lm.

@Jake's answer is a fact about the numerical stability of QR factorization versus LU / Cholesky factorization. Aravindakshan's answer expands on this by pointing out the amount of floating point operations behind both approaches (though, as he said, he did not count the cost of computing the matrix cross product). But do not confuse FLOP counts with memory costs: both methods have the same memory usage in LINPACK / LAPACK. Specifically, his argument that the QR method costs more RAM to store the Q factor is bogus. The compacted storage, as explained in What is qraux returned by QR decomposition in LINPACK / LAPACK, clarifies how the QR factorization is computed and stored. The speed issue of QR vs. Cholesky is detailed in my answer Why the built-in lm function is so slow in R?, and my answer on faster lm provides a small routine lm.chol using the Cholesky method, which is 3 times faster than the QR method.

@Greg's answer / suggestion about biglm is good, but it does not answer the question. Since biglm is mentioned, I would point out that the QR decomposition differs between lm and biglm. biglm computes Householder reflections so that the resulting R factor has positive diagonals. See Cholesky factor via QR factorization for details. The reason biglm does this is that the resulting R will be the same as the Cholesky factor; see QR decomposition and Choleski decomposition in R for more information. Also, apart from biglm, you can use mgcv. Read my answer biglm predict unable to allocate a vector of size xx.x MB for more.

After that roundup, it is time to post my own answer.

To fit a linear model, lm does the following (a rough R sketch of these steps appears right after the list):

  1. generates a model frame;
  2. generates a model matrix;
  3. calls lm.fit for QR factorization;
  4. returns the result of the QR factorization as well as the model frame in lmObject.
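In rough terms, these steps correspond to the following R calls (a simplified sketch only; the real lm code passes many more arguments through model.frame and model.matrix):

mf  <- model.frame(v1 ~ factor(yr) + v2 - 1, data = df)   # 1. model frame: a copy of the columns used
X   <- model.matrix(attr(mf, "terms"), mf)                # 2. model matrix: factor(yr) expanded to dummies
fit <- lm.fit(X, model.response(mf))                      # 3. QR factorization via lm.fit
## 4. lm then bundles `fit` together with the model frame into the returned "lm" object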

You said your input data frame with 5 columns costs 2 GB to store. With 20 factor levels, the resulting model matrix has about 25 columns, taking 10 GB of storage.
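As a rough illustration of where those columns come from (toy data, only the shape matters):

toy <- data.frame(yr = rep(1:20, 3), v1 = rnorm(60), v2 = rnorm(60))
Xt  <- model.matrix(v1 ~ factor(yr) + v2 - 1, data = toy)
dim(Xt)   # 60 rows, 21 columns: 20 dummy columns for factor(yr) plus v2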

Now let's see how memory usage grows when we call lm:

  • [global environment] initially you have 2 GB of storage for the data frame;
  • [lm environment] it is then copied into a model frame, costing 2 GB;
  • [lm environment] a model matrix is then generated, costing 10 GB;
  • [lm.fit environment] a copy of the model matrix is made and then overwritten by the QR factorization, costing 10 GB;
  • [lm environment] the result of lm.fit is returned, costing 10 GB;
  • [global environment] the result of lm.fit is further returned by lm, costing another 10 GB;
  • [global environment] the model frame is returned by lm, costing 2 GB.

So in total 2 + 2 + 10 + 10 + 10 + 10 + 2 = 46 GB of RAM is required, far more than your available 22 GB.

Actually, if lm.fit could be "inlined" into lm, we would save 20 GB. But there is no way to inline one R function into another R function.

Maybe we can take a small example to see what happens around lm.fit:

X <- matrix(rnorm(30), 10, 3)    # a `10 * 3` model matrix
y <- rnorm(10)    ## response vector

tracemem(X)
# [1] "<0xa5e5ed0>"

qrfit <- lm.fit(X, y)
# tracemem[0xa5e5ed0 -> 0xa1fba88]: lm.fit 

So indeed, X is copied when passed into lm.fit. Let's have a look at what qrfit contains:

str(qrfit)
#List of 8
# $ coefficients : Named num [1:3] 0.164 0.716 -0.912
#  ..- attr(*, "names")= chr [1:3] "x1" "x2" "x3"
# $ residuals    : num [1:10] 0.4 -0.251 0.8 -0.966 -0.186 ...
# $ effects      : Named num [1:10] -1.172 0.169 1.421 -1.307 -0.432 ...
#  ..- attr(*, "names")= chr [1:10] "x1" "x2" "x3" "" ...
# $ rank         : int 3
# $ fitted.values: num [1:10] -0.466 -0.449 -0.262 -1.236 0.578 ...
# $ assign       : NULL
# $ qr           :List of 5
#  ..$ qr   : num [1:10, 1:3] -1.838 -0.23 0.204 -0.199 0.647 ...
#  ..$ qraux: num [1:3] 1.13 1.12 1.4
#  ..$ pivot: int [1:3] 1 2 3
#  ..$ tol  : num 1e-07
#  ..$ rank : int 3
#  ..- attr(*, "class")= chr "qr"
# $ df.residual  : int 7

Note that the compact QR matrix qrfit$qr$qr is as large as the model matrix X. It is created inside lm.fit, but it is copied again on exit of lm.fit. So in total we will have 3 "copies" of X:

  • the original one in the global environment;
  • the one copied into lm.fit, then overwritten by the QR factorization;
  • the one returned by lm.fit.

In your case, X is 10 GB, so the memory cost associated with lm.fit alone is already 30 GB, let alone the other costs associated with lm.
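On the toy example you can also confirm that the returned compact QR matrix occupies as much memory as X itself:

object.size(X)             # size of the toy model matrix
object.size(qrfit$qr$qr)   # essentially the same size: another 10 * 3 double matrix
dim(qrfit$qr$qr)           # 10 3, matching dim(X)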

On the other hand, let's look at:

solve(crossprod(X), crossprod(X,y))

X takes 10 GB, but crossprod(X) is only a 25 * 25 matrix, and crossprod(X, y) is just a length-25 vector. They are tiny compared with X, so memory usage does not increase at all.
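On the toy matrix (3 columns instead of 25, but the point is the same) the sizes are easy to check:

dim(crossprod(X))                      # 3 x 3: p x p, independent of the number of rows
length(crossprod(X, y))                # 3
solve(crossprod(X), crossprod(X, y))   # same coefficients as qrfit$coefficients, up to rounding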

Maybe you are worried that a local copy of X will be made when crossprod is called? Not at all! Unlike lm.fit, which both reads and writes X, crossprod only reads X, so no copy is made. We can verify this with our toy matrix X:

tracemem(X)
crossprod(X)

You will see no copying message!

If you want a brief summary of all the above, here it is:

  • the memory cost of lm.fit(X, y) (or even .lm.fit(X, y)) is three times that of solve(crossprod(X), crossprod(X, y));
  • depending on how much larger the model matrix is than the model frame, the memory cost of lm is 3 to 6 times that of solve(crossprod(X), crossprod(X, y)). The lower bound 3 is never reached, while the upper bound 6 is reached when the model matrix is the same as the model frame. This is the case when there are no factor variables or "factor-alike" terms like bs(), poly(), etc.
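For completeness, here is a minimal sketch of the Cholesky route mentioned above. lm_chol below is an illustrative helper written for this post, not the lm.chol routine from the linked answer, and it assumes X has full column rank:

lm_chol <- function(X, y) {
  XtX <- crossprod(X)        # p x p, tiny compared with X
  Xty <- crossprod(X, y)     # length-p vector
  R   <- chol(XtX)           # upper-triangular factor with XtX = R'R
  backsolve(R, forwardsolve(t(R), Xty))   # two triangular solves for R'R b = X'y
}

lm_chol(X, y)   # matches solve(crossprod(X), crossprod(X, y)) on the toy data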

