Linear regression in R without copying data in memory?
Question
The standard way of doing a linear regression is something like this:
l <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data=iris)
and then use predict(l, new_data) to make predictions, where new_data is a data frame with columns matching the formula. But lm() returns an lm object, which is a list containing a great deal of material that is irrelevant in most situations. This includes a copy of the original data and a number of named vectors and arrays whose length or size matches that of the data:
R> str(l)
List of 12
$ coefficients : Named num [1:3] 3.587 -0.257 0.364
..- attr(*, "names")= chr [1:3] "(Intercept)" "Petal.Length" "Petal.Width"
$ residuals : Named num [1:150] 0.2 -0.3 -0.126 -0.174 0.3 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ effects : Named num [1:150] -37.445 -2.279 -0.914 -0.164 0.313 ...
..- attr(*, "names")= chr [1:150] "(Intercept)" "Petal.Length" "Petal.Width" "" ...
$ rank : int 3
$ fitted.values: Named num [1:150] 3.3 3.3 3.33 3.27 3.3 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ assign : int [1:3] 0 1 2
$ qr :List of 5
..$ qr : num [1:150, 1:3] -12.2474 0.0816 0.0816 0.0816 0.0816 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:150] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:3] "(Intercept)" "Petal.Length" "Petal.Width"
.. ..- attr(*, "assign")= int [1:3] 0 1 2
..$ qraux: num [1:3] 1.08 1.1 1.01
..$ pivot: int [1:3] 1 2 3
..$ tol : num 1e-07
..$ rank : int 3
..- attr(*, "class")= chr "qr"
$ df.residual : int 147
$ xlevels : Named list()
$ call : language lm(formula = Sepal.Width ~ Petal.Length + Petal.Width, data = iris)
$ terms :Classes 'terms', 'formula' length 3 Sepal.Width ~ Petal.Length + Petal.Width
.. ..- attr(*, "variables")= language list(Sepal.Width, Petal.Length, Petal.Width)
.. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
.. .. .. ..$ : chr [1:2] "Petal.Length" "Petal.Width"
.. ..- attr(*, "term.labels")= chr [1:2] "Petal.Length" "Petal.Width"
.. ..- attr(*, "order")= int [1:2] 1 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(Sepal.Width, Petal.Length, Petal.Width)
.. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
$ model :'data.frame': 150 obs. of 3 variables:
..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
..- attr(*, "terms")=Classes 'terms', 'formula' length 3 Sepal.Width ~ Petal.Length + Petal.Width
.. .. ..- attr(*, "variables")= language list(Sepal.Width, Petal.Length, Petal.Width)
.. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
.. .. .. .. ..$ : chr [1:2] "Petal.Length" "Petal.Width"
.. .. ..- attr(*, "term.labels")= chr [1:2] "Petal.Length" "Petal.Width"
.. .. ..- attr(*, "order")= int [1:2] 1 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(Sepal.Width, Petal.Length, Petal.Width)
.. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
- attr(*, "class")= chr "lm"
All of this takes up a lot of space, and the lm object ends up almost an order of magnitude larger than the original dataset:
R> object.size(iris)
7088 bytes
R> object.size(l)
52704 bytes
This isn't a problem with a dataset this small, but it becomes a real problem with a 170 MB dataset that produces a 450 MB lm object. Even with all the return options set to FALSE, the lm object is still more than 4 times the size of the original dataset:
R> ls <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data=iris, model=FALSE, x=FALSE, y=FALSE, qr=FALSE)
R> object.size(ls)
30568 bytes
Is there any way of fitting a model in R, and then predicting output values on new input data, without storing tons of unnecessary extra data? In other words, is there a way to store just the model coefficients, but still be able to use those coefficients to predict on new data?
Besides not storing all that excess data, I'm also really interested in a way of using lm so that it doesn't even calculate that data, since doing so is just wasted CPU time.
Recommended answer
You can use biglm, from the biglm package:
library(biglm)
m <- biglm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
Since biglm does not store the data in the output object, you need to provide your data when making predictions:
p <- predict(m, newdata=iris)
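To illustrate what "store only the coefficients and still predict" amounts to, here is a minimal base R sketch that does not use biglm. The helper predict_from_coef is a hypothetical name introduced for this example, not part of any package. It assumes a full-rank fit (no NA coefficients) and a formula without data-dependent transformations such as poly() or scale(); factor predictors would additionally need the same levels in the new data.

```r
# Fit once with lm(), then keep only the named coefficient vector.
f <- Sepal.Width ~ Petal.Length + Petal.Width
fit <- lm(f, data = iris)
beta <- coef(fit)

# Hypothetical helper: rebuild the design matrix for new data from the
# right-hand side of the formula and multiply by the stored coefficients.
predict_from_coef <- function(beta, rhs, newdata) {
  X <- model.matrix(rhs, data = newdata)
  # Select columns by coefficient name so the ordering always matches.
  drop(X[, names(beta), drop = FALSE] %*% beta)
}

p_slim <- predict_from_coef(beta, ~ Petal.Length + Petal.Width, newdata = iris)

# Compare against predict() on the full lm object; the two should agree.
p_full <- predict(fit, newdata = iris)
print(all.equal(unname(p_slim), unname(p_full)))

# The bulky lm object is no longer needed once beta has been extracted.
rm(fit)
```

Note that this only avoids keeping the large object around; lm() still computes it during fitting, so it does not address the wasted CPU time raised in the question.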
The size of a biglm object depends on the number of parameters, not on the number of observations:
> object.size(m)
6720 bytes
> d <- rbind(iris, iris)
> m <- biglm(Sepal.Width ~ Petal.Length + Petal.Width, data=d)
> object.size(m)
6720 bytes
biglm also allows you to update the model with a new chunk of data using the update method. This lets you estimate models even when the complete dataset does not fit in memory.
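The chunked workflow can be sketched as follows. Splitting iris into three blocks of 50 rows is an assumption made purely for illustration; in practice each chunk would be read from disk or a database. The sketch assumes the biglm package is installed, and the comparison against lm() is only there as a sanity check.

```r
library(biglm)

f <- Sepal.Width ~ Petal.Length + Petal.Width

# Stand-in for data arriving in pieces: three chunks of 50 rows each.
chunks <- split(iris, rep(1:3, each = 50))

# Fit on the first chunk, then fold in the remaining chunks one at a time.
m_chunked <- biglm(f, data = chunks[[1]])
for (chunk in chunks[-1]) {
  m_chunked <- update(m_chunked, chunk)
}

# Sanity check: the coefficients should agree with a fit on the full data.
m_full <- lm(f, data = iris)
print(all.equal(unname(coef(m_chunked)), unname(coef(m_full))))
```

When the predictors include factors, every chunk needs to carry the same factor levels, otherwise the design matrices of the chunks will not line up.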