Growing a data.frame in a memory-efficient manner
Question
According to "Creating an R dataframe row-by-row", it's not ideal to append to a data.frame using rbind, as it creates a copy of the whole data.frame each time. How do I accumulate data in R resulting in a data.frame without incurring this penalty? The intermediate format doesn't need to be a data.frame.
Recommended answer
First approach
I tried accessing each element of a pre-allocated data.frame:
res <- data.frame(x=rep(NA,1000), y=rep(NA,1000))
tracemem(res)
for(i in 1:1000) {
res[i,"x"] <- runif(1)
res[i,"y"] <- rnorm(1)
}
But tracemem goes crazy: the data.frame is copied to a new address on every assignment.
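Since the question allows an intermediate format other than a data.frame, one option that is not part of the original answer is to fill pre-allocated plain vectors and build the data.frame only once at the end. The following is a minimal sketch under that assumption; the names n, x and y are illustrative, and it has not been benchmarked against the approaches discussed here.

```r
# Sketch: accumulate into pre-allocated atomic vectors, then
# construct the data.frame a single time after the loop.
n <- 1000
x <- numeric(n)  # pre-allocated column for x
y <- numeric(n)  # pre-allocated column for y

for (i in seq_len(n)) {
  x[i] <- runif(1)  # element assignment on a plain vector,
  y[i] <- rnorm(1)  # no data.frame method dispatch involved
}

# One data.frame construction instead of n row updates
res <- data.frame(x = x, y = y)
```

This only works when the number of rows is known (or can be bounded) in advance, so that the vectors can be sized up front.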
Alternative approach (doesn't work)
One approach (not sure it's faster, as I haven't benchmarked it yet) is to create a list of data.frames, then stack them all together:
makeRow <- function() data.frame(x=runif(1),y=rnorm(1))
res <- replicate(1000, makeRow(), simplify=FALSE ) # returns a list of data.frames
library(taRifx)
res.df <- stack(res)
Unfortunately, in creating the list I think you will be hard-pressed to pre-allocate. For instance:
> tracemem(res)
[1] "<0x79b98b0>"
> res[[2]] <- data.frame()
tracemem[0x79b98b0 -> 0x71da500]:
In other words, replacing an element of the list causes the list to be copied. I assume the whole list, but it's possible it's only that element of the list. I'm not intimately familiar with the details of R's memory management.
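For readers without taRifx, the same list-then-combine idea can be sketched in base R. This is an illustrative variant, not the answer's code: it assumes a list pre-allocated with vector("list", n) and combines the rows with do.call(rbind, ...) in place of stack. How much is actually copied when a list element is replaced depends on the R version, so it is worth checking with tracemem on your own setup.

```r
# Sketch: collect one-row data.frames in a pre-allocated list,
# then bind them together once at the end (base R only).
n <- 1000
rows <- vector("list", n)  # list of n NULL slots

for (i in seq_len(n)) {
  rows[[i]] <- data.frame(x = runif(1), y = rnorm(1))
}

# A single rbind over all list elements instead of n incremental rbinds
res.df <- do.call(rbind, rows)
```

The final rbind over many small data.frames is itself not cheap, so this mainly avoids the repeated copying of a growing data.frame rather than making the whole process fast.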
Probably the best approach
As with many speed- or memory-limited processes these days, the best approach may well be to use data.table instead of a data.frame. Since data.table has the := assign-by-reference operator, it can update without re-copying:
library(data.table)
dt <- data.table(x=rep(0,1000), y=rep(0,1000))
tracemem(dt)
for(i in 1:1000) {
dt[i,x := runif(1)]
dt[i,y := rnorm(1)]
}
# note no message from tracemem
But as @MatthewDowle points out, set() is the appropriate way to do this inside a loop. Doing so makes it faster still:
library(data.table)
n <- 10^6
dt <- data.table(x=rep(0,n), y=rep(0,n))
dt.colon <- function(dt) {
for(i in 1:n) {
dt[i,x := runif(1)]
dt[i,y := rnorm(1)]
}
}
dt.set <- function(dt) {
for(i in 1:n) {
set(dt,i,1L, runif(1) )
set(dt,i,2L, rnorm(1) )
}
}
library(microbenchmark)
m <- microbenchmark(dt.colon(dt), dt.set(dt),times=2)
Benchmarks
With the loop run 10,000 times, data.table is almost a full order of magnitude faster:
Unit: seconds
expr min lq median uq max
1 test.df() 523.49057 523.49057 524.52408 525.55759 525.55759
2 test.dt() 62.06398 62.06398 62.98622 63.90845 63.90845
3 test.stack() 1196.30135 1196.30135 1258.79879 1321.29622 1321.29622
Comparison of := with set():
> m
Unit: milliseconds
expr min lq median uq max
1 dt.colon(dt) 654.54996 654.54996 656.43429 658.3186 658.3186
2 dt.set(dt) 13.29612 13.29612 15.02891 16.7617 16.7617
Note that n here is 10^6, not 10^5 as in the benchmarks shown above. So there's an order of magnitude more work, and the result is measured in milliseconds, not seconds. Impressive indeed.