Growing a data.frame in a memory-efficient manner

Question

According to "Creating an R dataframe row-by-row", it's not ideal to append to a data.frame using rbind, as it creates a copy of the whole data.frame each time. How do I accumulate data in R resulting in a data.frame without incurring this penalty? The intermediate format doesn't need to be a data.frame.

Recommended answer

First approach

I tried accessing each element of a pre-allocated data.frame:

res <- data.frame(x=rep(NA,1000), y=rep(NA,1000))
tracemem(res)
for(i in 1:1000) {
  res[i,"x"] <- runif(1)
  res[i,"y"] <- rnorm(1)
}

But tracemem goes crazy (i.e. the data.frame is being copied to a new address on every assignment).
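
Since the question allows the intermediate format to be something other than a data.frame, one option is to fill plain pre-allocated vectors inside the loop and build the data.frame only once at the end. The following is a minimal sketch of that idea, not something benchmarked in this answer; the names n, x and y are chosen for illustration, and the expectation that element-wise assignment into a bare vector avoids the per-iteration copy is an assumption that may depend on the R version.

```r
n <- 1000

# Pre-allocate plain numeric vectors as the intermediate format
# (assumption: assigning into a bare vector is cheaper than res[i, "x"] <- ...).
x <- numeric(n)
y <- numeric(n)

for (i in seq_len(n)) {
  x[i] <- runif(1)
  y[i] <- rnorm(1)
}

# Build the data.frame a single time, after the loop has finished.
res <- data.frame(x = x, y = y)
```

Running tracemem(x) before the loop is a quick way to check whether the vector is being copied on your own setup.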

Alternative approach (doesn't work)

One approach (not sure it's faster as I haven't benchmarked yet) is to create a list of data.frames, then stack them all together:

makeRow <- function() data.frame(x=runif(1),y=rnorm(1))
res <- replicate(1000, makeRow(), simplify=FALSE ) # returns a list of data.frames
library(taRifx)
res.df <- stack(res)

Unfortunately in creating the list I think you will be hard-pressed to pre-allocate. For instance:

> tracemem(res)
[1] "<0x79b98b0>"
> res[[2]] <- data.frame()
tracemem[0x79b98b0 -> 0x71da500]: 

In other words, replacing an element of the list causes the list to be copied. I assume the whole list, but it's possible it's only that element of the list. I'm not intimately familiar with the details of R's memory management.
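
For readers who don't have taRifx installed, the list-then-combine idea can also be sketched in base R. Here do.call(rbind, ...) is swapped in for stack(); it is not the method used above and it is not benchmarked here. The names make_row, rows and n are illustrative, and whether filling the pre-allocated list triggers copies is left open, as in the discussion above.

```r
# Hypothetical helper mirroring makeRow() above: returns a one-row data.frame.
make_row <- function() data.frame(x = runif(1), y = rnorm(1))

n <- 100

# Pre-allocate a list with n empty slots, then fill it one row at a time.
rows <- vector("list", n)
for (i in seq_len(n)) {
  rows[[i]] <- make_row()
}

# Combine all one-row data.frames in a single call,
# instead of calling rbind() once per iteration.
res.df <- do.call(rbind, rows)
```

This still creates n small data.frames, so it may well be slow for large n, which would be consistent with the poor showing of the stack variant in the benchmark.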

Probably the best approach

As with many speed or memory-limited processes these days, the best approach may well be to use data.table instead of a data.frame. Since data.table has the := assign by reference operator, it can update without re-copying:

library(data.table)
dt <- data.table(x=rep(0,1000), y=rep(0,1000))
tracemem(dt)
for(i in 1:1000) {
  dt[i,x := runif(1)]
  dt[i,y := rnorm(1)]
}
# note no message from tracemem

But as @MatthewDowle points out, set() is the appropriate way to do this inside a loop. Doing so makes it faster still:

library(data.table)
n <- 10^6
dt <- data.table(x=rep(0,n), y=rep(0,n))

dt.colon <- function(dt) {
  for(i in 1:n) {
    dt[i,x := runif(1)]
    dt[i,y := rnorm(1)]
  }
}

dt.set <- function(dt) {
  for(i in 1:n) {
    set(dt,i,1L, runif(1) )
    set(dt,i,2L, rnorm(1) )
  }
}

library(microbenchmark)
m <- microbenchmark(dt.colon(dt), dt.set(dt),times=2)
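
To see that set() really updates in place, one can compare the table's memory address before and after the loop. This is a small sketch assuming the data.table package is installed; it uses data.table's address() function, a reduced n so it runs quickly, and column names instead of the integer positions used above.

```r
library(data.table)

n <- 1000L
dt <- data.table(x = numeric(n), y = numeric(n))

# Record where the data.table lives before any update.
addr_before <- address(dt)

for (i in seq_len(n)) {
  # set(x, i, j, value): assign by reference into row i of column j.
  set(dt, i = i, j = "x", value = runif(1))
  set(dt, i = i, j = "y", value = rnorm(1))
}

# The address should be unchanged if no copy of the table was made.
same_address <- identical(address(dt), addr_before)
```

If same_address is TRUE, the loop updated dt without reallocating it, which is the behaviour tracemem's silence suggested in the := example.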

Benchmark

With the loop run 10,000 times, data.table is almost a full order of magnitude faster:

Unit: seconds
          expr        min         lq     median         uq        max
1    test.df()  523.49057  523.49057  524.52408  525.55759  525.55759
2    test.dt()   62.06398   62.06398   62.98622   63.90845   63.90845
3 test.stack() 1196.30135 1196.30135 1258.79879 1321.29622 1321.29622

Comparison of := with set()

> m
Unit: milliseconds
          expr       min        lq    median       uq      max
1 dt.colon(dt) 654.54996 654.54996 656.43429 658.3186 658.3186
2   dt.set(dt)  13.29612  13.29612  15.02891  16.7617  16.7617

Note that n here is 10^6 not 10^5 as in the benchmarks plotted above. So there's an order of magnitude more work, and the result is measured in milliseconds not seconds. Impressive indeed.
