最高效的列表到data.frame方法? [英] Most efficient list to data.frame method?

查看:223
本文介绍了最高效的列表到data.frame方法?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

刚刚和同事谈过这个话题,我们认为值得看看人们在SO土地上不得不说的话。假设我有一个包含N个元素的列表,其中每个元素都是长度为X的向量。现在假设我想将它转换成一个data.frame。和R中的大多数东西一样,使用plyr包,组合 do,可以使用多种方法来剥皮这种特殊的猫,例如 as.dataframe 。使用 cbind 调用,预分配DF并填充它,等等。

Just had a conversation with coworkers about this, and we thought it'd be worth seeing what people out in SO land had to say. Suppose I had a list with N elements, where each element was a vector of length X. Now suppose I wanted to transform that into a data.frame. As with most things in R, there are multiple ways of skinning the proverbial cat, such as as.dataframe, using the plyr package, comboing do.call with cbind, pre-allocating the DF and filling it in, and others.

提出的问题是当N或X(在我们的例子中是X)变得非常大时会发生什么。是否有一种猫皮肤方法,当效率(特别是在记忆方面)是至关重要的?

The problem that was presented was what happens when either N or X (in our case it is X) becomes extremely large. Is there one cat skinning method that's notably superior when efficiency (particularly in terms of memory) is of the essence?

推荐答案

data.frame 已经是一个列表,你知道每个列表元素是相同的长度(X),最快的事情可能是更新 class row.names 属性:

Since a data.frame is already a list and you know that each list element is the same length (X), the fastest thing would probably be to just update the class and row.names attributes:

set.seed(21)
n <- 1e6
x <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
x <- c(x,x,x,x,x,x)

system.time(a <- as.data.frame(x))
system.time(b <- do.call(data.frame,x))
system.time({
  d <- x  # Skip 'c' so Joris doesn't down-vote me! ;-)
  class(d) <- "data.frame"
  rownames(d) <- 1:n
  names(d) <- make.unique(names(d))
})

identical(a, b)  # TRUE
identical(b, d)  # TRUE

更新 - 这比创建 d 快了〜2倍:

Update - this is ~2x faster than creating d:

system.time({
  e <- x
  attr(e, "row.names") <- c(NA_integer_,n)
  attr(e, "class") <- "data.frame"
  attr(e, "names") <- make.names(names(e), unique=TRUE)
})

identical(d, e)  # TRUE

更新2 - 我忘记了内存消耗。最后一次更新会生成 e 的两个副本。使用属性函数只能减少一个副本。

Update 2 - I forgot about memory consumption. The last update makes two copies of e. Using the attributes function reduces that to only one copy.

set.seed(21)
f <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
f <- c(f,f,f,f,f,f)
tracemem(f)
system.time({  # makes 2 copies
  attr(f, "row.names") <- c(NA_integer_,n)
  attr(f, "class") <- "data.frame"
  attr(f, "names") <- make.names(names(f), unique=TRUE)
})

set.seed(21)
g <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
g <- c(g,g,g,g,g,g)
tracemem(g)
system.time({  # only makes 1 copy
  attributes(g) <- list(row.names=c(NA_integer_,n),
    class="data.frame", names=make.names(names(g), unique=TRUE))
})

identical(f,g)  # TRUE

这篇关于最高效的列表到data.frame方法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆