Most efficient list to data.frame method?
Question
Just had a conversation with coworkers about this, and we thought it'd be worth seeing what people out in SO land had to say. Suppose I had a list with N elements, where each element was a vector of length X. Now suppose I wanted to transform that into a data.frame. As with most things in R, there are multiple ways of skinning the proverbial cat, such as as.data.frame, using the plyr package, combining do.call with cbind, pre-allocating the DF and filling it in, and others.
The problem that was presented was what happens when either N or X (in our case it is X) becomes extremely large. Is there one cat skinning method that's notably superior when efficiency (particularly in terms of memory) is of the essence?
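For readers who want the candidate approaches side by side, here is a minimal base-R sketch of three of the options named above (the plyr route is left out). The toy list `lst` and the names `df1`, `df2`, `df3` are made up for illustration and do not come from the original question.

```r
# Toy input (hypothetical): N = 3 elements, each a vector of length X = 4
lst <- list(a = 1:4, b = 5:8, c = 9:12)

# 1. Direct coercion
df1 <- as.data.frame(lst)

# 2. do.call with cbind builds a matrix first, which is then coerced
df2 <- as.data.frame(do.call(cbind, lst))

# 3. Pre-allocate a data.frame of the right shape and fill it column by column
df3 <- data.frame(matrix(NA_integer_, nrow = 4, ncol = 3))
names(df3) <- names(lst)
for (nm in names(lst)) df3[[nm]] <- lst[[nm]]
```

Note that the cbind route goes through a matrix, so a list with mixed column types would be coerced to a single common type along the way.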
Recommended answer
Since a data.frame is already a list and you know that each list element is the same length (X), the fastest thing would probably be to just update the class and row.names attributes:
set.seed(21)
n <- 1e6
x <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
x <- c(x,x,x,x,x,x)
system.time(a <- as.data.frame(x))
system.time(b <- do.call(data.frame,x))
system.time({
  d <- x  # Skip 'c' so Joris doesn't down-vote me! ;-)
  class(d) <- "data.frame"
  rownames(d) <- 1:n
  names(d) <- make.unique(names(d))
})
identical(a, b)  # TRUE
identical(b, d)  # TRUE
Update - this is ~2x faster than creating d:
system.time({
  e <- x
  attr(e, "row.names") <- c(NA_integer_,n)
  attr(e, "class") <- "data.frame"
  attr(e, "names") <- make.names(names(e), unique=TRUE)
})
identical(d, e)  # TRUE
Update 2 - I forgot about memory consumption. The last update makes two copies of e. Using the attributes function reduces that to only one copy.
set.seed(21)
f <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
f <- c(f,f,f,f,f,f)
tracemem(f)
system.time({  # makes 2 copies
  attr(f, "row.names") <- c(NA_integer_,n)
  attr(f, "class") <- "data.frame"
  attr(f, "names") <- make.names(names(f), unique=TRUE)
})

set.seed(21)
g <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
g <- c(g,g,g,g,g,g)
tracemem(g)
system.time({  # only makes 1 copy
  attributes(g) <- list(row.names=c(NA_integer_,n),
                        class="data.frame",
                        names=make.names(names(g), unique=TRUE))
})
identical(f,g)  # TRUE
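If the attributes-based conversion is needed in more than one place, it could be wrapped in a small function. The sketch below is one possible way to do that, not part of the original answer: the name `list_to_df` and the fallback column names `V1`, `V2`, ... are assumptions made for illustration. It adds an equal-length check, since setting the class by hand skips the validation that as.data.frame performs, and it uses base R's `.set_row_names()` to build the compact row-names form.

```r
# Hypothetical helper (not from the original answer): wraps the
# attributes()-based conversion and adds a basic sanity check.
list_to_df <- function(lst) {
  stopifnot(is.list(lst), length(lst) > 0L)
  lens <- lengths(lst)
  # Every element must have the same length to form valid columns
  stopifnot(all(lens == lens[1L]))
  nms <- names(lst)
  # Assumed fallback naming for unnamed lists
  if (is.null(nms)) nms <- paste0("V", seq_along(lst))
  # Set all attributes in a single call, as in the last timing block
  attributes(lst) <- list(
    names = make.names(nms, unique = TRUE),
    row.names = .set_row_names(lens[1L]),
    class = "data.frame"
  )
  lst
}

# Usage on a small list with a duplicated name
small <- list(a = 1:3, b = c(2.5, 3.5, 4.5), a = letters[1:3])
small_df <- list_to_df(small)
```

The length check costs one pass over the list elements, which is cheap compared with copying the data, but it can be dropped if the input is already known to be well formed.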