Most efficient list to data.frame method?
Question
Just had a conversation with coworkers about this, and we thought it'd be worth seeing what people out in SO land had to say. Suppose I had a list with N elements, where each element was a vector of length X. Now suppose I wanted to transform that into a data.frame. As with most things in R, there are multiple ways of skinning the proverbial cat, such as as.data.frame, using the plyr package, combining do.call with cbind, pre-allocating the DF and filling it in, and others.
The question that came up was what happens when either N or X (in our case it is X) becomes extremely large. Is there one cat-skinning method that's notably superior when efficiency (particularly in terms of memory) is of the essence?
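For reference, the base R approaches named in the question can be sketched on a tiny list. This is only an illustration: the list lst and its values are made up, and the plyr variant is left out to keep the sketch free of extra packages.

```r
# Tiny stand-in for the real data: N = 2 elements, each of length X = 3.
lst <- list(x = c(1, 2, 3), y = c(4, 5, 6))

# 1. as.data.frame() on the list directly.
a <- as.data.frame(lst)

# 2. do.call() with cbind(): builds a numeric matrix first, then converts.
#    Only sensible when every element has the same type.
b <- as.data.frame(do.call(cbind, lst))

# 3. Pre-allocate the data.frame, then fill it in column by column.
p <- data.frame(x = numeric(3), y = numeric(3))
for (nm in names(lst)) p[[nm]] <- lst[[nm]]

# All three should describe the same table.
all.equal(a, b)
all.equal(a, p)
```

The cbind route goes through an intermediate matrix, which is one reason it is unlikely to be the memory-friendly choice when X is very large.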
Recommended answer
Since a data.frame is already a list, and you know that each list element is the same length (X), the fastest thing would probably be to just update the class and row.names attributes:
set.seed(21)
n <- 1e6
x <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
x <- c(x,x,x,x,x,x)
system.time(a <- as.data.frame(x))
system.time(b <- do.call(data.frame,x))
system.time({
d <- x # Skip 'c' so Joris doesn't down-vote me! ;-)
class(d) <- "data.frame"
rownames(d) <- 1:n
names(d) <- make.unique(names(d))
})
identical(a, b) # TRUE
identical(b, d) # TRUE
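The attribute-based conversion above can also be wrapped in a small helper that checks its own assumptions first. This is a minimal sketch, not part of the original answer: list_to_df is a hypothetical name, and the checks assume a non-empty, named list of equal-length vectors. It uses the base helper .set_row_names() to build compact automatic row names.

```r
# Hypothetical helper: convert a named list of equal-length vectors to a
# data.frame by setting attributes only, without going through data.frame().
list_to_df <- function(lst) {
  lens <- lengths(lst)
  # The shortcut is only valid if these assumptions hold.
  stopifnot(length(lst) > 0, !is.null(names(lst)), all(lens == lens[[1]]))
  attr(lst, "names") <- make.unique(names(lst))         # de-duplicate column names
  attr(lst, "row.names") <- .set_row_names(lens[[1]])   # compact automatic row names
  attr(lst, "class") <- "data.frame"
  lst
}

# Small made-up input with a duplicated name.
small <- list(a = 1:3, b = c(1.5, 2.5, 3.5), a = c("u", "v", "w"))
df <- list_to_df(small)
str(df)
```

Skipping the length check is what makes the raw attribute approach fast, but it also means a ragged list would silently produce a corrupt data.frame, so a guard like this may be worth its small cost.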
Update - this is ~2x faster than creating d:
system.time({
e <- x
attr(e, "row.names") <- c(NA_integer_,n)
attr(e, "class") <- "data.frame"
attr(e, "names") <- make.names(names(e), unique=TRUE)
})
identical(d, e) # TRUE
Update 2 - I forgot about memory consumption. The last update makes two copies of e. Using the attributes function reduces that to only one copy.
set.seed(21)
f <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
f <- c(f,f,f,f,f,f)
tracemem(f)
system.time({ # makes 2 copies
attr(f, "row.names") <- c(NA_integer_,n)
attr(f, "class") <- "data.frame"
attr(f, "names") <- make.names(names(f), unique=TRUE)
})
set.seed(21)
g <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
g <- c(g,g,g,g,g,g)
tracemem(g)
system.time({ # only makes 1 copy
attributes(g) <- list(row.names=c(NA_integer_,n),
class="data.frame", names=make.names(names(g), unique=TRUE))
})
identical(f,g) # TRUE