Most efficient list to data.frame method?


Question

Just had a conversation with coworkers about this, and we thought it'd be worth seeing what people out in SO land had to say. Suppose I have a list with N elements, where each element is a vector of length X. Now suppose I want to transform that into a data.frame. As with most things in R, there are multiple ways of skinning the proverbial cat, such as as.data.frame, the plyr package, combining do.call with cbind, pre-allocating the data.frame and filling it in, and others.

The question that came up was what happens when either N or X (in our case, X) becomes extremely large. Is there one cat-skinning method that's notably superior when efficiency (particularly in terms of memory) is of the essence?

Recommended answer

Since a data.frame is already a list and you know that each list element has the same length (X), the fastest thing would probably be to just update the class and row.names attributes:

set.seed(21)
n <- 1e6
x <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
x <- c(x,x,x,x,x,x)

system.time(a <- as.data.frame(x))
system.time(b <- do.call(data.frame,x))
system.time({
  d <- x  # Skip 'c' so Joris doesn't down-vote me! ;-)
  class(d) <- "data.frame"
  rownames(d) <- 1:n
  names(d) <- make.unique(names(d))
})

identical(a, b)  # TRUE
identical(b, d)  # TRUE
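
To see why setting attributes is enough, it can help to inspect a small data frame directly. The snippet below is only an illustrative sketch with made-up values (`df` and `cols` are names introduced here, not part of the answer): it shows that a data frame is a list of columns carrying `names`, `class` and `row.names` attributes.

```r
# A small data frame to inspect (the values are arbitrary).
df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))

is.list(df)            # a data frame is a list of column vectors
typeof(df)             # the underlying type is "list"
names(attributes(df))  # names, class and row.names

# Dropping the class leaves a plain list; the columns are unchanged
# and the row.names attribute is still attached.
cols <- unclass(df)
is.data.frame(cols)    # FALSE
```

Going the other way, as the timing code above does, is just a matter of putting those attributes onto a plain list.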

Update: this is ~2x faster than creating d:

system.time({
  e <- x
  attr(e, "row.names") <- c(NA_integer_,n)
  attr(e, "class") <- "data.frame"
  attr(e, "names") <- make.names(names(e), unique=TRUE)
})

identical(d, e)  # TRUE
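
The `c(NA_integer_, n)` value used above relies on the compact form in which R stores automatic row names: a length-2 integer vector whose first element is NA, rather than a full vector of n row labels. The sketch below (with an arbitrary three-row data frame, `df`, introduced only for illustration) looks at that internal form with the base functions `.row_names_info()` and `.set_row_names()`; treat the details of the internal representation as something to check against your own R version.

```r
# An arbitrary small data frame with automatic row names.
df <- data.frame(x = 1:3)

# attr() expands the compact form into the full integer sequence.
attr(df, "row.names")

# type = 0L returns the internal attribute as stored: two integers,
# the first of which is NA.
internal <- .row_names_info(df, 0L)
length(internal)

# type = 2L returns the number of rows implied by the attribute.
.row_names_info(df, 2L)

# .set_row_names() builds the compact form for a given row count.
compact <- .set_row_names(3L)
```

Because only two integers are stored regardless of n, setting row names this way avoids allocating and scanning a length-n vector, which may account for part of the speedup over `rownames(d) <- 1:n`.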

Update 2: I forgot about memory consumption. The last update makes two copies of e. Using the attributes function reduces that to only one copy.

set.seed(21)
f <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
f <- c(f,f,f,f,f,f)
tracemem(f)
system.time({  # makes 2 copies
  attr(f, "row.names") <- c(NA_integer_,n)
  attr(f, "class") <- "data.frame"
  attr(f, "names") <- make.names(names(f), unique=TRUE)
})

set.seed(21)
g <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
g <- c(g,g,g,g,g,g)
tracemem(g)
system.time({  # only makes 1 copy
  attributes(g) <- list(row.names=c(NA_integer_,n),
    class="data.frame", names=make.names(names(g), unique=TRUE))
})

identical(f,g)  # TRUE
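
If you want to reuse the single-`attributes<-` approach, it could be wrapped in a small function. The following is a minimal sketch, not part of the original answer: `list_to_df` is a hypothetical name, the equal-length check and the `V1`, `V2`, ... fallback names are assumptions added for safety, and `.set_row_names()` is used in place of `c(NA_integer_, n)`.

```r
# Hypothetical helper (not from the original answer): turn a list of
# equal-length vectors into a data.frame with one attributes<- call.
list_to_df <- function(lst) {
  stopifnot(is.list(lst), length(lst) > 0L)

  # All elements must share one length, otherwise the result would be
  # a malformed data frame.
  n <- unique(lengths(lst))
  if (length(n) != 1L) stop("all elements must have the same length")

  # Fall back to V1, V2, ... when the list has no names (an assumption).
  nms <- names(lst)
  if (is.null(nms)) nms <- paste0("V", seq_along(lst))

  attributes(lst) <- list(
    row.names = .set_row_names(n),
    class = "data.frame",
    names = make.names(nms, unique = TRUE)
  )
  lst
}

# Usage on a tiny list with a duplicated name.
small <- list(x = 1:3, y = c("a", "b", "c"), x = 4:6)
res <- list_to_df(small)
dim(res)
names(res)
```

Note that this does none of the recycling or type conversion that `data.frame()` can perform; the columns are taken exactly as they are in the list.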
