Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply
Question
I know there are many questions here on SO about ways to convert a list of data.frames to a single data.frame using do.call or ldply, but this question is about understanding the inner workings of both methods and figuring out why I can't get either to work for concatenating a list of almost 1 million data.frames of the same structure, same field names, etc. into a single data.frame. Each data.frame has one row and 21 columns.
The data started out as a JSON file, which I converted to lists using fromJSON. I then ran another lapply to extract part of each list and convert it to a data.frame, and ended up with a list of data.frames.
I've tried:
df <- do.call("rbind", list)
df <- ldply(list)
but I've had to kill the process after letting it run for up to 3 hours without getting anything back.
Is there a more efficient method of doing this? How can I troubleshoot what is happening and why it is taking so long?
FYI - I'm using RStudio server on a 72GB quad-core server with RHEL, so I don't think memory is the problem. sessionInfo below:
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] multicore_0.1-7 plyr_1.7.1 rjson_0.2.6
loaded via a namespace (and not attached):
[1] tools_2.14.1
>
Recommended answer
Given that you are looking for performance, a data.table solution is worth suggesting.
There is a function rbindlist which does the same thing as do.call(rbind, list) but is much faster.
library(data.table)
X <- replicate(50000, data.table(a=rnorm(5), b=1:5), simplify=FALSE)
system.time(rbindlist.data.table <- rbindlist(X))
## user system elapsed
## 0.00 0.01 0.02
It also works on a list of data.frames:
Xdf <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
system.time(rbindlist.data.frame <- rbindlist(Xdf))
## user system elapsed
## 0.03 0.00 0.03
For comparison, the do.call approach:
system.time(docall <- do.call(rbind, Xdf))
## user system elapsed
## 50.72 9.89 60.88
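As for troubleshooting why do.call(rbind, ...) takes so long, one rough approach is to time it on progressively larger slices of the list and watch how the elapsed time grows. The sketch below uses only base R; the helper name time_rbind, the synthetic two-column data and the chosen sizes are all assumptions for illustration, not part of the original post. If the time grows much faster than the slice size, extrapolating to a million elements suggests the full job is unlikely to finish in a reasonable time.

```r
# Hypothetical helper: elapsed seconds for do.call(rbind, ...) on the first n elements
time_rbind <- function(lst, n) {
  system.time(do.call(rbind, lst[seq_len(n)]))[["elapsed"]]
}

# Small synthetic list of one-row data.frames (stand-in for the real data)
set.seed(1)
dfs <- replicate(2000, data.frame(a = rnorm(1), b = 1L), simplify = FALSE)

# Time growing slices; the sizes are arbitrary and kept small so this runs quickly
sizes <- c(250, 500, 1000, 2000)
elapsed <- vapply(sizes, function(n) time_rbind(dfs, n), numeric(1))

timings <- data.frame(n = sizes, elapsed = elapsed)
print(timings)
```

The absolute numbers depend on the machine and R version, so only the trend across rows of timings is meaningful.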
And some proper benchmarking:
library(rbenchmark)
benchmark(rbindlist.data.table = rbindlist(X),
rbindlist.data.frame = rbindlist(Xdf),
docall = do.call(rbind, Xdf),
replications = 5)
## test replications elapsed relative user.self sys.self
## 3 docall 5 276.61 3073.444445 264.08 11.4
## 2 rbindlist.data.frame 5 0.11 1.222222 0.11 0.0
## 1 rbindlist.data.table 5 0.09 1.000000 0.09 0.0
And against @JoshuaUlrich's solutions:
benchmark(use.rbl.dt = rbl.dt(X),
use.rbl.ju = rbl.ju(Xdf),
use.rbindlist = rbindlist(X),
replications = 5)
## test replications elapsed relative user.self
## 3 use.rbindlist 5 0.10 1.0 0.09
## 1 use.rbl.dt 5 0.10 1.0 0.09
## 2 use.rbl.ju 5 0.33 3.3 0.31
I'm not sure you really need to use as.data.frame, because a data.table inherits class data.frame.
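A minimal sketch of that last point, assuming the data.table package is installed and using a small synthetic list in place of the real one. It checks the class of the rbindlist result and shows setDF as one way to get a plain data.frame if code downstream insists on one; the copy() call is there only so that dt itself is left unchanged.

```r
library(data.table)

# Small synthetic list of one-row data.frames (stand-in for the real data)
dfs <- replicate(1000, data.frame(a = rnorm(1), b = 1L), simplify = FALSE)

dt <- rbindlist(dfs)

# The result carries both classes, so data.frame-based code generally accepts it
print(class(dt))
print(is.data.frame(dt))

# Optional: convert a copy to a plain data.frame
df <- setDF(copy(dt))
print(class(df))
```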