使用 do.call 和 ldply 将一长串 data.frames(约 100 万)转换为单个 data.frame 时遇到问题 [英] Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply

查看:24
本文介绍了使用 do.call 和 ldply 将一长串 data.frames(约 100 万)转换为单个 data.frame 时遇到问题的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我知道这里有很多关于使用 do.call 或 ldply 将 data.frames 列表转换为单个 data.frame 的方法的问题,但这个问题是关于理解这两种方法的内部工作原理并尝试弄清楚为什么我无法将具有相同结构、相同字段名称等的近 100 万个 df 的列表连接到单个 data.frame 中.每个 data.frame 为 1 行 21 列.

I know there are many questions here in SO about ways to convert a list of data.frames to a single data.frame using do.call or ldply, but this questions is about understanding the inner workings of both methods and trying to figure out why I can't get either to work for concatenating a list of almost 1 million df's of the same structure, same field names, etc. into a single data.frame. Each data.frame is of one row and 21 columns.

数据开始是一个 JSON 文件,我使用 fromJSON 将其转换为列表,然后运行另一个 lapply 以提取列表的一部分并转换为 data.frame,最终得到一个 data.frames 列表.

The data started out as a JSON file, which I converted to lists using fromJSON, then ran another lapply to extract part of the list and converted to data.frame and ended up with a list of data.frames.

我试过了:

df <- do.call("rbind", list)
df <- ldply(list)

但我不得不在让它运行长达 3 小时且没有得到任何回报后终止该进程.

but I've had to kill the process after letting it run up to 3 hours and not getting anything back.

有没有更有效的方法来做到这一点?如何解决正在发生的事情以及为什么需要这么长时间?

Is there a more efficient method of doing this? How can I troubleshoot what is happening and why is it taking so long?

仅供参考 - 我在带有 RHEL 的 72GB 四核服务器上使用 RStudio 服务器,所以我认为内存不是问题.会话信息如下:

FYI - I'm using RStudio server on a 72GB quad-core server with RHEL, so I don't think memory is the problem. sessionInfo below:

> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] multicore_0.1-7 plyr_1.7.1      rjson_0.2.6    

loaded via a namespace (and not attached):
[1] tools_2.14.1
> 

推荐答案

鉴于您正在寻找性能,看来应该建议使用 data.table 解决方案.

Given that you are looking for performance, it appears that a data.table solution should be suggested.

有一个函数 rbindlist相同 但比 do.call(rbind, list)

There is a function rbindlist which is the same but much faster than do.call(rbind, list)

library(data.table)
X <- replicate(50000, data.table(a=rnorm(5), b=1:5), simplify=FALSE)
system.time(rbindlist.data.table <- rbindlist(X))
##  user  system elapsed 
##  0.00    0.01    0.02

data.frame

Xdf <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)

system.time(rbindlist.data.frame <- rbindlist(Xdf))
##  user  system elapsed 
##  0.03    0.00    0.03

比较

system.time(docall <- do.call(rbind, Xdf))
##  user  system elapsed 
## 50.72    9.89   60.88 

还有一些适当的基准测试

And some proper benchmarking

library(rbenchmark)
benchmark(rbindlist.data.table = rbindlist(X), 
           rbindlist.data.frame = rbindlist(Xdf),
           docall = do.call(rbind, Xdf),
           replications = 5)
##                   test replications elapsed    relative user.self sys.self 
## 3               docall            5  276.61 3073.444445    264.08     11.4 
## 2 rbindlist.data.frame            5    0.11    1.222222      0.11      0.0 
## 1 rbindlist.data.table            5    0.09    1.000000      0.09      0.0 

反对@JoshuaUlrich 的解决方案

benchmark(use.rbl.dt  = rbl.dt(X), 
          use.rbl.ju  = rbl.ju (Xdf),
          use.rbindlist =rbindlist(X) ,
          replications = 5)

##              test replications elapsed relative user.self 
## 3  use.rbindlist            5    0.10      1.0      0.09
## 1     use.rbl.dt            5    0.10      1.0      0.09
## 2     use.rbl.ju            5    0.33      3.3      0.31 

我不确定你是否真的需要使用 as.data.frame,因为 data.table 继承了类 data.frame

I'm not sure you really need to use as.data.frame, because a data.table inherits class data.frame

这篇关于使用 do.call 和 ldply 将一长串 data.frames(约 100 万)转换为单个 data.frame 时遇到问题的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆