Is there a more efficient way than rbind.fill(list)?


Problem description

I have a list of data frames with different sets of columns, and I would like to combine them into one data frame. I use rbind.fill to do that, but I am looking for something that would do it more efficiently, similar to the answer given here.

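library(plyr)

set.seed(45)
# Build a 10-row data frame with between 5 and 15 randomly chosen letter-named columns
sample.fun <- function() {
   nam <- sample(LETTERS, sample(5:15, 1))  # draw a single column count
   val <- data.frame(matrix(sample(letters, length(nam)*10, replace=TRUE), nrow=10))
   setNames(val, nam)
}
# simplify = FALSE guarantees that a list of data frames is returned
ll <- replicate(1e4, sample.fun(), simplify = FALSE)
rbind.fill(ll)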

Solution

UPDATE: See this updated answer instead.

UPDATE (eddi): This has now been implemented in data.table version 1.8.11 as a fill argument to rbind. For example:

DT1 = data.table(a = 1:2, b = 1:2)
DT2 = data.table(a = 3:4, c = 1:2)

rbind(DT1, DT2, fill = TRUE)
#   a  b  c
#1: 1  1 NA
#2: 2  2 NA
#3: 3 NA  1
#4: 4 NA  2


FR #4790 has now been added: rbind.fill-like functionality (as in plyr) to merge a list of data.frames/data.tables.
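With a data.table version that supports the fill argument, the same idea should extend to a whole list through rbindlist. The following is only a minimal sketch under that assumption; the names make_df and dfs are made up for illustration, and the list is kept small so it runs quickly.

```r
library(data.table)

set.seed(45)

# Hypothetical helper: a 10-row data frame with 5 to 15 letter-named columns
make_df <- function() {
  nam <- sample(LETTERS, sample(5:15, 1))
  val <- data.frame(
    matrix(sample(letters, length(nam) * 10, replace = TRUE), nrow = 10),
    stringsAsFactors = FALSE
  )
  setNames(val, nam)
}

dfs <- replicate(100, make_df(), simplify = FALSE)

# fill = TRUE matches columns by name and pads the missing ones with NA
combined <- rbindlist(dfs, fill = TRUE)

dim(combined)
```

The result is a data.table; wrap it in as.data.frame() if a plain data frame is needed downstream.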

Note 1:

This solution uses data.table's rbindlist function to "rbind" a list of data.tables. For this, be sure to use version 1.8.9 or later, because of this bug in versions < 1.8.9.

Note 2:

As of now, when rbindlist binds a list of data.frames/data.tables, each column keeps the data type it has in the first data.frame. That is, if a column is character in the first data.frame and the same column is a factor in the second, rbindlist returns that column as character. So if your data.frames consist entirely of character columns, the result of this method is identical to that of the plyr method. If not, the values will still be the same, but some columns will be character instead of factor, and you'll have to convert them to factor yourself afterwards. Hopefully this behaviour will change in the future.
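The manual conversion mentioned in Note 2 could look roughly like the sketch below. The tables d1 and d2 and the column names are invented for illustration. Depending on the data.table version, the bound column may come out as character or already as a factor, so the code converts only whatever is still character.

```r
library(data.table)

# The first table holds grp as character, the second holds it as a factor
d1 <- data.table(id = 1:2, grp = c("a", "b"))
d2 <- data.table(id = 3:4, grp = factor(c("b", "c")))

bound <- rbindlist(list(d1, d2))

# Convert every column that is still character to factor, by reference
char_cols <- names(bound)[vapply(bound, is.character, logical(1))]
if (length(char_cols) > 0) {
  bound[, (char_cols) := lapply(.SD, as.factor), .SDcols = char_cols]
}

str(bound)
```

Converting by reference with := avoids copying the whole table, which matters once the bound result is large.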

And now, here is the approach using data.table (with a benchmark comparison against rbind.fill from plyr):

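require(data.table)
rbind.fill.DT <- function(ll) {
    # changed sapply to lapply to return a list always
    all.names <- lapply(ll, names)
    # union of all column names, in order of first appearance
    unq.names <- unique(unlist(all.names))
    ll.m <- rbindlist(lapply(seq_along(ll), function(x) {
        tt <- ll[[x]]
        # turn the data.frame into a data.table by reference (no copy)
        setattr(tt, 'class', c('data.table', 'data.frame'))
        # internal, unexported helper (hence :::); it may not be available
        # in other data.table versions
        data.table:::settruelength(tt, 0L)
        # over-allocate column slots so := can add columns by reference
        invisible(alloc.col(tt))
        # add the columns this table lacks, filled with NA
        tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
        # give every table the same column order, since rbindlist binds by position here
        setcolorder(tt, unq.names)
    }))
}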

rbind.fill.PLYR <- function(ll) {
    rbind.fill(ll)
}

require(microbenchmark)
microbenchmark(t1 <- rbind.fill.DT(ll), t2 <- rbind.fill.PLYR(ll), times=10)
# Unit: seconds
#                      expr      min        lq    median        uq       max neval
#   t1 <- rbind.fill.DT(ll)  10.8943  11.02312  11.26374  11.34757  11.51488    10
# t2 <- rbind.fill.PLYR(ll) 121.9868 134.52107 136.41375 184.18071 347.74724    10


# for comparison change t2 to data.table
setattr(t2, 'class', c('data.table', 'data.frame'))
data.table:::settruelength(t2, 0L)
invisible(alloc.col(t2))
setcolorder(t2, unique(unlist(sapply(ll, names))))

identical(t1, t2) # [1] TRUE

It should be noted that plyr's rbind.fill is slightly faster than this particular data.table solution for list sizes up to about 500.

Benchmarking plot:

Here's the plot of the run times for lists of data.frames with lengths given by seq(1000, 10000, by=1000). I've used microbenchmark with 10 reps for each of these list lengths.
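A scaled-down timing loop in the same spirit might look like the sketch below. It is not the gist used for the plot: it swaps in rbindlist(fill = TRUE) for rbind.fill.DT, uses much shorter lists and fewer reps so that it finishes quickly, and the names make_df, sizes and timings are made up for illustration.

```r
library(plyr)
library(data.table)
library(microbenchmark)

set.seed(45)

# Hypothetical helper: a 10-row data frame with 5 to 15 letter-named columns
make_df <- function() {
  nam <- sample(LETTERS, sample(5:15, 1))
  val <- data.frame(
    matrix(sample(letters, length(nam) * 10, replace = TRUE), nrow = 10),
    stringsAsFactors = FALSE
  )
  setNames(val, nam)
}

# Assumed list lengths, far smaller than the 1000 to 10000 used for the plot
sizes <- c(100, 200, 400)

timings <- do.call(rbind, lapply(sizes, function(n) {
  dfs <- replicate(n, make_df(), simplify = FALSE)
  mb <- microbenchmark(
    plyr = rbind.fill(dfs),
    data.table = rbindlist(dfs, fill = TRUE),
    times = 5
  )
  s <- summary(mb, unit = "ms")
  # one row per method with the median run time in milliseconds
  data.frame(n = n, method = as.character(s$expr), median_ms = s$median)
}))

timings
```

Plotting median_ms against n for each method gives a picture comparable in form to the one described above, although the absolute numbers will depend on the machine and package versions.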

Benchmarking gist:

Here's the gist for benchmarking, in case anyone wants to replicate the results.
