Is there a more efficient way than rbind.fill(list)?
I have a list of data frames with different sets of columns, and I would like to combine them into one data frame. I use rbind.fill to do that, but I am looking for something that would do it more efficiently, similar to the answer given here.
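require(plyr)
set.seed(45)
# build a 10-row data.frame of random letters with 5 to 15 randomly named columns
sample.fun <- function() {
  nam <- sample(LETTERS, sample(5:15, 1))
  val <- data.frame(matrix(sample(letters, length(nam)*10, replace=TRUE), nrow=10))
  setNames(val, nam)
}
# simplify=FALSE guarantees a list of data.frames
ll <- replicate(1e4, sample.fun(), simplify=FALSE)
rbind.fill(ll)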
UPDATE: See this updated answer instead.
UPDATE (eddi): This has now been implemented in version 1.8.11 as a fill argument to rbind. For example:
DT1 = data.table(a = 1:2, b = 1:2)
DT2 = data.table(a = 3:4, c = 1:2)
rbind(DT1, DT2, fill = TRUE)
# a b c
#1: 1 1 NA
#2: 2 2 NA
#3: 3 NA 1
#4: 4 NA 2
FR #4790 has now been added: rbind.fill-like functionality (as in plyr) to merge a list of data.frames/data.tables.
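For a whole list of data.frames, later data.table releases also accept fill = TRUE in rbindlist itself. Here is a minimal sketch, assuming a data.table version recent enough for rbindlist to have the fill argument; the toy list dfs is made up for illustration and is not from the question:

```r
library(data.table)

# Hypothetical toy inputs with partially overlapping columns
dfs <- list(
  data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE),
  data.frame(a = 3:4, c = c(TRUE, FALSE)),
  data.frame(b = "z", stringsAsFactors = FALSE)
)

# fill = TRUE matches columns by name and pads the missing ones with NA
combined <- rbindlist(dfs, fill = TRUE)
print(combined)
```

The result is a data.table; wrap it in as.data.frame() if a plain data.frame is needed.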
Note 1:
This solution uses data.table's rbindlist function to "rbind" a list of data.tables. For this, be sure to use version 1.8.9 because of this bug in versions < 1.8.9.
Note 2:
When binding lists of data.frames/data.tables, rbindlist will, as of now, retain for each column the data type it has in the first data.frame. That is, if a column in the first data.frame is character and the same column in the second data.frame is factor, then rbindlist will return this column as character. So, if your data.frames consist of all character columns, the result of this method will be identical to that of the plyr method. If not, the values will still be the same, but some columns will be character instead of factor, and you'll have to convert them to factor yourself afterwards. Hopefully this behaviour will change in the future.
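If factor columns are needed after binding, the conversion can be done by reference. This is only a sketch: the table dt and its columns are invented for illustration, and newer data.table versions may handle mixed factor/character columns differently from what Note 2 describes:

```r
library(data.table)

# Hypothetical bound result in which the text columns came back as character
dt <- data.table(id = 1:3, grp = c("a", "b", "a"), lab = c("x", "y", "z"))

# Find the character columns and convert them to factor in place
chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]
dt[, (chr_cols) := lapply(.SD, as.factor), .SDcols = chr_cols]

str(dt)
```

To convert only some columns, set chr_cols to an explicit vector of names.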
And now here's the solution using data.table (and a benchmarking comparison with rbind.fill from plyr):
require(data.table)
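rbind.fill.DT <- function(ll) {
  # changed sapply to lapply to return a list always
  all.names <- lapply(ll, names)
  unq.names <- unique(unlist(all.names))
  ll.m <- rbindlist(lapply(seq_along(ll), function(x) {
    tt <- ll[[x]]
    # turn the data.frame into a data.table by reference (no copy)
    setattr(tt, 'class', c('data.table', 'data.frame'))
    # settruelength is an internal (unexported) function and may change between versions
    data.table:::settruelength(tt, 0L)
    invisible(alloc.col(tt))
    # add every column this element lacks, filled with NA
    tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
    # put the columns in the same order in every element
    setcolorder(tt, unq.names)
  }))
}
rbind.fill.PLYR <- function(ll) {
  rbind.fill(ll)
}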
require(microbenchmark)
microbenchmark(t1 <- rbind.fill.DT(ll), t2 <- rbind.fill.PLYR(ll), times=10)
# Unit: seconds
# expr min lq median uq max neval
# t1 <- rbind.fill.DT(ll) 10.8943 11.02312 11.26374 11.34757 11.51488 10
# t2 <- rbind.fill.PLYR(ll) 121.9868 134.52107 136.41375 184.18071 347.74724 10
# for comparison change t2 to data.table
setattr(t2, 'class', c('data.table', 'data.frame'))
data.table:::settruelength(t2, 0L)
invisible(alloc.col(t2))
setcolorder(t2, unique(unlist(sapply(ll, names))))
identical(t1, t2) # [1] TRUE
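The core idea in rbind.fill.DT above (pad every element to the union of all column names, put the columns in a common order, then bind) can also be sketched in plain base R. This is an illustration of the technique, not the code that was benchmarked; pad_and_bind and the toy frames are hypothetical, and it pads with a plain NA rather than NA_character_ so that non-character columns keep their type:

```r
# Pad each data.frame to the union of column names, then bind the rows
pad_and_bind <- function(frames) {
  all_names <- unique(unlist(lapply(frames, names)))
  padded <- lapply(frames, function(df) {
    # add each missing column as NA (recycled to the number of rows)
    for (nm in setdiff(all_names, names(df))) {
      df[[nm]] <- NA
    }
    # same column order in every element
    df[all_names]
  })
  do.call(rbind, padded)
}

frames <- list(
  data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE),
  data.frame(b = "z", c = "w", stringsAsFactors = FALSE)
)
out <- pad_and_bind(frames)
print(out)
```

do.call(rbind, ...) is slow on long lists, so this version is for understanding the steps rather than for speed.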
It should be noted that plyr's rbind.fill edges past this particular data.table solution until a list size of about 500.
Benchmarking plot:
Here's the plot of runs on lists of data.frames with list lengths seq(1000, 10000, by=1000). I've used microbenchmark with 10 reps for each of these list lengths.
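A benchmark over several list lengths could be set up roughly as follows. This is a sketch under assumptions, not the gist used for the plot: it uses much shorter lists and fewer reps so it finishes quickly, it compares plyr's rbind.fill with rbindlist(fill = TRUE) from a recent data.table rather than with rbind.fill.DT, and make_frame and list_lengths are names invented here:

```r
library(plyr)
library(data.table)
library(microbenchmark)

set.seed(45)

# Same kind of test data as in the question: 10 rows, 5 to 15 random columns
make_frame <- function() {
  nam <- sample(LETTERS, sample(5:15, 1))
  val <- data.frame(
    matrix(sample(letters, length(nam) * 10, replace = TRUE), nrow = 10),
    stringsAsFactors = FALSE
  )
  setNames(val, nam)
}

# Assumption: small lengths for a quick run (the text uses 1000 to 10000)
list_lengths <- c(100, 200, 500)

timings <- lapply(list_lengths, function(n) {
  frames <- replicate(n, make_frame(), simplify = FALSE)
  mb <- microbenchmark(
    plyr_fill = rbind.fill(frames),
    dt_fill   = rbindlist(frames, fill = TRUE),
    times = 3
  )
  # keep the median in milliseconds so rows are comparable across lengths
  res <- summary(mb, unit = "ms")[, c("expr", "median")]
  res$n <- n
  res
})

result <- do.call(rbind, timings)
print(result)
```

Plotting median against n for each expr gives a curve of the same kind as the one described.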
Benchmarking gist:
Here's the gist for benchmarking, in case anyone wants to replicate the results.