do.call(rbind, list) for uneven number of columns
I have a list in which each element is a named character vector, and the vectors differ in length. I would like to bind the data as rows so that the column names line up: if a vector has extra data, a new column should be created, and if data is missing, the gap should be filled with NA.
Below is a mock example of the data I am working with:
x <- list()
x[[1]] <- letters[seq(2,20,by=2)]
names(x[[1]]) <- LETTERS[c(1:length(x[[1]]))]
x[[2]] <- letters[seq(3,20, by=3)]
names(x[[2]]) <- LETTERS[seq(3,20, by=3)]
x[[3]] <- letters[seq(4,20, by=4)]
names(x[[3]]) <- LETTERS[seq(4,20, by=4)]
The line below is what I would normally do if I were sure that every element had the same format:
do.call(rbind,x)
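To make the problem concrete, here is a small base R sketch of what that call does with the uneven vectors. rbind() lines values up by position rather than by name and recycles the shorter vectors, so values land under the wrong column. The list is rebuilt inline only so the snippet runs on its own.

```r
# Rebuild the mock list from above so the snippet is self-contained.
x <- list(
  setNames(letters[seq(2, 20, by = 2)], LETTERS[1:10]),
  setNames(letters[seq(3, 20, by = 3)], LETTERS[seq(3, 20, by = 3)]),
  setNames(letters[seq(4, 20, by = 4)], LETTERS[seq(4, 20, by = 4)])
)

# rbind() warns here because the second vector (length 6) does not divide 10;
# the warning is suppressed only to keep the demonstration quiet.
m <- suppressWarnings(do.call(rbind, x))

dim(m)       # 3 rows, 10 columns (the length of the longest vector)
colnames(m)  # taken from the first vector only
m[2, "A"]    # "c": the value that was named "C" ended up in column "A"
```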
I was hoping that someone had come up with a nice little solution that matches up the column names and fills in blanks with NAs, while adding new columns whenever new names turn up during binding.
plyr's rbind.fill is an awesome function that does really well on a list of data.frames. But IMHO, for this case, where the list contains only (named) vectors, it can be done much faster.
The rbind.fill way:
library(plyr)
rbind.fill(lapply(x, function(y) as.data.frame(t(y), stringsAsFactors = FALSE)))
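The key step in that one-liner is turning each named vector into a one-row data.frame, which is the shape rbind.fill works on. A minimal base R sketch of just that conversion, using a made-up vector y for illustration:

```r
y <- c(C = "c", F = "f", I = "i")  # hypothetical named vector

# t() turns the vector into a 1 x 3 character matrix whose column names are
# the vector's names; as.data.frame() then yields a one-row data.frame.
one_row <- as.data.frame(t(y), stringsAsFactors = FALSE)

dim(one_row)    # 1 row, 3 columns
names(one_row)  # "C" "F" "I"
```

Building one data.frame per vector is probably a large part of the cost of this approach on long lists.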
A more straightforward way (and efficient for this scenario at least):
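rbind.named.fill <- function(x) {
  # lapply (not sapply) keeps nam a list even if all vectors have equal length
  nam <- lapply(x, names)
  # all names, in order of first appearance; these become the columns
  unam <- unique(unlist(nam))
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    # reorder values to the common columns; unmatched positions become NA
    out[[i]] <- unname(x[[i]])[match(unam, nam[[i]])]
  }
  setNames(as.data.frame(do.call(rbind, out), stringsAsFactors = FALSE), unam)
}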
Basically, we collect the unique names across all vectors to form the columns of the final data.frame. Then we create a list with the same length as the input and, for each vector, place its values under the matching names and fill the remaining positions with NA. This is probably the "trickiest" part, as we have to match the names while filling in NA. Finally, we set the column names once at the end (this can also be done by reference using setnames from the data.table package, if need be).
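The matching step can be seen in isolation on a single vector. This is a minimal sketch with made-up inputs (v and unam here are illustrative, not taken from the data above):

```r
v <- c(D = "d", H = "h")            # hypothetical named vector
unam <- c("A", "D", "H", "L")       # hypothetical full set of column names

# For each target column, the position of that name in v, or NA if absent.
idx <- match(unam, names(v))        # NA 1 2 NA

# Indexing with NA yields NA, so missing columns are filled automatically.
filled <- unname(v)[idx]            # NA "d" "h" NA
```

Because every vector is expanded to the same length and order this way, the final do.call(rbind, out) is safe.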
Now to some benchmarking:
Data:
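# generate some huge random data:
set.seed(45)
sample.fun <- function() {
  # draw a single length between 5 and 15, then that many distinct names
  nam <- sample(LETTERS, sample(5:15, 1))
  val <- sample(letters, length(nam))
  setNames(val, nam)
}
# simplify = FALSE guarantees a list, whatever the lengths drawn
ll <- replicate(1e4, sample.fun(), simplify = FALSE)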
Functions:
# plyr's rbind.fill version:
rbind.fill.plyr <- function(x) {
  rbind.fill(lapply(x, function(y) as.data.frame(t(y), stringsAsFactors = FALSE)))
}
# the direct version (same function as above):
rbind.named.fill <- function(x) {
  # lapply (not sapply) keeps nam a list even if all vectors have equal length
  nam <- lapply(x, names)
  unam <- unique(unlist(nam))
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- unname(x[[i]])[match(unam, nam[[i]])]
  }
  setNames(as.data.frame(do.call(rbind, out), stringsAsFactors = FALSE), unam)
}
Update (added GSee's function as well):
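foo <- function(...) {
  dargs <- list(...)
  # all names across the inputs, in order of first appearance
  all.names <- unique(names(unlist(dargs)))
  # index each vector by name; names a vector lacks yield NA
  out <- do.call(rbind, lapply(dargs, `[`, all.names))
  colnames(out) <- all.names
  as.data.frame(out, stringsAsFactors = FALSE)
}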
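This version relies on the fact that indexing a named vector by a character vector returns NA for any name the vector does not have. A minimal sketch of that behaviour, with made-up inputs:

```r
v <- c(D = "d", H = "h", L = "l")         # hypothetical named vector
all.names <- c("A", "D", "H", "L", "P")   # hypothetical full set of names

# Names present in v return their value; absent names return NA.
picked <- v[all.names]
values <- unname(picked)                  # NA "d" "h" "l" NA
```

Since every indexed vector has the same length and order, rbind can stack them directly, and the column names are then set explicitly from all.names.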
Benchmarking:
library(microbenchmark)
microbenchmark(t1 <- rbind.named.fill(ll),
               t2 <- rbind.fill.plyr(ll),
               t3 <- do.call(foo, ll), times = 10)
identical(t1, t2) # TRUE
identical(t1, t3) # TRUE
Unit: milliseconds
expr                               min         lq     median         uq        max neval
t1 <- rbind.named.fill(ll)    243.0754   258.4653   307.2575   359.4332   385.6287    10
t2 <- rbind.fill.plyr(ll)   16808.3334 17139.3068 17648.1882 17890.9384 18220.2534    10
t3 <- do.call(foo, ll)        188.5139   204.2514   229.0074   339.6309   359.4995    10
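To put those numbers in perspective, the ratio of the median timings can be worked out directly. The values below are copied from the table above; only the variable names are mine:

```r
# Median timings in milliseconds, copied from the benchmark table.
med <- c(rbind.named.fill = 307.2575,
         rbind.fill.plyr  = 17648.1882,
         foo              = 229.0074)

# How many times slower the plyr route is than each alternative.
speedup <- med[["rbind.fill.plyr"]] / med[c("rbind.named.fill", "foo")]
round(speedup)  # roughly 57 and 77
```

So on this data both vector-based versions are well over an order of magnitude faster than going through rbind.fill, with foo slightly ahead.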