Performance of rbind.data.frame
Problem description
I have a list of data frames. I am certain that each of them contains at least one row (some contain exactly one row, others contain a given number of rows) and that they all have the same columns (names and types). In case it matters, I am also certain that there are no NAs anywhere in the rows.
The situation can be simulated like this:
#create one row
onerowdfr<-do.call(data.frame, c(list(), rnorm(100) , lapply(sample(letters[1:2], 100, replace=TRUE), function(x){factor(x, levels=letters[1:2])})))
colnames(onerowdfr)<-c(paste("cnt", 1:100, sep=""), paste("cat", 1:100, sep=""))
#reuse it in a list
someParts<-lapply(rbinom(200, 1, 14/200)*6+1, function(reps){onerowdfr[rep(1, reps),]})
I've set the parameters (of the randomization) so that they approximate my true situation.
Now, I want to unite all these dataframes in one dataframe. I thought using rbind would do the trick, like this:
system.time(
result<-do.call(rbind, someParts)
)
Now, on my system (which is not particularly slow), and with the settings above, this is the output of system.time:
user system elapsed
5.61 0.00 5.62
Nearly 6 seconds for rbind-ing 254 (in my case) rows of 200 variables? Surely there has to be a way to improve the performance here? In my code, I have to do similar things very often (it is a form of multiple imputation), so I need this to be as fast as possible.
Can you build your matrices with numeric variables only and convert to a factor at the end? rbind is a lot faster on numeric matrices.
On my system, using data frames:
> system.time(result<-do.call(rbind, someParts))
user system elapsed
2.628 0.000 2.636
Building the list with all numeric matrices instead:
onerowdfr2 <- matrix(as.numeric(onerowdfr), nrow=1)
someParts2<-lapply(rbinom(200, 1, 14/200)*6+1,
function(reps){onerowdfr2[rep(1, reps),]})
results in a much faster rbind.
> system.time(result2<-do.call(rbind, someParts2))
user system elapsed
0.001 0.000 0.001
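The numeric-matrix route works because an R factor is stored as integer codes plus a levels attribute, so the codes can travel through a numeric matrix and be turned back into a factor afterwards. A minimal sketch of that round trip follows; the variable names are illustrative only and not taken from the question:

```r
# A factor is a vector of integer codes indexing into its levels.
f <- factor(c("a", "b", "a"), levels = c("a", "b"))

codes <- as.integer(f)   # the integer codes behind the factor
lev <- levels(f)         # the labels those codes refer to

# Rebuild the factor from the codes, mapping code k back to lev[k].
restored <- factor(codes, levels = seq_along(lev), labels = lev)

identical(restored, f)
```

The rebuild is only safe when every data frame in the list uses the same levels in the same order; otherwise the same code would mean different labels in different parts.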
EDIT: Here's another possibility; it just combines each column in turn.
> system.time({
+ n <- 1:ncol(someParts[[1]])
+ names(n) <- names(someParts[[1]])
+ result <- as.data.frame(lapply(n, function(i)
+ unlist(lapply(someParts, `[[`, i))))
+ })
user system elapsed
0.810 0.000 0.813
Still not nearly as fast as using matrices though.
EDIT 2:
If you only have numerics and factors, it's not that hard to convert everything to numeric, rbind them, and convert the necessary columns back to factors. This assumes all factors have exactly the same levels. Converting to a factor from an integer is also faster than from a numeric, so I coerce to integer first.
someParts2 <- lapply(someParts, function(x)
matrix(unlist(x), ncol=ncol(x)))
result<-as.data.frame(do.call(rbind, someParts2))
a <- someParts[[1]]
f <- which(sapply(a, class)=="factor")
for(i in f) {
lev <- levels(a[[i]])
result[[i]] <- factor(as.integer(result[[i]]), levels=seq_along(lev), labels=lev)
}
The timing on my system is:
user system elapsed
0.090 0.00 0.091
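For repeated use, the steps from EDIT 2 could be wrapped in a function. The sketch below is one possible packaging, not part of the original answer: the name rbind_via_matrix is hypothetical, and it assumes every column is numeric or a factor, that all parts share the same column order, and that factor levels are identical across parts (which it checks explicitly). It also restores the column names, which the matrix conversion drops.

```r
# Hypothetical helper (the name is illustrative, not a base R function).
# Row-binds a list of data frames whose columns are all numeric or factor.
rbind_via_matrix <- function(parts) {
  first <- parts[[1]]
  fac_cols <- which(vapply(first, is.factor, logical(1)))

  # Guard the key assumption: identical levels in every part.
  for (p in parts) {
    for (i in fac_cols) {
      stopifnot(identical(levels(p[[i]]), levels(first[[i]])))
    }
  }

  # unlist() on a mix of numerics and factors yields a numeric vector
  # in which the factors appear as their integer codes.
  mats <- lapply(parts, function(x) matrix(unlist(x), ncol = ncol(x)))
  result <- as.data.frame(do.call(rbind, mats))
  names(result) <- names(first)

  # Turn the code columns back into factors with the original labels.
  for (i in fac_cols) {
    lev <- levels(first[[i]])
    result[[i]] <- factor(as.integer(result[[i]]),
                          levels = seq_along(lev), labels = lev)
  }
  result
}

# Small usage example, compared against the plain data frame rbind.
set.seed(1)
row <- data.frame(cnt1 = rnorm(1), cat1 = factor("b", levels = c("a", "b")))
parts <- list(row, row[rep(1, 3), ], row)

fast <- rbind_via_matrix(parts)
slow <- do.call(rbind, parts)
rownames(slow) <- NULL   # rbind keeps the original row names; drop them to compare

all.equal(fast, slow)
```

The level check adds a little overhead per call; if the inputs are known to be consistent, as in the question, it could be removed.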