Performance of rbind.data.frame


Problem description

I have a list of dataframes for which I am certain that they all contain at least one row (in fact, some contain only one row, and others contain a given number of rows), and that they all have the same columns (names and types). In case it matters, I am also certain that there are no NA's anywhere in the rows.

The situation can be simulated like this:

#create one row
onerowdfr<-do.call(data.frame, c(list(), rnorm(100) , lapply(sample(letters[1:2], 100, replace=TRUE), function(x){factor(x, levels=letters[1:2])})))
colnames(onerowdfr)<-c(paste("cnt", 1:100, sep=""), paste("cat", 1:100, sep=""))
#reuse it in a list
someParts<-lapply(rbinom(200, 1, 14/200)*6+1, function(reps){onerowdfr[rep(1, reps),]})

I've set the parameters (of the randomization) so that they approximate my true situation.
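
To get a sense of the scale this simulation produces, a quick check on the objects just created (the exact row total varies from draw to draw):

# 200 one- or seven-row pieces, all sharing the same 200 columns
length(someParts)
ncol(someParts[[1]])
sum(sapply(someParts, nrow))    # total number of rows across all pieces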

Now, I want to unite all these dataframes in one dataframe. I thought using rbind would do the trick, like this:

system.time(
result<-do.call(rbind, someParts)
)

Now, on my system (which is not particularly slow), and with the settings above, this is the output of system.time:

   user  system elapsed 
   5.61    0.00    5.62

Nearly 6 seconds for rbind-ing 254 (in my case) rows of 200 variables? Surely there has to be a way to improve the performance here? In my code, I have to do similar things very often (it is a form of multiple imputation), so I need this to be as fast as possible.

Solution

Can you build your matrices with numeric variables only and convert to a factor at the end? rbind is a lot faster on numeric matrices.

On my system, using data frames:

> system.time(result<-do.call(rbind, someParts))
   user  system elapsed 
  2.628   0.000   2.636 

Building the list with all numeric matrices instead:

onerowdfr2 <- matrix(as.numeric(onerowdfr), nrow=1)
someParts2<-lapply(rbinom(200, 1, 14/200)*6+1, 
                   function(reps){onerowdfr2[rep(1, reps),]})

results in a lot faster rbind.

> system.time(result2<-do.call(rbind, someParts2))
   user  system elapsed 
  0.001   0.000   0.001
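
Note that this speed comes with a trade-off: result2 is a plain numeric matrix, and the former factor columns travel along as their underlying integer codes (1 and 2 for the levels "a" and "b"), so they still have to be turned back into factors at the end, which is what EDIT 2 below does. A quick look, using the objects from the snippets above:

# result2 is a numeric matrix, not a data frame; columns 101-200
# (the former factor columns) hold the codes 1 and 2 rather than "a"/"b"
class(result2)
result2[1:3, c(1, 101)]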

EDIT: Here's another possibility; it just combines each column in turn.

> system.time({
+   n <- 1:ncol(someParts[[1]])
+   names(n) <- names(someParts[[1]])
+   result <- as.data.frame(lapply(n, function(i) 
+                           unlist(lapply(someParts, `[[`, i))))
+ })
   user  system elapsed 
  0.810   0.000   0.813  

Still not nearly as fast as using matrices though.

EDIT 2:

If you only have numerics and factors, it's not that hard to convert everything to numeric, rbind them, and convert the necessary columns back to factors. This assumes all factors have exactly the same levels. Converting to a factor from an integer is also faster than from a numeric so I force to integer first.

someParts2 <- lapply(someParts, function(x)
                     matrix(unlist(x), ncol=ncol(x)))
result<-as.data.frame(do.call(rbind, someParts2))
a <- someParts[[1]]
f <- which(sapply(a, class)=="factor")
for(i in f) {
  lev <- levels(a[[i]])
  result[[i]] <- factor(as.integer(result[[i]]), levels=seq_along(lev), labels=lev)
}

The timing on my system is:

   user  system elapsed 
   0.090    0.00    0.091 
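
One practical footnote to this approach, as a sketch using the objects defined above: matrix(unlist(x), ...) drops the column names, so they should be restored from one of the original pieces, and a cheap consistency check against the plain rbind result is easy to add.

# restore the column names lost in the matrix conversion
names(result) <- names(someParts[[1]])

# cheap sanity check against the slow-but-simple version:
# same row count and the same column classes
slow <- do.call(rbind, someParts)
stopifnot(nrow(result) == nrow(slow),
          identical(unname(sapply(result, class)),
                    unname(sapply(slow, class))))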
