What's the higher-performance alternative to for-loops for subsetting data by group-id?


Problem Description

A recurring analysis paradigm I encounter in my research is the need to subset based on all different group id values, performing statistical analysis on each group in turn, and putting the results in an output matrix for further processing/summarizing.

How I typically do this in R is something like the following:

data.mat <- read.csv("...")  
groupids <- unique(data.mat$ID)  # assume there are 100 unique groups

# Pre-allocate the output: one row per group, one column per statistic
results <- matrix(NA, ncol = 3, nrow = 100)  

for(i in 1:100) {  
  tempmat <- subset(data.mat, ID == groupids[i])  

  # Run various stats on tempmat (correlations, regressions, etc.), checking to  
  # make sure this specific group doesn't have NAs in the variables I'm using,  
  # and assign results to x, y, and z, for example.  

  results[i,1] <- x  
  results[i,2] <- y  
  results[i,3] <- z  
}

This ends up working for me, but depending on the size of the data and the number of groups I'm working with, this can take up to three days.

Besides branching out into parallel processing, is there any "trick" for making something like this run faster? For instance, converting the loops into something else (something like an apply with a function containing the stats I want to run inside the loop), or eliminating the need to actually assign the subset of data to a variable?
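
One common alternative to the explicit loop is the split/apply pattern: split the data frame once by ID, then apply the per-group statistics to each piece. A minimal sketch, assuming hypothetical numeric columns Var1 and Var2:

# Split once into a list of per-group data frames (one pass over the data),
# then compute the statistics for each piece.
grouped <- split(data.mat, data.mat$ID)

stats.list <- lapply(grouped, function(g) {
  c(cor(g$Var1, g$Var2, use = "pairwise.complete.obs"))
})

# Bind the per-group vectors into a matrix; rownames carry the group IDs
results <- do.call(rbind, stats.list)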

Maybe this is just common knowledge (or sampling error), but I tried subsetting with brackets in some of my code rather than using the subset command, and it seemed to provide a slight performance gain, which surprised me. Below is the code I used, with the same object names as above, and its timing output:

system.time(for(i in 1:1000){data.mat[data.mat$ID==groupids[i],]})  

   user  system elapsed  
 361.41   92.62  458.32

system.time(for(i in 1:1000){subset(data.mat,ID==groupids[i])})  

   user  system elapsed   
 378.44  102.03  485.94
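
For reference, a self-contained version of this comparison, using synthetic data so it can be reproduced (the group count and row counts here are made up for illustration):

set.seed(1)
data.mat <- data.frame(ID   = rep(1:100, each = 10000),  # 100 groups, 1e6 rows
                       Var1 = rnorm(1e6))
groupids <- unique(data.mat$ID)

# Both forms scan the full data frame on every iteration, which is why the
# total cost grows with the number of groups.
system.time(for(i in 1:100) data.mat[data.mat$ID == groupids[i], ])
system.time(for(i in 1:100) subset(data.mat, ID == groupids[i]))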

Update:

In one of the answers, jorgusch suggested that I use the data.table package to speed up my subsetting. So, I applied it to a problem I ran earlier this week. In a dataset with a little over 1,500,000 rows and 4 columns (ID, Var1, Var2, Var3), I wanted to calculate two correlations in each group (indexed by the "ID" variable). There are slightly more than 50,000 groups. Below is my initial code (very similar to the above):

data.mat <- read.csv("//home....")  
groupids <- unique(data.mat$ID)

# Pre-allocate: one row per group, columns for the ID and the two correlations
results <- matrix(NA, ncol = 3, nrow = length(groupids))  

for(i in 1:length(groupids)) {  
  tempmat <- data.mat[data.mat$ID == groupids[i], ] 

  results[i,1] <- groupids[i]  
  results[i,2] <- cor(tempmat$Var1, tempmat$Var2, use = "pairwise.complete.obs")  
  results[i,3] <- cor(tempmat$Var1, tempmat$Var3, use = "pairwise.complete.obs")    
}  

I'm re-running that right now for an exact measure of how long that took, but from what I remember, I started it running when I got into the office in the morning and it finished sometime in mid-afternoon. Figure 5-7 hours.

Restructuring my code to use data.table....

library(data.table)

data.mat <- read.csv("//home....")  
data.mat <- data.table(data.mat)  

# Returning a list makes data.table turn each element into an output column
testfunc <- function(x, y, z) {  
  temp1 <- cor(x, y, use = "pairwise.complete.obs")  
  temp2 <- cor(x, z, use = "pairwise.complete.obs")  
  list(temp1, temp2)  
}  

system.time(test <- data.mat[, testfunc(Var1, Var2, Var3), by = "ID"])  

 user  system  elapsed  
16.41    0.05    17.44  
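
As an aside, if per-group subsetting is still needed, data.table can avoid the full vector scan entirely by setting a key on ID, which sorts the table and enables binary-search lookups. A sketch (not from the original post), reusing the objects above:

# Keying sorts the table by ID and enables fast binary-search subsetting
setkey(data.mat, ID)

# Extract a single group without scanning every row
one.group <- data.mat[J(groupids[1])]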

Comparing the results using data.table to the ones I got from using a for loop to subset all IDs and record results manually, they seem to give me the same answers (though I'll have to check that a bit more thoroughly). That looks to be a pretty big speed increase.
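
That agreement can be checked programmatically. A quick sketch, assuming numeric group IDs and using data.table's default V1/V2 names for the unnamed list elements:

# Put both result sets in the same group order, then compare numerically
loop.res <- results[order(results[, 1]), ]
dt.res   <- test[order(test$ID), ]

all.equal(as.numeric(loop.res[, 2]), dt.res$V1)
all.equal(as.numeric(loop.res[, 3]), dt.res$V2)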

Update 2:

Running the code using subsets finally finished up again:

   user     system   elapsed  
17575.79  4247.41   23477.00

That comes to roughly 6.5 hours of elapsed time, versus about 17 seconds for the data.table version.

Update 3:

I wanted to see if anything worked out differently using the plyr package that was also recommended. This is my first time using it, so I may have done things somewhat inefficiently, but it still helped substantially compared to the for loop with subsetting.

Using the same variables and setup as before...

library(plyr)

data.mat <- read.csv("//home....")  

# ddply splits data.mat by ID, applies the function to each piece,
# and combines the per-group vectors into a data frame
system.time(hmm <- ddply(data.mat, "ID", function(df)
  c(cor(df$Var1, df$Var2, use = "pairwise.complete.obs"),
    cor(df$Var1, df$Var3, use = "pairwise.complete.obs"))))  

  user  system elapsed  
250.25    7.35  272.09  
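
Although parallel processing was outside the scope of the question, it's worth noting that ddply accepts a .parallel argument, so the same call can be spread over cores with a registered backend. A sketch, assuming the doParallel package and a hypothetical core count:

library(doParallel)
registerDoParallel(cores = 4)  # hypothetical core count

system.time(hmm.par <- ddply(data.mat, "ID", function(df)
  c(cor(df$Var1, df$Var2, use = "pairwise.complete.obs"),
    cor(df$Var1, df$Var3, use = "pairwise.complete.obs")),
  .parallel = TRUE))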

Answer

This is pretty much exactly what the plyr package is designed to make easier. However, it's unlikely to make things much faster; most of the time is probably spent doing the statistics.
