R: Are there any alternatives to loops for subsetting from an optimization standpoint?


Question



A recurring analysis paradigm I encounter in my research is the need to subset the data based on each distinct group id value, perform a statistical analysis on each group in turn, and put the results in an output matrix for further processing/summarizing.

How I typically do this in R is something like the following:

data.mat <- read.csv("...")
groupids <- unique(data.mat$ID) #Assume there are then 100 unique groups

results <- matrix(rep(NA,300),ncol=3,nrow=100)  #preallocate with NA, not the string "NA"

for(i in 1:100) {
  tempmat <- subset(data.mat,ID==groupids[i])

  #Run various stats on tempmat (correlations, regressions, etc), checking to
  #make sure this specific group doesn't have NAs in the variables I'm using
  #and assign results to x, y, and z, for example.

  results[i,1] <- x
  results[i,2] <- y
  results[i,3] <- z
}

This ends up working for me, but depending on the size of the data and the number of groups I'm working with, this can take up to three days.

Besides branching out into parallel processing, is there any "trick" for making something like this run faster? For instance, converting the loops into something else (something like an apply with a function containing the stats I want to run inside the loop), or eliminating the need to actually assign the subset of data to a variable?
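For illustration, here is a minimal sketch of that apply-style idea: split() partitions the data frame by ID in a single pass, and vapply() then runs a per-group function over the pieces. The calc_stats function and the Var1 column it touches are hypothetical stand-ins for whatever statistics are actually being run:

#Hypothetical per-group statistics; returns the x, y, z values
#that the loop above assigns into the results matrix.
calc_stats <- function(tempmat) {
  c(x = mean(tempmat$Var1, na.rm=TRUE),  #placeholder stat
    y = sd(tempmat$Var1, na.rm=TRUE),    #placeholder stat
    z = nrow(tempmat))                   #placeholder stat
}

groups  <- split(data.mat, data.mat$ID)               #subsets every group in one pass
results <- t(vapply(groups, calc_stats, numeric(3)))  #one row of stats per group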

EDIT:

Maybe this is just common knowledge (or sampling error), but I tried subsetting with brackets in some of my code rather than using the subset command, and it seemed to provide a slight performance gain, which surprised me. I have some code I used and output below, using the same object names as above:

> system.time(for(i in 1:1000){data.mat[data.mat$ID==groupids[i],]})
user system elapsed
361.41 92.62 458.32

> system.time(for(i in 1:1000){subset(data.mat,ID==groupids[i])})
user system elapsed
378.44 102.03 485.94
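Whichever subsetting syntax is used, the repeated logical scan of ID can also be avoided entirely by precomputing the row indices for every group in one pass with split() and then indexing by integer inside the loop. A sketch, assuming data.mat and groupids as above:

#One pass over ID builds a named list of row indices, one entry per group.
idx <- split(seq_len(nrow(data.mat)), data.mat$ID)

for(i in 1:length(groupids)) {
  #Integer indexing replaces the per-iteration logical comparison.
  tempmat <- data.mat[idx[[as.character(groupids[i])]], ]
  #...run stats on tempmat as before...
}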

UPDATE:
In one of the answers, jorgusch suggested that I use the data.table package to speed up my subsetting. So, I applied it to a problem I ran earlier this week. In a dataset with a little over 1,500,000 rows, and 4 columns (ID,Var1,Var2,Var3), I wanted to calculate two correlations in each group (indexed by the "ID" variable). There are slightly more than 50,000 groups. Below is my initial code (which is very similar to the above):

data.mat <- read.csv("//home....")
groupids <- unique(data.mat$ID)

results <- matrix(rep("NA",(length(groupids) * 3)),ncol=3,nrow=length(groupids))

for(i in 1:length(groupids)) {
tempmat <- data.mat[data.mat$ID==groupids[i],]

results[i,1] <- groupids[i]
results[i,2] <- cor(tempmat$Var1,tempmat$Var2,use="pairwise.complete.obs")
results[i,3] <- cor(tempmat$Var1,tempmat$Var3,use="pairwise.complete.obs")

}

I'm re-running that right now for an exact measure of how long that took, but from what I remember, I started it running when I got into the office in the morning and it finished sometime in mid-afternoon. Figure 5-7 hours.

Restructuring my code to use data.table....

data.mat <- read.csv("//home....")
data.mat <- data.table(data.mat)

testfunc <- function(x,y,z) {
  temp1 <- cor(x,y,use="pairwise.complete.obs")
  temp2 <- cor(x,z,use="pairwise.complete.obs")
  res <- list(temp1,temp2)
  res
}

system.time(test <- data.mat[,testfunc(Var1,Var2,Var3),by="ID"])

user system elapsed
16.41 0.05 17.44

Comparing the results using data.table to the ones I got from using a for loop to subset all IDs and record results manually, they seem to have given me the same answers (though I'll have to check that a bit more thoroughly). That looks to be a pretty big speed increase.
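As an aside, the same data.table call can also return named result columns by building a named list in j, which avoids the default V1/V2 column names. A sketch under the same assumptions; cor12 and cor13 are just illustrative names:

test <- data.mat[, list(cor12 = cor(Var1, Var2, use="pairwise.complete.obs"),
                        cor13 = cor(Var1, Var3, use="pairwise.complete.obs")),
                 by="ID"]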

UPDATE 2: The re-run of the code using subset() finally finished:

   user     system   elapsed  
17575.79  4247.41   23477.00

UPDATE 3:
I wanted to see if anything worked out differently using the plyr package that was also recommended. This is my first time using it, so I may have done things somewhat inefficiently, but it still helped substantially compared to the for loop with subsetting.

Using the same variables and setup as before...

> data.mat <- read.csv("//home....")
> system.time(hmm <- ddply(data.mat, "ID", function(df) c(cor(df$Var1, df$Var2, use="pairwise.complete.obs"), cor(df$Var1, df$Var3, use="pairwise.complete.obs"))))

  user  system elapsed  
250.25    7.35  272.09  
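Since parallel processing was mentioned at the start, it may be worth noting that ddply also accepts a .parallel argument, which runs the per-group function through a registered foreach backend. A sketch, assuming the doParallel package and an arbitrary core count:

library(plyr)
library(doParallel)
registerDoParallel(cores = 4)  #assumed core count; the backend must be registered first

hmm <- ddply(data.mat, "ID",
             function(df) c(cor(df$Var1, df$Var2, use="pairwise.complete.obs"),
                            cor(df$Var1, df$Var3, use="pairwise.complete.obs")),
             .parallel = TRUE)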

Solution

This is pretty much exactly what the plyr package is designed to make easier. However, it's unlikely to make things much faster: most of the time is probably spent doing the statistics.
