如何使用嵌套while循环优化以下代码?多核有选择吗? [英] How to optimize the following code with nested while-loop? Multicore an option?
问题描述
我遇到了一段需要很长时间才能执行的代码的挑战,我想知道优化这些代码的执行时间的关键技巧是什么.我必须承认输入data.frame是重要的(140,000行),而输出data.frame是大约220,000行.
I am having a challenge with a piece of code that takes very long to execute and I am wondering what are the key tricks to optimize the execution time of this code. I have to admit that the input data.frame is significant (140,000 rows) and that the output data.frame is approximately 220,000 rows.
输入data.frame的示例:
A sample of the input data.frame:
head(extremes)
X_BusinessIDDescription min max month
ID105 2007-12-01 2008-06-01 2007-12-01
ID206 2007-12-01 2009-07-01 2007-12-01
ID204 2007-12-01 2008-02-01 2007-12-01
ID785 2008-07-01 2010-08-01 2008-07-01
ID125 2007-11-01 2008-07-01 2007-11-01
ID107 2007-11-01 2011-06-01 2007-11-01
将随循环扩展的data.frame.初始化data.frame以使结构就位.
The data.frame that will be extended with the loop. The data.frame is initiated to get the structure in place.
output <- extremes[1,]
output
X_BusinessIDDescription min max month
ID105 2007-12-01 2008-06-01 2007-12-01
其他值
IDcounter <- 1
IDmax <- nrow(extremes)
linecounter <- 1
我想优化的while循环:
The while-loop I would like to optimize:
while (IDcounter <= IDmax){
start <- extremes$min[IDcounter]
end <- extremes$max[IDcounter] # add three months
while(start <= end){
output[linecounter,] <- extremes[IDcounter,]
output$month[linecounter] <- start
linecounter <- linecounter+1
start <- seq(start, by ="month", length=2)[2]
}
IDcounter <- IDcounter + 1
}
对于少量的行,此代码执行得非常快,但是随着输出的扩展,它似乎正在减慢速度.
For a small number of rows this code executes pretty quickly, but it seems like it is slowing down as the output extends.
输出看起来像这样:
head(output)
X_BusinessIDDescription min max month
ID105 2007-12-01 2008-06-01 2007-12-01
ID105 2007-12-01 2008-06-01 2008-01-01
ID105 2007-12-01 2008-06-01 2008-02-01
ID105 2007-12-01 2008-06-01 2008-03-01
ID105 2007-12-01 2008-06-01 2008-04-01
ID105 2007-12-01 2008-06-01 2008-05-01
对于极端文件中介于min和max之间的每个月,将创建一行.
For every month in the interval between min and max in the extreme file is an row created.
我也很想学习如何使这段代码能够充分利用可用的计算资源的多个核心.好的,我承认这并不是真正的优化,但是它将减少执行时间,这一点也很重要.
I also would be interested to learn how I can can that this code can take ready of the multiple cores of computing resources available. OK, I admit this is not really an optimization but it will reduce the execution time, which is important as well.
Jochem
推荐答案
正如@CarlWitthoft已经提到的那样,由于许多重复的数据,您必须重新考虑数据结构.
As @CarlWitthoft already mentioned you have to rethink your data structure because of many duplicated data.
在这里找到一种简单的矢量化方法:
Here you find a simple vectorized approach:
## create all possible ranges of months
ranges <- mapply(function(mi, ma) {seq(from=mi, to=ma, by="month")}, mi=extremes$min, ma=extremes$max)
## how many months per ID?
n <- unlist(lapply(ranges, length))
## create new data.frame
output <- data.frame(X_BusinessIDDescription=rep(extremes$X_BusinessIDDescription, n),
min=rep(extremes$min, n),
max=rep(extremes$max, n),
month=as.Date(unlist(ranges), origin="1970-01-01"), stringsAsFactors=FALSE)
与您的方法比较:
extremes <- data.frame(X_BusinessIDDescription=c("ID105", "ID206", "ID204", "ID785", "ID125", "ID107"),
min=as.Date(c("2007-12-01", "2007-12-01", "2007-12-01", "2008-07-01", "2007-11-01", "2007-11-01")),
max=as.Date(c("2008-06-01", "2009-07-01", "2008-02-01", "2010-08-01", "2008-07-01", "2011-06-01")),
month=as.Date(c("2007-12-01", "2007-12-01", "2007-12-01", "2008-07-01", "2007-11-01", "2007-11-01")),
stringsAsFactors=FALSE)
approachWhile <- function(extremes) {
output <- data.frame(X_BusinessIDDescription=NA, min=as.Date("1970-01-01"), max=as.Date("1970-01-01"), month=as.Date("1970-01-01"), stringsAsFactors=FALSE)
IDcounter <- 1
IDmax <- nrow(extremes)
linecounter <- 1
while (IDcounter <= IDmax){
start <- extremes$min[IDcounter]
end <- extremes$max[IDcounter] # add three months
while(start <= end){
output[linecounter,] <- extremes[IDcounter,]
output$month[linecounter] <- start
linecounter <- linecounter+1
start <- seq(start, by ="month", length=2)[2]
}
IDcounter <- IDcounter + 1
}
return(output)
}
approachMapply <- function(extremes) {
ranges <- mapply(function(mi, ma) {seq(from=mi, to=ma, by="month")}, mi=extremes$min, ma=extremes$max)
n <- unlist(lapply(ranges, length))
output <- data.frame(X_BusinessIDDescription=rep(extremes$X_BusinessIDDescription, n),
min=rep(extremes$min, n),
max=rep(extremes$max, n),
month=as.Date(unlist(ranges), origin="1970-01-01"), stringsAsFactors=FALSE)
return(output)
}
identical(approachWhile(extremes), approachMapply(extremes)) ## TRUE
library("rbenchmark")
benchmark(approachWhile(extremes), approachMapply(extremes), order="relative")
# test replications elapsed relative user.self sys.self
#2 approachMapply(extremes) 100 0.176 1.00 0.172 0.000
#1 approachWhile(extremes) 100 6.102 34.67 6.077 0.008
这篇关于如何使用嵌套while循环优化以下代码?多核有选择吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!