R: for loop within a foreach loop
Question
Sample data:
df <- data.frame(loc.id = rep(1:10, each = 80*36),
year = rep(rep(1980:2015, each = 80), times = 10),
day = rep(rep(1:80, times = 36),times = 10),
rain = runif(10*36*80, min = 0 , max = 5),
swc = runif(10*36*80,min = 0, max = 50),
SW.max = rep(runif(10, min = 100, max = 200), each = 80*36),
SW.ini = runif(10*36*80),
PETc = runif(10*36*80, min = 0 , max = 1.3),
SW = NA,
PAW = NA,
aetc = NA)
df contains daily data (80 days per year) for 1980-2015 for 10 locations.
For each location x year combination, I want to do the following calculation:
list.result <- list() # create a list to store all results
ptm <- proc.time()
n <- 0
for(i in seq_along(unique(df$loc.id))){
location <- unique(df$loc.id)[i]
print(location)
for(j in seq_along(unique(df$year))){
yr <- unique(df$year)[j]
print(yr)
df_year <- df[df$loc.id == location & df$year == yr,] # subset data for location i and year y
# for the first row of data frame, i need to calculate some values
SW.ini <- df_year$SW.ini[1]
SW.max <- df_year$SW.max[1]
df_year$PAW[1] <- SW.ini + df_year$rain[1]
df_year$aetc[1] <- ifelse(df_year$PAW[1] >= df_year$swc[1],
df_year$PETc[1],(df_year$PAW[1]/df_year$swc[1])*df_year$PETc[1])
df_year$aetc[1] <- ifelse(df_year$aetc[1] > df_year$PAW[1], df_year$PAW[1], df_year$aetc[1])
df_year$SW[1] <- SW.ini + df_year$rain[1] - df_year$aetc[1]
df_year$SW[1] <- ifelse(df_year$SW[1] > SW.max, SW.max, ifelse(df_year$SW[1] < 0, 0,df_year$SW[1]))
# for row 2 till row n of df_year, I need to do this:
for (day in 2:nrow(df_year)){
df_year$PAW[day] <- df_year$SW[day - 1] + df_year$rain[day]
df_year$aetc[day] <- ifelse(df_year$PAW[day] >= df_year$swc[day], df_year$PETc[day], (df_year$PAW[day]/df_year$swc[day]) * df_year$PETc[day])
df_year$aetc[day] <- ifelse(df_year$aetc[day] > df_year$PAW[day], df_year$PAW[day],df_year$aetc[day])
df_year$SW[day] <- df_year$SW[day - 1] + df_year$rain[day] - df_year$aetc[day]
df_year$SW[day] <- ifelse(df_year$SW[day] > SW.max,SW.max, ifelse(df_year$SW[day] < 0, 0,df_year$SW[day]))
}
n <- n + 1
list.result[[n]] <- df_year
}}
proc.time() - ptm
user system elapsed
8.64 0.00 8.75
final.dat <- data.table::rbindlist(list.result) # rbindlist() comes from the data.table package
This loop is sequential, and I thought it would be a good candidate for foreach in R. I have not really worked with foreach, so some online research brought me to this:
library(doParallel)
cl <- makeCluster(4) # if I understood this correctly, it assigns the number of cores to be used
registerDoParallel(cl)
foreach(i = seq_along(unique(df$loc.id)) %dopar% {
list.result <- list()
for(j in seq_along(1980:2015)){
df_year <- df[df$loc.id == unique(df$loc.id)[i] & df$year == unique(df$year)[j],] # subset data for location i and year y
# for the first row of data frame, i need to calculate some values
SW.ini <- df_year$SW.ini[1]
SW.max <- df_year$SW.max[1]
df_year$PAW[1] <- SW.ini + df_year$rain[1]
df_year$aetc[1] <- ifelse(df_year$PAW[1] >= df_year$swc[1], df_year$PETc[1],(df_year$PAW[1]/df_year$swc[1])*df_year$PETc[1])
df_year$aetc[1] <- ifelse(df_year$aetc[1] > df_year$PAW[1], df_year$PAW[1], df_year$aetc[1])
df_year$SW[1] <- SW.ini + df_year$rain[1] - df_year$aetc[1]
df_year$SW[1] <- ifelse(df_year$SW[1] > SW.max, SW.max, ifelse(df_year$SW[1] < 0, 0,df_year$SW[1]))
# for row 2 till row n of df_year, I need to do this:
for (day in 2:nrow(df_year)){
df_year$PAW[day] <- df_year$SW[day - 1] + df_year$rain[day]
df_year$aetc[day] <- ifelse(df_year$PAW[day] >= df_year$swc[day], df_year$PETc[day], (df_year$PAW[day]/df_year$swc[day]) * df_year$PETc[day])
df_year$aetc[day] <- ifelse(df_year$aetc[day] > df_year$PAW[day], df_year$PAW[day],df_year$aetc[day])
df_year$SW[day] <- df_year$SW[day - 1] + df_year$rain[day] - df_year$aetc[day]
df_year$SW[day] <- ifelse(df_year$SW[day] > SW.max,SW.max, ifelse(df_year$SW[day] < 0, 0,df_year$SW[day]))
}
list.result[[j]] <- df_year
}
dat <- rbindlist(list.result)
fwrite(dat,paste0(i,"dat.csv"))
}
My questions are:
1) Is the above data a good candidate for foreach?
2) There is a for loop within the foreach. Does that make sense?
3) How do I make the above foreach run and return all the results?
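Recommended answer
To address your three questions:
1) I don't think so. (A more computationally efficient approach can eliminate the need to add more processing power altogether.)
2) There is nothing inherently wrong with a for loop inside parallel processing. (In fact, the more computation that needs to be done on each chunk, the more likely a parallel approach is to improve performance.)
3) (Not applicable if you use the approach below.)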
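For completeness, if you did want to go the foreach route, the following is a minimal sketch of one way to make it run and return everything, not a tuned implementation. It assumes the foreach and doParallel packages are installed. The helper run_group() and the small df_demo data frame are hypothetical names made up for this illustration; substitute the df from the question. The key points are that foreach() returns the value of its body, and that .combine controls how the per-task values are put together.

```r
library(foreach)
library(doParallel)

# Small made-up data set with the same columns as the question's df
set.seed(1)
df_demo <- data.frame(loc.id = rep(1:2, each = 2 * 5),
                      year   = rep(rep(2000:2001, each = 5), times = 2),
                      day    = rep(1:5, times = 4),
                      rain   = runif(20, min = 0, max = 5),
                      swc    = runif(20, min = 0, max = 50),
                      SW.max = rep(c(150, 180), each = 10),
                      SW.ini = runif(20),
                      PETc   = runif(20, min = 0, max = 1.3),
                      SW = NA_real_, PAW = NA_real_, aetc = NA_real_)

# Hypothetical helper: the question's day-by-day loop for one location-year
run_group <- function(d) {
  sw_prev <- d$SW.ini[1]
  sw_max  <- d$SW.max[1]
  for (k in seq_len(nrow(d))) {
    d$PAW[k]  <- sw_prev + d$rain[k]
    d$aetc[k] <- if (d$PAW[k] >= d$swc[k]) d$PETc[k] else d$PAW[k] / d$swc[k] * d$PETc[k]
    d$aetc[k] <- min(d$aetc[k], d$PAW[k])
    d$SW[k]   <- min(max(sw_prev + d$rain[k] - d$aetc[k], 0), sw_max)
    sw_prev   <- d$SW[k]
  }
  d
}

cl <- makeCluster(2)   # number of worker processes; adjust to your machine
registerDoParallel(cl)

# One task per location; the inner for loop over years runs inside the worker.
# The last expression of the body is the task's return value, and
# .combine = rbind stacks those values into a single data frame.
final.dat <- foreach(loc = unique(df_demo$loc.id), .combine = rbind) %dopar% {
  d_loc <- df_demo[df_demo$loc.id == loc, ]
  yrs <- unique(d_loc$year)
  res <- vector("list", length(yrs))
  for (j in seq_along(yrs)) {
    res[[j]] <- run_group(d_loc[d_loc$year == yrs[j], ])
  }
  do.call(rbind, res)
}

stopCluster(cl)
```

Parallelizing over locations rather than over every location-year pair keeps each task reasonably large, which is the point made in 2) above.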
Using Rcpp and data.table instead
Compiling the logic with C++ and applying it by group using data.table grouping operations gives a ~2,000x speed-up from your baseline, far greater than you might hope to get by parallelizing.
On your original example, which had 39,420,000 rows, this executes on my machine in 1.883 seconds; on the revised one with 28,800 rows, it executes in 0.004 seconds.
library(data.table)
library(Rcpp)
Define and compile a C++ function, CalcSW(), inline in the R script.
One note: counting in C/C++ starts at 0, unlike R, which starts at 1. That's why the indices are different here.
Rcpp::cppFunction('
List CalcSW(NumericVector SW_ini,
NumericVector SW_max,
NumericVector rain,
NumericVector swc,
NumericVector PETc) {
int n = SW_ini.length();
NumericVector SW(n);
NumericVector PAW(n);
NumericVector aetc(n);
double SW_ini_glob = SW_ini[0];
double SW_max_glob = SW_max[0];
SW[0] = SW_ini_glob;
PAW[0] = SW[0] + rain[0];
if (PAW[0] >= swc[0]){
aetc[0] = PETc[0];
} else {
aetc[0] = PAW[0]/swc[0]*PETc[0];
}
if (aetc[0] > PAW[0]){
aetc[0] = PAW[0];
}
SW[0] = SW[0] + rain[0] - aetc[0];
if(SW[0] > SW_max_glob){
SW[0] = SW_max_glob;
}
if(SW[0] < 0){
SW[0] = 0;
}
for (int i = 1; i < n; i++) {
PAW[i] = SW[i-1] + rain[i];
if (PAW[i] >= swc[i]){
aetc[i] = PETc[i];
} else {
aetc[i] = PAW[i]/swc[i]*PETc[i];
}
if (aetc[i] > PAW[i]){
aetc[i] = PAW[i];
}
SW[i] = SW[i-1] + rain[i] - aetc[i];
if(SW[i] > SW_max_glob){
SW[i] = SW_max_glob;
}
if(SW[i] < 0){
SW[i] = 0;
}
}
return Rcpp::List::create(Rcpp::Named("SW") = SW,
Rcpp::Named("PAW") = PAW,
Rcpp::Named("aetc") = aetc);
}')
Create the data.table:
df <- data.table(loc.id = rep(1:10, each = 80*36),
year = rep(rep(1980:2015, each = 80), times = 10),
day = rep(rep(1:80, times = 36),times = 10),
rain = runif(10*36*80, min = 0 , max = 5),
swc = runif(10*36*80,min = 0, max = 50),
SW_max = rep(runif(10, min = 100, max = 200), each = 80*36),
SW_ini = runif(10*36*80),
PETc = runif(10*36*80, min = 0 , max = 1.3),
SW = as.numeric(NA),
PAW = as.numeric(NA),
aetc = as.numeric(NA))
setkey(df, loc.id, year, day)
Execute the function CalcSW() on df for each combination of loc.id and year, assigning the returned values to the three columns simultaneously:
system.time({
df[, c("SW","PAW","aetc") := CalcSW(SW_ini,
SW_max,
rain,
swc,
PETc), keyby = .(loc.id, year)]
})
...
user system elapsed
0.004 0.000 0.004
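The grouped assignment pattern used here can be tried on a toy table first. The sketch below assumes only that data.table is installed; the table dt and the helper run_stats() are made-up names for illustration. A function that returns a list of equal-length vectors can fill several columns at once with :=, and by (or keyby) makes it run once per group.

```r
library(data.table)

dt <- data.table(g = c("a", "a", "b", "b"),
                 x = c(1, 2, 10, 20))

# Hypothetical helper returning one vector per output column
run_stats <- function(x) {
  list(csum = cumsum(x), cmax = cummax(x))
}

# The list elements are matched to the column names by position,
# and the function is called separately for each value of g
dt[, c("csum", "cmax") := run_stats(x), by = g]
```

Because := modifies dt by reference, no copy of the table is made and nothing needs to be reassigned.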
Results:
head(df)
...
loc.id year day rain swc SW_max SW_ini PETc SW PAW aetc
1: 1 1980 1 0.35813251 28.360715 177.3943 0.69116310 0.2870478 1.038675 1.049296 0.01062025
2: 1 1980 2 1.10331116 37.013022 177.3943 0.02742273 0.4412420 2.125335 1.396808 0.01665171
3: 1 1980 3 1.76680011 32.509970 177.3943 0.66273062 1.1071233 3.807561 2.483467 0.08457420
4: 1 1980 4 3.20966558 8.252797 177.3943 0.12220454 0.3496968 6.840713 4.165693 0.17651342
5: 1 1980 5 1.32498191 14.784203 177.3943 0.66381497 1.2168838 7.573160 7.198845 0.59253503
6: 1 1980 6 0.02547458 47.903637 177.3943 0.21871598 1.0864713 7.418750 7.931292 0.17988449
I'm not 100% positive I implemented your logic perfectly, but it should be pretty straightforward to tweak wherever I may have missed something; I implemented it in a very similar manner to how you laid it out.
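One way to check the port is to keep a plain-R reference for a single loc.id x year group and compare its output with what the compiled function returns. The sketch below follows the loops in the question as I read them; calc_sw_r() is a hypothetical helper name, and the input numbers are made up so the result can be verified by hand.

```r
# Plain-R reference for one location-year group
calc_sw_r <- function(SW_ini, SW_max, rain, swc, PETc) {
  n <- length(rain)
  SW <- PAW <- aetc <- numeric(n)
  sw_prev <- SW_ini[1]   # soil water carried into day 1
  sw_cap  <- SW_max[1]   # upper bound, constant within a group
  for (d in seq_len(n)) {
    PAW[d] <- sw_prev + rain[d]
    # full PETc when enough water is available, otherwise scaled down
    aetc[d] <- if (PAW[d] >= swc[d]) PETc[d] else PAW[d] / swc[d] * PETc[d]
    aetc[d] <- min(aetc[d], PAW[d])
    # clamp the updated soil water to [0, SW_max]
    SW[d] <- min(max(sw_prev + rain[d] - aetc[d], 0), sw_cap)
    sw_prev <- SW[d]
  }
  list(SW = SW, PAW = PAW, aetc = aetc)
}

# Tiny hand-checkable example
ref <- calc_sw_r(SW_ini = c(1, 1, 1), SW_max = c(3, 3, 3),
                 rain = c(1, 2, 0), swc = c(4, 2, 1), PETc = c(1, 1, 1))

# With CalcSW() compiled and df filled in, one group could be compared like so:
# g <- df[loc.id == 1 & year == 1980]
# all.equal(calc_sw_r(g$SW_ini, g$SW_max, g$rain, g$swc, g$PETc)$SW, g$SW)
```

If the comparison does not return TRUE for a group, the mismatch points at the step where the two versions disagree.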
One other note: it's much easier to write C++ with the auto-indenting and code highlighting you get (whether you're using RStudio or Emacs) if you create a separate file, named something like TestCode.cpp, formatted like below.
Then you can either use Rcpp::sourceCpp("TestCode.cpp") to compile your function in your R script, or copy and paste everything except the first three lines as a character string into Rcpp::cppFunction(), like I did above.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List CalcSW(NumericVector SW_ini,
NumericVector SW_max,
NumericVector rain,
NumericVector swc,
NumericVector PETc) {
int n = SW_ini.length();
NumericVector SW(n);
NumericVector PAW(n);
NumericVector aetc(n);
double SW_ini_glob = SW_ini[0];
double SW_max_glob = SW_max[0];
SW[0] = SW_ini_glob;
PAW[0] = SW[0] + rain[0];
if (PAW[0] >= swc[0]){
aetc[0] = PETc[0];
} else {
aetc[0] = PAW[0]/swc[0]*PETc[0];
}
if (aetc[0] > PAW[0]){
aetc[0] = PAW[0];
}
SW[0] = SW[0] + rain[0] - aetc[0];
if(SW[0] > SW_max_glob){
SW[0] = SW_max_glob;
}
if(SW[0] < 0){
SW[0] = 0;
}
for (int i = 1; i < n; i++) {
PAW[i] = SW[i-1] + rain[i];
if (PAW[i] >= swc[i]){
aetc[i] = PETc[i];
} else {
aetc[i] = PAW[i]/swc[i]*PETc[i];
}
if (aetc[i] > PAW[i]){
aetc[i] = PAW[i];
}
SW[i] = SW[i-1] + rain[i] - aetc[i];
if(SW[i] > SW_max_glob){
SW[i] = SW_max_glob;
}
if(SW[i] < 0){
SW[i] = 0;
}
}
return Rcpp::List::create(Rcpp::Named("SW") = SW,
Rcpp::Named("PAW") = PAW,
Rcpp::Named("aetc") = aetc);
}