Speed up the loop operation in R
Problem description
I have a big performance problem in R. I wrote a function that iterates over a data.frame object. It simply adds a new column to the data.frame and accumulates something (a simple operation). The data.frame has roughly 850K rows. My PC is still working (about 10 hours now) and I have no idea how long it will run.
dayloop2 <- function(temp){
    for (i in 1:nrow(temp)){
        temp[i,10] <- i
        if (i > 1) {
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) {
                temp[i,10] <- temp[i,9] + temp[i-1,10]
            } else {
                temp[i,10] <- temp[i,9]
            }
        } else {
            temp[i,10] <- temp[i,9]
        }
    }
    names(temp)[names(temp) == "V10"] <- "Kumm."
    return(temp)
}
Any ideas on how to speed up this operation?
Recommended answer
The biggest problem, and the root of the inefficiency, is indexing the data.frame: I mean all the lines where you use temp[,].
Try to avoid this as much as possible. I took your function and changed the indexing; here is version_A:
dayloop2_A <- function(temp){
    res <- numeric(nrow(temp))
    for (i in 1:nrow(temp)){
        res[i] <- i
        if (i > 1) {
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) {
                res[i] <- temp[i,9] + res[i-1]
            } else {
                res[i] <- temp[i,9]
            }
        } else {
            res[i] <- temp[i,9]
        }
    }
    temp$`Kumm.` <- res
    return(temp)
}
As you can see, I create a vector res that gathers the results. At the end I add it to the data.frame, so I don't need to mess with names.
So how much better is it?
I ran each function on a data.frame with nrow from 1,000 to 10,000 in steps of 1,000 and measured the time with system.time:
X <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9))
system.time(dayloop2(X))
The result: your version appears to depend exponentially on nrow(X). The modified version has a linear relation, and a simple lm model predicts that the computation for 850,000 rows takes 6 minutes and 10 seconds.
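The timing-and-extrapolation procedure described above could be sketched roughly as follows. This is not the code behind the original figures: fill_loop is a hypothetical stand-in workload so that the snippet runs on its own, and the names sizes, timings, fit and predicted are made up for illustration. Swap in dayloop2_A to repeat the actual experiment.

```r
# Hypothetical stand-in for dayloop2_A: a loop that fills a preallocated vector.
fill_loop <- function(temp) {
  res <- numeric(nrow(temp))
  col9 <- temp[, 9]
  for (i in seq_len(nrow(temp))) res[i] <- col9[i]
  res
}

# Time the function on simulated data frames of increasing size.
sizes <- seq(1000, 10000, by = 1000)
elapsed <- sapply(sizes, function(n) {
  X <- as.data.frame(matrix(sample(1:10, n * 9, TRUE), n, 9))
  system.time(fill_loop(X))[["elapsed"]]
})

# Fit a straight line (elapsed time vs. number of rows) and extrapolate.
timings <- data.frame(n = sizes, elapsed = elapsed)
fit <- lm(elapsed ~ n, data = timings)
predicted <- predict(fit, newdata = data.frame(n = 850000))
```

Extrapolating this far beyond the measured range is only reasonable when the relation really is linear, so the prediction should be read as a rough estimate.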
As Shane and Calimo state in their answers, vectorization is the key to better performance. From your code you can move out of the loop:
- the conditioning
- the initialization of the results (i.e. temp[i,9])
This leads to this code:
dayloop2_B <- function(temp){
    cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
    res <- temp[,9]
    for (i in 1:nrow(temp)) {
        if (cond[i]) res[i] <- temp[i,9] + res[i-1]
    }
    temp$`Kumm.` <- res
    return(temp)
}
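Before comparing timings, it may be worth checking that moving the condition out of the loop does not change the result. A minimal sketch of such a check follows. The names loop_ref and loop_precond are hypothetical: they mirror dayloop2_A and dayloop2_B but return only the result vector. The simulated data uses values 1 to 3 (an assumption, chosen so that matching neighbouring rows occur often).

```r
# Reference: the condition is evaluated inside the loop (as in dayloop2_A).
loop_ref <- function(temp) {
  res <- numeric(nrow(temp))
  for (i in seq_len(nrow(temp))) {
    if (i > 1 && temp[i, 6] == temp[i - 1, 6] && temp[i, 3] == temp[i - 1, 3]) {
      res[i] <- temp[i, 9] + res[i - 1]
    } else {
      res[i] <- temp[i, 9]
    }
  }
  res
}

# Condition and initial values computed once, before the loop (as in dayloop2_B).
loop_precond <- function(temp) {
  n <- nrow(temp)
  cond <- c(FALSE, temp[-n, 6] == temp[-1, 6] & temp[-n, 3] == temp[-1, 3])
  res <- temp[, 9]
  for (i in seq_len(n)) {
    if (cond[i]) res[i] <- temp[i, 9] + res[i - 1]
  }
  res
}

set.seed(1)
n <- 2000
X <- as.data.frame(matrix(sample(1:3, n * 9, TRUE), n, 9))
same <- isTRUE(all.equal(loop_ref(X), loop_precond(X)))
```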
Compare the results for these functions, this time for nrow from 10,000 to 100,000 in steps of 10,000.
Another tweak is to change temp[i,9] to res[i] inside the loop (the two are exactly the same in the i-th loop iteration). Again, this is the difference between indexing a vector and indexing a data.frame.
The second thing: when you look at the loop, you can see that there is no need to loop over all i, only over those that satisfy the condition.
So here we go:
dayloop2_D <- function(temp){
    cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
    res <- temp[,9]
    for (i in (1:nrow(temp))[cond]) {
        res[i] <- res[i] + res[i-1]
    }
    temp$`Kumm.` <- res
    return(temp)
}
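The remaining loop can in principle be removed entirely. The sketch below is a different technique from the one in this answer, not something the answer proposes: it treats each stretch of consecutive matching rows as a run, takes a single global cumsum, and subtracts the total accumulated before each run started. The function name cond_cumsum is hypothetical, and column 9 is converted to numeric to avoid integer overflow on long inputs.

```r
cond_cumsum <- function(temp) {
  n <- nrow(temp)
  # TRUE where row i continues the run started by an earlier row
  cond <- c(FALSE, temp[-n, 6] == temp[-1, 6] & temp[-n, 3] == temp[-1, 3])
  x <- as.numeric(temp[, 9])
  run_id <- cumsum(!cond)          # every FALSE starts a new run
  starts <- which(!cond)           # first row of each run
  cs <- cumsum(x)                  # running total over all rows
  offset <- (cs - x)[starts]       # total accumulated before each run began
  cs - offset[run_id]              # running total restarted at each run start
}

set.seed(42)
n <- 1000
X <- as.data.frame(matrix(sample(1:3, n * 9, TRUE), n, 9))
X$`Kumm.` <- cond_cumsum(X)
```

Whether this beats dayloop2_D on real data has not been measured here; its appeal is that the cost no longer depends on how many rows satisfy the condition.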
The performance gain depends heavily on the structure of the data, more precisely on the percentage of TRUE values in the condition.
For my simulated data, the computation for 850,000 rows takes under one second.
If you want, you can go further. I see at least two things that can be done:
- write C code to do the conditional cumsum
- if you know that the maximum sequence length in your data isn't large, you can change the loop to a vectorized while, something like
n <- nrow(temp)  # cond and res are computed as in dayloop2_D
while (any(cond)) {
    indx <- c(FALSE, cond[-1] & !cond[-n])
    res[indx] <- res[indx] + res[which(indx)-1]
    cond[indx] <- FALSE
}
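The snippet above relies on cond and res from dayloop2_D. Wrapped into a complete function, it might look like the sketch below. The name dayloop2_while is made up for illustration. Each pass updates the first not-yet-processed row of every run at once, so the number of passes equals the length of the longest run of TRUE values in cond.

```r
dayloop2_while <- function(temp) {
  n <- nrow(temp)
  cond <- c(FALSE, temp[-n, 6] == temp[-1, 6] & temp[-n, 3] == temp[-1, 3])
  res <- temp[, 9]
  while (any(cond)) {
    # rows whose condition is TRUE and whose predecessor is already final
    indx <- c(FALSE, cond[-1] & !cond[-n])
    res[indx] <- res[indx] + res[which(indx) - 1]
    cond[indx] <- FALSE            # mark these rows as done
  }
  temp$`Kumm.` <- res
  temp
}

set.seed(7)
n <- 1000
X <- as.data.frame(matrix(sample(1:3, n * 9, TRUE), n, 9))
out <- dayloop2_while(X)
```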
The code used for the simulations and figures is available on GitHub.