Speed up the loop operation in R


Problem description

I have a big performance problem in R. I wrote a function that iterates over a data.frame object. It simply adds a new column to the data.frame and accumulates a value (a simple operation). The data.frame has roughly 850K rows. My PC has been working on it for about 10 hours now, and I have no idea how long it will take to finish.

dayloop2 <- function(temp){
    for (i in 1:nrow(temp)){    
        temp[i,10] <- i
        if (i > 1) {             
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) { 
                temp[i,10] <- temp[i,9] + temp[i-1,10]                    
            } else {
                temp[i,10] <- temp[i,9]                                    
            }
        } else {
            temp[i,10] <- temp[i,9]
        }
    }
    names(temp)[names(temp) == "V10"] <- "Kumm."
    return(temp)
}

Any ideas on how to speed up this operation?

Recommended answer

The biggest problem, and the root of the inefficiency, is indexing the data.frame: all those lines where you use temp[,].
Try to avoid this as much as possible. I took your function and changed the indexing; here is version A:

dayloop2_A <- function(temp){
    res <- numeric(nrow(temp))
    for (i in 1:nrow(temp)){    
        res[i] <- i
        if (i > 1) {             
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) { 
                res[i] <- temp[i,9] + res[i-1]                   
            } else {
                res[i] <- temp[i,9]                                    
            }
        } else {
            res[i] <- temp[i,9]
        }
    }
    temp$`Kumm.` <- res
    return(temp)
}

As you can see, I create a vector res that gathers the results. At the end I add it to the data.frame, so I don't need to mess with names. So how much better is it?

I ran each function on data.frames with nrow from 1,000 to 10,000 in steps of 1,000 and measured the time with system.time:

n <- 1000  # repeated for n = 1000, 2000, ..., 10000
X <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9))
system.time(dayloop2(X))
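The size sweep described above could be scripted roughly as follows. This is only a sketch: the helper name time_by_nrow and the stand-in workload demo_fun are illustrative assumptions and do not come from the answer. Pass dayloop2 or dayloop2_A in place of demo_fun to time the actual functions.

```r
# Time f() on simulated data frames of increasing size.
# `time_by_nrow` is a hypothetical helper name, not part of the original answer.
time_by_nrow <- function(f, sizes) {
    elapsed <- sapply(sizes, function(n) {
        # Same simulated data as in the answer: n rows, 9 integer columns.
        X <- as.data.frame(matrix(sample(1:10, n * 9, TRUE), n, 9))
        system.time(f(X))[["elapsed"]]
    })
    data.frame(n = sizes, elapsed = elapsed)
}

# Stand-in workload (an assumption); replace with dayloop2 or dayloop2_A.
demo_fun <- function(X) cumsum(X[, 9])

timings <- time_by_nrow(demo_fun, seq(1000, 10000, by = 1000))
print(timings)
```

Collecting the timings in a data.frame keeps them ready for plotting or for fitting a model of run time against nrow.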

The result (the timing plot is not reproduced here):

You can see that your version depends exponentially on nrow(X). The modified version has a linear relation, and a simple lm model predicts that the computation for 850,000 rows takes 6 minutes and 10 seconds.
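An extrapolation of this kind might look like the sketch below. It is not the code behind the 6 minutes 10 seconds figure: loop_fun is a hypothetical stand-in for a linear-time loop, the sizes are chosen arbitrarily, and the predicted value will differ by machine. It also assumes the linear trend continues far beyond the measured range.

```r
# Hypothetical linear-time workload standing in for dayloop2_A.
loop_fun <- function(x) {
    res <- numeric(length(x))
    for (i in seq_along(x)) {
        res[i] <- if (i > 1) x[i] + res[i - 1] else x[i]
    }
    res
}

# Measure elapsed time for a range of input sizes.
sizes <- seq(20000, 200000, by = 20000)
elapsed <- sapply(sizes, function(n) {
    system.time(loop_fun(sample(1:10, n, TRUE)))[["elapsed"]]
})
timings <- data.frame(n = sizes, elapsed = elapsed)

# Fit elapsed time as a linear function of n and extrapolate to 850,000 rows.
fit <- lm(elapsed ~ n, data = timings)
pred <- predict(fit, newdata = data.frame(n = 850000))
print(pred)  # predicted seconds; machine dependent
```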

As Shane and Calimo state in their answers, vectorization is a key to better performance. From your code you could move outside of the loop:

  • the condition check
  • the initialization of the results (which are temp[i,9])

This leads to this code:

dayloop2_B <- function(temp){
    cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
    res <- temp[,9]
    for (i in 1:nrow(temp)) {
        if (cond[i]) res[i] <- temp[i,9] + res[i-1]
    }
    temp$`Kumm.` <- res
    return(temp)
}
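To see what the vectorized condition in dayloop2_B computes, here is a small worked example on a made-up five-row data frame (the values are purely illustrative). Dropping the last row on one side and the first row on the other lines up each row with its predecessor.

```r
# Tiny hypothetical data frame; only columns 3, 6 and 9 matter here.
temp <- as.data.frame(matrix(0, nrow = 5, ncol = 9))
temp[, 3] <- c(1, 1, 1, 2, 2)
temp[, 6] <- c(7, 7, 8, 8, 8)
temp[, 9] <- c(10, 20, 30, 40, 50)

n <- nrow(temp)
# Compare row i - 1 (left side) with row i (right side) for i = 2..n.
same6 <- temp[-n, 6] == temp[-1, 6]   # TRUE FALSE TRUE TRUE
same3 <- temp[-n, 3] == temp[-1, 3]   # TRUE TRUE FALSE TRUE
# The first row has no predecessor, so its condition is FALSE.
cond <- c(FALSE, same6 & same3)
print(cond)   # FALSE TRUE FALSE FALSE TRUE

# Same accumulation as in dayloop2_B.
res <- temp[, 9]
for (i in seq_len(n)) {
    if (cond[i]) res[i] <- temp[i, 9] + res[i - 1]
}
print(res)    # 10 30 30 40 90
```

Only rows 2 and 5 match their predecessor in both columns, so only those two rows accumulate the previous result.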

Compare the results for these functions, this time for nrow from 10,000 to 100,000 in steps of 10,000.

Another tweak is to change temp[i,9] to res[i] inside the loop (the two are exactly the same in the i-th iteration). Again, this is the difference between indexing a vector and indexing a data.frame.
Second: when you look at the loop, you can see that there is no need to loop over all i, only over those that satisfy the condition.
So here we go:

dayloop2_D <- function(temp){
    cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
    res <- temp[,9]
    for (i in (1:nrow(temp))[cond]) {
        res[i] <- res[i] + res[i-1]
    }
    temp$`Kumm.` <- res
    return(temp)
}

The performance gain depends heavily on the structure of the data, more precisely on the percentage of TRUE values in the condition. For my simulated data, the computation for 850,000 rows takes less than one second.
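A quick way to check how favourable a data set is would be to measure the share of TRUE values in the condition, as in the sketch below. The seed and the row count are arbitrary choices for illustration. For data simulated as in the answer, columns 3 and 6 are independent draws from 1:10, so each matches its predecessor with probability 1/10 and the loop visits only about 1% of the rows.

```r
set.seed(42)   # arbitrary seed, only for reproducibility of this sketch
n <- 100000    # arbitrary size for illustration
X <- as.data.frame(matrix(sample(1:10, n * 9, TRUE), n, 9))

# Same condition as in dayloop2_D.
cond <- c(FALSE, (X[-n, 6] == X[-1, 6]) & (X[-n, 3] == X[-1, 3]))

# Share of rows for which the loop body in dayloop2_D actually runs.
share_true <- mean(cond)
print(share_true)
```

On real data with long runs of identical keys this share can be much higher, and the speed-up over dayloop2_B correspondingly smaller.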

If you want, you can go further. I see at least two things that can be done:

  • write C code to do the conditional cumsum
  • if you know that the longest sequence in your data isn't long, you can change the loop to a vectorized while, something like

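n <- length(cond)   # cond and res as computed in dayloop2_D
while (any(cond)) {
    # first remaining TRUE of each run: a TRUE preceded by FALSE
    indx <- c(FALSE, cond[-1] & !cond[-n])
    res[indx] <- res[indx] + res[which(indx)-1]
    cond[indx] <- FALSE
}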
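To check that the vectorized while really reproduces the loop, the two can be wrapped in functions and compared on random input, as sketched below. The function names cond_cumsum_loop and cond_cumsum_while and the test data are illustrative assumptions. Both versions assume that cond[1] is FALSE, which the construction of cond in dayloop2_D guarantees.

```r
# Reference: the loop from dayloop2_D, applied to a plain vector.
cond_cumsum_loop <- function(x, cond) {
    res <- x
    for (i in which(cond)) res[i] <- res[i] + res[i - 1]
    res
}

# Vectorized while: each pass handles the first pending TRUE of every run.
# Assumes cond[1] is FALSE; otherwise the loop would never terminate.
cond_cumsum_while <- function(x, cond) {
    res <- x
    n <- length(cond)
    while (any(cond)) {
        # A TRUE preceded by FALSE has a predecessor whose result is final.
        indx <- c(FALSE, cond[-1] & !cond[-n])
        res[indx] <- res[indx] + res[which(indx) - 1]
        cond[indx] <- FALSE
    }
    res
}

set.seed(1)   # arbitrary seed for a reproducible check
x <- sample(1:10, 1000, TRUE)
cond <- c(FALSE, sample(c(TRUE, FALSE), 999, TRUE))

out_loop <- cond_cumsum_loop(x, cond)
out_while <- cond_cumsum_while(x, cond)
stopifnot(identical(out_loop, out_while))
```

The while loop runs once per position within the longest run of TRUE values, which is why it only pays off when runs are short.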

Code used for the simulations and figures is available on GitHub.
