Speed up the loop operation in R

Question

I have a big performance problem in R. I wrote a function that iterates over a data.frame object. It simply adds a new column to the data.frame and accumulates something (a simple operation). The data.frame has roughly 850K rows. My PC is still working (about 10 hours now) and I have no idea how long it will run.

dayloop2 <- function(temp){
    for (i in 1:nrow(temp)){    
        temp[i,10] <- i
        if (i > 1) {             
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) { 
                temp[i,10] <- temp[i,9] + temp[i-1,10]                    
            } else {
                temp[i,10] <- temp[i,9]                                    
            }
        } else {
            temp[i,10] <- temp[i,9]
        }
    }
    names(temp)[names(temp) == "V10"] <- "Kumm."
    return(temp)
}

Any ideas on how to speed up this operation?

Recommended answer

The biggest problem and the root of the inefficiency is indexing the data.frame, that is, all the lines where you use temp[,].
Try to avoid this as much as possible. I took your function and changed the indexing; here is version_A:

dayloop2_A <- function(temp){
    res <- numeric(nrow(temp))
    for (i in 1:nrow(temp)){    
        res[i] <- i
        if (i > 1) {             
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) { 
                res[i] <- temp[i,9] + res[i-1]                   
            } else {
                res[i] <- temp[i,9]                                    
            }
        } else {
            res[i] <- temp[i,9]
        }
    }
    temp$`Kumm.` <- res
    return(temp)
}

As you can see, I create a vector res that gathers the results. At the end I add it to the data.frame, so I don't need to mess with names. So how much better is it?

I ran each function on a data.frame with nrow from 1,000 to 10,000 in steps of 1,000 and measured the time with system.time:

n <- 1000  # number of rows; the benchmark used 1,000 to 10,000 in steps of 1,000
X <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9))
system.time(dayloop2(X))

The result:

You can see that the running time of your version appears to grow exponentially with nrow(X). The modified version has a linear relation, and a simple lm model predicts that the computation for 850,000 rows takes 6 minutes and 10 seconds.
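The measurement and extrapolation described above could be reproduced roughly as follows. This is only a sketch: `time_by_nrow` is a hypothetical helper that is not part of the answer, `dayloop2_A` is rewritten here in condensed form with the same logic, the size grid is smaller than the one in the answer to keep the run short, and the timings will differ from machine to machine.

```r
# Vector-accumulating loop, same logic as dayloop2_A in the answer (condensed).
dayloop2_A <- function(temp) {
    res <- numeric(nrow(temp))
    for (i in seq_len(nrow(temp))) {
        res[i] <- temp[i, 9]
        if (i > 1 && temp[i, 6] == temp[i - 1, 6] && temp[i, 3] == temp[i - 1, 3]) {
            res[i] <- temp[i, 9] + res[i - 1]
        }
    }
    temp$`Kumm.` <- res
    temp
}

# Hypothetical helper: elapsed seconds of f() on simulated data of each size.
time_by_nrow <- function(f, sizes) {
    sapply(sizes, function(n) {
        X <- as.data.frame(matrix(sample(1:10, n * 9, TRUE), n, 9))
        system.time(f(X))[["elapsed"]]
    })
}

sizes <- seq(500, 2500, by = 500)  # assumption: a reduced grid, for a quick run
elapsed <- time_by_nrow(dayloop2_A, sizes)

# Fit a straight line and extrapolate to the 850,000 rows from the question.
fit <- lm(elapsed ~ sizes)
predicted <- predict(fit, newdata = data.frame(sizes = 850000))
print(data.frame(sizes, elapsed))
print(predicted)
```

A straight-line fit is only meaningful if the timings really are linear in the number of rows, so it is worth looking at the printed table before trusting the extrapolated value.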

As Shane and Calimo state in their answers, vectorization is the key to better performance. From your code you can move two things outside the loop:

  • the condition
  • the initialization of the results (which are temp[i,9])

This leads to the following code:

dayloop2_B <- function(temp){
    cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
    res <- temp[,9]
    for (i in 1:nrow(temp)) {
        if (cond[i]) res[i] <- temp[i,9] + res[i-1]
    }
    temp$`Kumm.` <- res
    return(temp)
}
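To see what the vectorized condition computes, here is a small worked example on made-up data. The toy values are assumptions chosen for illustration; only columns 3, 6 and 9 matter, and the remaining columns are filler.

```r
# Toy data frame with 9 columns; columns 3 and 6 are the keys, column 9 the value.
temp <- as.data.frame(matrix(0, nrow = 5, ncol = 9))
temp[, 3] <- c(1, 1, 1, 2, 2)
temp[, 6] <- c(7, 7, 8, 8, 8)
temp[, 9] <- c(10, 20, 30, 40, 50)

n <- nrow(temp)
# Compare rows 2..n with their predecessors in a single vectorized step.
# Row 1 has no predecessor, so its condition is fixed to FALSE.
cond <- c(FALSE, (temp[-1, 6] == temp[-n, 6]) & (temp[-1, 3] == temp[-n, 3]))
print(cond)  # FALSE  TRUE FALSE FALSE  TRUE

# Same accumulation as in dayloop2_B: add the previous result where cond is TRUE.
res <- temp[, 9]
for (i in seq_len(n)) {
    if (cond[i]) res[i] <- temp[i, 9] + res[i - 1]
}
print(res)   # 10 30 30 40 90
```

Rows 2 and 5 match their predecessor in both key columns, so only those two rows accumulate the previous result.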

Compare the results for these functions, this time for nrow from 10,000 to 100,000 in steps of 10,000.

Another tweak is to change the indexing inside the loop from temp[i,9] to res[i] (which are exactly the same in the i-th loop iteration). Again, this is the difference between indexing a vector and indexing a data.frame.
Second thing: when you look at the loop, you can see that there is no need to loop over all i, only over the ones that satisfy the condition.
So here we go:

dayloop2_D <- function(temp){
    cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
    res <- temp[,9]
    for (i in (1:nrow(temp))[cond]) {
        res[i] <- res[i] + res[i-1]
    }
    temp$`Kumm.` <- res
    return(temp)
}
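Before relying on the faster version, it may be worth checking that it produces the same values as the original loop. The sketch below does this on simulated data; `dayloop2` is a condensed rewrite of the function from the question (same logic, without the renaming step), and the seed, the sample size and the value range 1:3 are assumptions picked so that the condition is TRUE reasonably often.

```r
# Condensed version of the loop from the question (result ends up in column 10).
dayloop2 <- function(temp) {
    for (i in 1:nrow(temp)) {
        temp[i, 10] <- temp[i, 9]
        if (i > 1 && temp[i, 6] == temp[i - 1, 6] && temp[i, 3] == temp[i - 1, 3]) {
            temp[i, 10] <- temp[i, 9] + temp[i - 1, 10]
        }
    }
    temp
}

# dayloop2_D as given in the answer.
dayloop2_D <- function(temp) {
    cond <- c(FALSE, (temp[-nrow(temp), 6] == temp[-1, 6]) &
                     (temp[-nrow(temp), 3] == temp[-1, 3]))
    res <- temp[, 9]
    for (i in (1:nrow(temp))[cond]) {
        res[i] <- res[i] + res[i - 1]
    }
    temp$`Kumm.` <- res
    temp
}

set.seed(1)
n <- 500
X <- as.data.frame(matrix(sample(1:3, n * 9, TRUE), n, 9))

ref  <- dayloop2(X)[, 10]
fast <- dayloop2_D(X)[["Kumm."]]
print(isTRUE(all.equal(ref, fast)))
```

The small `n` keeps the slow reference loop quick; the comparison itself does not depend on the size.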

The performance you gain depends heavily on the structure of the data, more precisely on the percentage of TRUE values in the condition. For my simulated data, the computation for 850,000 rows takes less than one second.

If you want, you can go further. I see at least two things that can be done:

  • write C code to do the conditional cumsum
  • if you know that the longest sequence in your data isn't large, you can change the loop to a vectorized while, something like

n <- length(cond)  # number of rows
while (any(cond)) {
    # first element of every remaining run of TRUE values
    indx <- c(FALSE, cond[-1] & !cond[-n])
    res[indx] <- res[indx] + res[which(indx)-1]
    cond[indx] <- FALSE
}
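The vectorized while loop can be wrapped into a function and checked on a small example, as sketched below. Both function names are hypothetical. The second function is not from the answer: it is an alternative that expresses the conditional cumsum in base R with `ave()` and `cumsum()`, grouping rows so that a new group starts wherever the condition is FALSE. Both assume that `cond[1]` is FALSE, as it is in the functions above.

```r
# Hypothetical wrapper around the vectorized while loop.
# Each pass handles the first element of every remaining run of TRUE values,
# so the number of passes equals the length of the longest run.
cond_cumsum_while <- function(x, cond) {
    n <- length(cond)
    res <- x
    while (any(cond)) {
        indx <- c(FALSE, cond[-1] & !cond[-n])
        res[indx] <- res[indx] + res[which(indx) - 1]
        cond[indx] <- FALSE
    }
    res
}

# Alternative, not from the answer: a grouped cumulative sum.
# cumsum(!cond) increases by one at every FALSE, which labels each run.
cond_cumsum_ave <- function(x, cond) {
    ave(x, cumsum(!cond), FUN = cumsum)
}

x    <- c(10, 20, 30, 40, 50, 60)
cond <- c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)

print(cond_cumsum_while(x, cond))  # 10  30  60  40  50 110
print(cond_cumsum_ave(x, cond))    # 10  30  60  40  50 110
```

The while version needs as many passes as the longest run of TRUE values, which is why it only pays off when runs are short; the `ave()` version avoids the explicit loop but splits the data into groups, so which one is faster depends on the data.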

Code used for simulations and figures is available on GitHub.
