Problems with speeding up a loop in R

Question

I have a particularly big dataset which consists of 3.7 million rows and 76 string columns.

I want to compare each row with the row below it and count how many of their values match. I have written the following code to do this:

   library(data.table)  # provides as.data.table(), used below
   a <- c("a","a","a","a","a","a","a","a","a")
   b <- c("b","b","b","b","a","b","b","b","b")
   c <- c("c","c","c","c","a","a","a","b","b")
   d <- c("d","d","d","d","d","d","d","d","d")
   features_split   <- data.frame(a,b,c,d); features_split
   ncol = max(sapply(features_split,length))
   safe <- as.data.table(lapply(1:ncol,function(i)sapply(features_split,"[",i)))
   nrow(safe)
   df <- safe
   LIST  <-list() 
   LIST2 <-list() 
   for(i in 1:(nrow(df)-1)) 
   { 
   LIST[[i]] <-df[i+1,] %in% df[i,] 
   LIST2[[i]] <- length(LIST[[i]][LIST[[i]]==TRUE]) 
   } 
   safe2   <- unlist(LIST2)
   not_available <- rowSums(!is.na(safe))

It takes forever to run that loop. How can I improve it? (It takes about 1 hour for 100,000 rows, but I have more than 3.7 million.)

Grateful for anything, Tobi

Recommended answer

Using a data.frame

Proof of concept, using a data.frame:

set.seed(4)
nr <- 1000
mydf <- data.frame(a=sample(letters[1:3], nr, repl=TRUE),
                   b=sample(letters[1:3], nr, repl=TRUE),
                   c=sample(letters[1:3], nr, repl=TRUE),
                   d=sample(letters[1:3], nr, repl=TRUE),
                   stringsAsFactors=FALSE)
matches <- vapply(seq.int(nrow(mydf)-1),
                  function(ii) sum(mydf[ii,] == mydf[ii+1,]),
                  integer(1))
head(matches)
## [1] 0 3 4 2 1 0
sum(matches == 4) # total number of perfect row-matches
## 16

In matches, the integer in position i indicates how many strings from row i exactly match the corresponding string from row i+1. A value of 0 means no matches at all, and (in this case) 4 means the two rows match perfectly.
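To make the meaning of these counts concrete, here is a minimal sketch that applies the same idea to the four example columns from the question, kept as ordinary data.frame columns. The names `toy` and `toy_matches` are made up for this illustration.

```r
# The question's four example columns, one value per row (9 rows).
toy <- data.frame(
  a = rep("a", 9),
  b = c("b", "b", "b", "b", "a", "b", "b", "b", "b"),
  c = c("c", "c", "c", "c", "a", "a", "a", "b", "b"),
  d = rep("d", 9),
  stringsAsFactors = FALSE
)

# For each row ii, count the columns whose value equals the value in row ii + 1.
toy_matches <- vapply(seq_len(nrow(toy) - 1),
                      function(ii) sum(toy[ii, ] == toy[ii + 1, ]),
                      integer(1))
toy_matches
## [1] 4 4 4 2 3 4 3 4
```

Rows 1 to 4 are identical, so the first three counts are 4. Row 5 differs from row 4 in columns b and c, which gives the 2 in the fourth position.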

Making it a bit larger to demonstrate the timing:

nr <- 100000
nc <- 76
mydf2 <- as.data.frame(matrix(sample(letters[1:4], nr*nc, repl=TRUE), nc=nc),
                       stringsAsFactors=FALSE)
dim(mydf2)
## [1] 100000     76
system.time(
    matches2 <- vapply(seq.int(nrow(mydf2)-1),
                       function(ii) sum(mydf2[ii,] == mydf2[ii+1,]),
                       integer(1))
    )
##    user  system elapsed
##  370.63   12.14  385.36

Using a matrix instead

If you can afford to work with a matrix instead of a data.frame (which is possible here because all the data is of the same type, "character"), you'll get considerably better performance:

nr <- 100000
nc <- 76
mymtx2 <- matrix(sample(letters[1:4], nr*nc, repl=TRUE), nc=nc)
dim(mymtx2)
## [1] 100000     76

system.time(
    matches2 <- vapply(seq.int(nrow(mymtx2)-1),
                       function(ii) sum(mymtx2[ii,] == mymtx2[ii+1,]),
                       integer(1))
    )
##     user  system elapsed 
##    0.81    0.00    0.81 

(Compare with the 370.63 seconds of user time from the previous run.) Scaling it up to full strength:

nr <- 3.7e6
nc <- 76
mymtx3 <- matrix(sample(letters[1:4], nr*nc, repl=TRUE), nc=nc)
dim(mymtx3)
## [1] 3700000      76
system.time(
    matches3 <- vapply(seq.int(nrow(mymtx3)-1),
                       function(ii) sum(mymtx3[ii,] == mymtx3[ii+1,]),
                       integer(1))
    )
##     user  system elapsed 
##   35.32    0.05   35.81 

length(matches3)
## [1] 3699999
sum(matches3 == nc)
## [1] 0

Unfortunately, there are still no perfect row matches, but I think 36 seconds for 3.7M rows is considerably better than an hour for 100K. (Please correct me if I've made an incorrect assumption.)

(Ref: win7 x64, R-3.0.3-64bit, intel i7-2640M 2.8GHz, 8GB RAM)
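
As a further variation that is not part of the timings above, the per-row function call can be dropped entirely by comparing the matrix without its last row against the matrix without its first row. The sketch below is a hedged illustration under two assumptions: the data starts out as a data.frame of character columns that `as.matrix()` can convert, and there is enough memory for the temporary copies the subsetting creates. All names (`demo_df`, `demo_mtx`, `loop_matches`, `vec_matches`) are made up for the example.

```r
set.seed(4)
nr <- 1000
# Hypothetical stand-in for the real data: four character columns.
demo_df <- data.frame(a = sample(letters[1:3], nr, replace = TRUE),
                      b = sample(letters[1:3], nr, replace = TRUE),
                      c = sample(letters[1:3], nr, replace = TRUE),
                      d = sample(letters[1:3], nr, replace = TRUE),
                      stringsAsFactors = FALSE)

# Convert once; every column is character, so the result is a character matrix.
demo_mtx <- as.matrix(demo_df)
n <- nrow(demo_mtx)

# Row-by-row version, as in the answer above.
loop_matches <- vapply(seq_len(n - 1),
                       function(ii) sum(demo_mtx[ii, ] == demo_mtx[ii + 1, ]),
                       integer(1))

# Vectorized version: rows 1..(n-1) compared element-wise with rows 2..n,
# then the TRUE values are counted per row.
vec_matches <- rowSums(demo_mtx[-n, , drop = FALSE] == demo_mtx[-1, , drop = FALSE])

# Both approaches should give the same counts.
stopifnot(all(vec_matches == loop_matches))
```

Note that `rowSums()` returns a numeric (double) vector rather than an integer one, and that the two row-shifted copies plus the logical comparison matrix all have to fit in memory, which may matter at 3.7 million rows by 76 columns.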
