R code runs too slow: how to rewrite this code

Views: 114 · Published: 2020/9/20 19:50:11 · Tags: r, loops, bigdata


Problem description

input.txt contains 8,000,000 rows and 4 columns. The first two columns are text and the last two are numbers. The number of unique symbols (e.g., "c33") in columns 1 and 2 is not fixed. Columns 3 and 4 hold the number of unique symbols in columns 1 and 2, respectively, after splitting on "]". Each row of input.txt looks like this:

c33]c21]c5]c7]c8]c9  TPS2]MIC17]ERG3]NNF1]CIS3]CWP2  6  6

**The desired result:**

Here row[i, 1] and row[i, 2] denote the symbols in columns 1 and 2 of row i, such as "c33 c21 c5 c7 c8 c9" or "TPS2 MIC17 ERG3 NNF1 CIS3 CWP2", and |.| denotes the number of symbols, e.g. |c33 c21 c5 c7 c8 c9| = 6.

If two rows overlap (overlap >= 0.6), the row numbers of those two rows are written to a file.
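In the notation just introduced, and judging from how the code in the question computes its up and down variables, the overlap between rows i and j is presumably the following (a reconstruction from the code, not a formula quoted from the question):

```latex
\[
\mathrm{overlap}(i,j) =
\frac{\bigl|\mathrm{row}[i,1] \cap \mathrm{row}[j,1]\bigr| \cdot \bigl|\mathrm{row}[i,2] \cap \mathrm{row}[j,2]\bigr|}
     {\min\bigl(\,|\mathrm{row}[i,1]| \cdot |\mathrm{row}[i,2]|,\;\; |\mathrm{row}[j,1]| \cdot |\mathrm{row}[j,2]|\,\bigr)}
\]
```

A pair (i, j) is reported when this value is at least 0.6.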

My code is below, but it runs too slowly.

Code:

library(compiler)
enableJIT(3)
data <- read.table("input.txt", header = FALSE)
row <- 8000000  # number of rows in input.txt
for (i in 1:(row - 1)) {
  # Symbols of row i, split on "]"
  row11 <- unlist(strsplit(as.character(data[i, 1]), "]"))
  row12 <- unlist(strsplit(as.character(data[i, 2]), "]"))
  s1 <- data[i, 3] * data[i, 4]
  # One output file per row i
  zz <- file(paste("output", i, ".txt", sep = ""), "w")
  for (j in (i + 1):row) {
    # Symbols of row j are re-split on every pass of the inner loop
    row21 <- unlist(strsplit(as.character(data[j, 1]), "]"))
    row22 <- unlist(strsplit(as.character(data[j, 2]), "]"))
    up <- length(intersect(row11, row21)) * length(intersect(row12, row22))
    s2 <- data[j, 3] * data[j, 4]
    down <- min(s1, s2)
    if ((up / down) >= 0.6) cat(i, "\t", j, "\n", file = zz, append = TRUE)
  }
  close(zz)
}

Result: each row i produces its own file, output<i>.txt, listing the pairs i, j that overlap.

To make it run faster, I rewrote the code. The new version is as follows.

input.txt now contains 16,000,000 rows. The number of columns is not fixed, because the number of unique symbols (e.g., "c33") in the former columns 1 and 2 is not fixed. Each pair of rows in input.txt looks like this:

The 1st row  (odd row1):  c33 c21 c5 c7 c8
The 2nd row (even row1): TPS2 MIC17 ERG3 NNF1 CIS3 CWP2 MCM6
The 3rd row  (odd row2): c33 c21 c5 c21 c18 c4 c58
The 4th row (even row2): TPS12 MIC3 ERG2 NNF1 CIS4

**The desired result:**

If one pair of rows overlaps (overlap >= 0.6) with another pair of rows, the numbers of those two pairs are written to a file.

Code:

library(compiler)
enableJIT(3)
con <- file("input.txt", "r")
zz <- file("output.txt", "w")
oddrow1 <- readLines(con, n = 1)
j <- 0  # index of the current pair of rows
i <- 0  # index of the pair it is compared against
while (length(oddrow1) != 0) {
  oddrow1 <- strsplit(oddrow1, " ")
  evenrow1 <- readLines(con, n = 1)
  evenrow1 <- strsplit(evenrow1, " ")
  j <- j + 1
  # Reopen the file and skip the first j pairs (2 * j lines)
  con2 <- file("input.txt", "r")
  readLines(con2, n = (j * 2))
  oddrow2 <- readLines(con2, n = 1)
  i <- j
  while (length(oddrow2) != 0) {
    i <- i + 1
    oddrow2 <- strsplit(oddrow2, " ")
    evenrow2 <- readLines(con2, n = 1)
    evenrow2 <- strsplit(evenrow2, " ")
    oddrow1 <- unlist(oddrow1)
    oddrow2 <- unlist(oddrow2)
    evenrow1 <- unlist(evenrow1)
    evenrow2 <- unlist(evenrow2)
    up <- length(intersect(oddrow1, oddrow2)) * length(intersect(evenrow1, evenrow2))
    down <- min(length(oddrow1) * length(evenrow1), length(oddrow2) * length(evenrow2))

    if ((up / down) >= 0.6) {
      cat(j, "\t", i, "\n", file = zz, append = TRUE)
    }
    oddrow2 <- readLines(con2, n = 1)
  }
  close(con2)
  oddrow1 <- readLines(con, n = 1)
}
close(con)
close(zz)

Result: a single file, output.txt, is produced, with one line per overlapping pair j, i.

Both of the above methods are too slow. How can I rewrite this code so that it runs fast? Thank you!

Recommended answer

Well, I suspect this uses too much memory for data of your size, but perhaps it will provoke some ideas.

Make up some data, with 20 unique values in total for each column and 5 to 10 of them in each cell.

set.seed(5)
n <- 1000L
ng <- 20
g1 <- paste(sample(10000:99999, ng))
g2 <- paste(sample(10000:99999, ng))
n1 <- sample(5:10, n, replace=TRUE)
n2 <- sample(5:10, n, replace=TRUE)
x1 <- sapply(n1, function(i) paste(g1[sample(ng, i)], collapse="|"))
x2 <- sapply(n2, function(i) paste(g2[sample(ng, i)], collapse="|"))

Load the Matrix library and define a helper function that takes a list of string vectors and converts it to a matrix with one column per unique string and 1s wherever a string is present in a row.

library(Matrix)
str2mat <- function(s) {
  n <- length(s)
  ni <- sapply(s, length)
  s <- unlist(s)
  u <- unique(s)
  spMatrix(nrow=n, ncol=length(u), i=rep(1L:n, ni), j=match(s, u), x=rep(1, length(s)))
}
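To see what such a helper produces, here is a minimal sketch on toy data. It uses a variant written with sparseMatrix() rather than spMatrix(); the name str2mat_alt and the toy list are made up for illustration and are not part of the answer.

```r
library(Matrix)

# Hypothetical variant of the helper above, built with sparseMatrix()
str2mat_alt <- function(s) {
  n <- length(s)
  ni <- lengths(s)                 # number of strings in each list element
  s <- unlist(s)
  u <- unique(s)                   # one column per distinct string
  sparseMatrix(i = rep(seq_len(n), ni), j = match(s, u),
               x = rep(1, length(s)), dims = c(n, length(u)))
}

toy <- list(c("a", "b", "c"), c("b", "c", "d"), c("e"))
m <- str2mat_alt(toy)
as.matrix(m)                       # 3 x 5 indicator matrix, columns a, b, c, d, e

# Entry [i, j] of m %*% t(m) counts the strings shared by rows i and j
shared <- as.matrix(tcrossprod(m))
shared[1, 2]                       # 2: rows 1 and 2 share "b" and "c"
```

Because every row is a 0/1 indicator vector, the product of the matrix with its own transpose gives all pairwise intersection sizes at once, which is what replaces the nested loops.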

OK, now we can actually do something. First create the matrices and get the total number of strings present in each row.

m1 <- str2mat(strsplit(x1, "|", fixed=TRUE))
m2 <- str2mat(strsplit(x2, "|", fixed=TRUE))
n1 <- rowSums(m1)
n2 <- rowSums(m2)
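The answer works on made-up data separated by "|". For the question's first file layout, where symbols are separated by "]", the per-row string lists might be obtained along these lines. This is only a sketch: the temporary file, the second sample line, and the column names sym1, sym2, n1, n2 are assumptions made for illustration.

```r
# Write two lines in the question's layout to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "c33]c21]c5]c7]c8]c9 TPS2]MIC17]ERG3]NNF1]CIS3]CWP2 6 6",
  "c33]c21]c4 TPS2]ERG3 3 2"          # made-up second row
), tmp)

# Read the text columns as character, not factor
dat <- read.table(tmp, header = FALSE, stringsAsFactors = FALSE,
                  col.names = c("sym1", "sym2", "n1", "n2"))

# Split every row once, up front, instead of inside a double loop
s1 <- strsplit(dat$sym1, "]", fixed = TRUE)
s2 <- strsplit(dat$sym2, "]", fixed = TRUE)

lengths(s1)                           # symbol counts per row for column 1
unlink(tmp)
```

The lists s1 and s2 can then be passed to the helper in place of the strsplit() results used for the made-up data.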

Now we can use crossproducts of these matrices to get the numerator, and outer with pmin to get the denominator. We can then compute the overlap and test whether it is > 0.6. Since this gives the whole symmetric matrix, the diagonal and the lower half are redundant, so they are set to FALSE. (There are ways of storing this kind of matrix more efficiently with the Matrix library, but I'm not sure how.) We then get the rows that have enough overlap with which.

num <- tcrossprod(m1)*tcrossprod(m2)
n12 <- n1*n2
den <- outer(n12, n12, pmin)
use <- num/den > 0.6
diag(use) <- FALSE
use[lower.tri(use)] <- FALSE
out <- which(use, arr.ind=TRUE)

> head(out)
     [,1] [,2]
[1,]   64   65
[2,]   27   69
[3,]   34   81
[4,]   26   82
[5,]    5   85
[6,]   21  115
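On the memory concern raised in the answer, one possible direction is to never build the dense denominator matrix and to look only at the nonzero entries of the sparse numerator. The sketch below is an assumption-laden illustration, not the answer's method: sets_to_sparse and overlap_pairs are hypothetical names, each cell is treated as a set (duplicates dropped), and the threshold is applied as >= 0.6 as in the question. Whether it fits in memory for 8,000,000 rows depends on how sparse the numerator really is.

```r
library(Matrix)

# Hypothetical helper: one row per list element, one column per distinct string
sets_to_sparse <- function(s) {
  s <- lapply(s, unique)                      # treat each cell as a set
  flat <- unlist(s)
  u <- unique(flat)
  sparseMatrix(i = rep(seq_along(s), lengths(s)), j = match(flat, u),
               x = rep(1, length(flat)), dims = c(length(s), length(u)))
}

# Hypothetical function: pairs (i < j) whose overlap reaches the threshold
overlap_pairs <- function(s1, s2, threshold = 0.6) {
  m1 <- sets_to_sparse(s1)
  m2 <- sets_to_sparse(s2)
  n12 <- rowSums(m1) * rowSums(m2)            # set-size product per row
  # Sparse numerator, restricted to the strict upper triangle (i < j)
  num <- triu(tcrossprod(m1) * tcrossprod(m2), k = 1)
  nz <- summary(num)                          # data frame with columns i, j, x
  keep <- nz$x / pmin(n12[nz$i], n12[nz$j]) >= threshold
  data.frame(i = nz$i[keep], j = nz$j[keep])
}

# Toy usage: rows 1 and 2 are identical, row 3 shares nothing with them
a <- list(c("a", "b", "c"), c("a", "b", "c"), c("x", "y"))
b <- list(c("p", "q"), c("p", "q"), c("r"))
res <- overlap_pairs(a, b)
res
```

Pairs with no shared symbol in either column never appear in the sparse product, so the denominator is computed only for candidate pairs rather than for all n^2 combinations.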
