Parallelize and speed up R code to read in many files


Problem description



I have code that works perfectly for my purpose (it reads some files with a specific pattern, reads the matrix within each file, and computes something from each file pair... the final output is a square matrix whose dimension equals the number of files) and looks like this:

m      <- 100
output <- matrix(0, m, m)

lista  <- list.files(pattern = "q")
listan <- as.matrix(lista)
n      <- nrow(listan)

for (i in 1:n) {
  AA    <- read.table(listan[i, ], header = FALSE)
  A     <- as.matrix(AA)
  dVarX <- sqrt(mean(A * A))

  for (j in i:n) {
    BB <- read.table(listan[j, ], header = FALSE)
    B  <- as.matrix(BB)
    V  <- sqrt(dVarX * sqrt(mean(B * B)))
    output[i, j] <- sqrt(mean(A * B)) / V
  }
}

My problem is that it takes a lot of time (I have about 5000 matrices, which means roughly 5000x5000 loop iterations). I would like to parallelize it, but I need some help! Waiting for your kind suggestions!

Thank you in advance!

Gab

Solution

The bottleneck is likely reading from disk. Running code in parallel isn't guaranteed to make things faster. In this case, multiple processes attempting to read from the same disk at the same time is likely to be even slower than a single process.

Since your matrices are being written by another R process, you really should save them in R's binary format. Your loops likely spend most of their time pulling matrices off disk, so the most effective way to speed the program up is to make those reads faster.
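If the 5000 text files already exist and you cannot change the process that writes them, a one-time conversion pass is enough to get the benefit. This is only a sketch, assuming your files match the "q" pattern from your question:

# run once: read each text matrix and cache it in R's binary format
txt_files <- list.files(pattern = "q")
for (f in txt_files) {
  M <- as.matrix(read.table(f, header = FALSE))
  saveRDS(M, paste0(f, ".rds"))
}

From then on, the pairwise loop can use readRDS exclusively.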

Here's an example that shows you how much faster it could be:

# make some random data and write it to disk
set.seed(21)
for(i in 0:9) {
  m <- matrix(runif(700*700), 700, 700)
  f <- paste0("f",i)
  write(m, f, 700)              # text format
  saveRDS(m, paste0(f,".rds"))  # binary format
}

# initialize two output objects
m <- 10
o1 <- o2 <- matrix(NA, m, m)

# get list of file names
files <- list.files(pattern="^f[[:digit:]]+$")
n <- length(files)

First, let's run your code using scan, which is already a lot faster than your current solution with read.table.

system.time({
  for (i in 1:n) {    
    A <- scan(files[i],quiet=TRUE)

    for (j in i:n) {
      B <- scan(files[j],quiet=TRUE)
      o1[i,j] <- sqrt(mean(A*B)) / sqrt(sqrt(mean(A*A)) * sqrt(mean(B*B)))
    }
  }
})
#    user  system elapsed 
#   31.37    0.78   32.58

Now, let's re-run that code using the files saved in R's binary format:

system.time({
  for (i in 1:n) {    
    fA <- paste0(files[i],".rds")
    A <- readRDS(fA)

    for (j in i:n) {
      fB <- paste0(files[j],".rds")
      B <- readRDS(fB)
      o2[i,j] <- sqrt(mean(A*B)) / sqrt(sqrt(mean(A*A)) * sqrt(mean(B*B)))
    }
  }
})
#    user  system elapsed 
#    2.42    0.39    2.92

So the binary format is ~10x faster! And the output is the same:

all.equal(o1,o2)
# [1] TRUE
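If you still want to try parallelizing on top of the binary format, a minimal sketch using the base parallel package would split the outer loop across cores. The mc.cores value and the o3 object below are illustrative, and the caveat above still applies: several workers hitting the same disk can end up slower than a single process, so benchmark on your own hardware. Note that mclapply relies on forking, which is not available on Windows; parLapply with a cluster would be needed there.

library(parallel)

rds_files <- paste0(files, ".rds")
rows <- mclapply(seq_len(n), function(i) {
  A  <- readRDS(rds_files[i])
  dA <- sqrt(mean(A * A))
  row <- rep(NA_real_, n)   # one row of the output matrix
  for (j in i:n) {
    B <- readRDS(rds_files[j])
    row[j] <- sqrt(mean(A * B)) / sqrt(dA * sqrt(mean(B * B)))
  }
  row
}, mc.cores = 4)            # choose a core count that suits your machine
o3 <- do.call(rbind, rows)

If everything lines up, all.equal(o2, o3) should confirm the parallel version produces the same result.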
