Parallel cosine distance using clusterApply in R


Question

I need to calculate the cosine similarity between a vector and a large matrix (more than 1,000,000 vectors, stored as columns in the code below). The code works correctly, but I do not see 100% utilization of my 8-core machine (nothing else is running on it), and the overall speedup over the serial version, "cosine(vecA, matB)", is quite low.

Is there a trick I am missing that would speed it up by at least 5-6 times, if not 8 times, using 8 cores? Thanks!

I have looked at other questions about parallel R but could not find an answer that explains what I am doing wrong.

library(parallel)
library(lsa)

cosine.par <- function(cl, vecA, matB){
  Blist <- lapply(c(1:ncol(matB)), function(ii)  as.vector(matB[,ii,drop=FALSE]))
  #print("Parallel Call")
  ans <- clusterApply(cl, Blist, cosine, vecA)
  do.call(rbind, ans)
}

k=500
vecA=c(1:k)
matB=matrix(rep(c(1:k),1000000), ncol=1000000)

nc <- detectCores()
cl <- makeCluster(rep("localhost", nc))

print(paste(format(Sys.time(),
                   "%a %b %d %X %Y %Z")))
cosine.par(cl, vecA, matB)
print(paste(format(Sys.time(),
                   "%a %b %d %X %Y %Z")))

stopCluster(cl)

Recommended answer

I think the problem is that you're executing a million tiny tasks, which can be extremely inefficient: every task sends a single column to a worker, so communication overhead outweighs the computation itself. In this case, you can use the parApply function, which hands each worker a chunk of columns instead:

cosine.par <- function(cl, vecA, matB) {
  r <- parApply(cl, matB, 2, cosine, vecA)
  dim(r) <- c(length(r), 1)
  r
}

This runs much faster for me than your original code, but you will still run into problems when the matrix becomes too big for your machine.
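
To make the "fewer, larger tasks" idea explicit, here is a minimal sketch that cuts the columns into one block per worker and computes each block with matrix algebra. It is not the code from the answer above: `cosine_block` and `cosine.chunked` are hypothetical names, and the sketch assumes the standard cosine formula written in base R can stand in for `lsa::cosine`.

```r
library(parallel)

# Hypothetical helper: cosine similarity of vecA against every column of a
# block, computed in one pass instead of one function call per column.
cosine_block <- function(block, vecA) {
  num <- as.vector(crossprod(block, vecA))            # dot product per column
  den <- sqrt(colSums(block^2)) * sqrt(sum(vecA^2))   # product of the norms
  num / den
}

# Hypothetical wrapper: one contiguous block of columns per cluster worker.
cosine.chunked <- function(cl, vecA, matB) {
  idx <- splitIndices(ncol(matB), length(cl))
  blocks <- lapply(idx, function(j) matB[, j, drop = FALSE])
  r <- unlist(parLapply(cl, blocks, cosine_block, vecA))
  matrix(r, ncol = 1)
}
```

It would be called as `cosine.chunked(cl, vecA, matB)` with the cluster created in the question. A column of all zeros yields NaN because its norm is zero.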

Since you're using a Mac, you could also try using mclapply:

cosine.mc <- function(nc, vecA, matB) {
  r <- unlist(mclapply(1:nc, function(i) {
    n <- ceiling(ncol(matB) / nc)
    j <- (n * (i - 1)) + 1
    k <- min(n * i, ncol(matB))
    apply(matB[,seq(j, k), drop=FALSE], 2, cosine, vecA)
  }, mc.cores=nc))
  dim(r) <- c(length(r), 1)
  r
}

Although this is quite efficient, I have run into the following error when operating on large matrices with mclapply:

Error in mcfork() : 
  unable to fork, possible reason: Cannot allocate memory

If you get this error, you will either have to use less memory, use fewer workers, or add more memory to your computer.
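
One way to apply the "fewer workers" advice is to cap the worker count explicitly. The sketch below is an illustration only: `cosine.mc.capped` and its `max.workers` argument are hypothetical, and it uses a base R cosine formula (assumed equivalent to `lsa::cosine` for plain numeric vectors) so that it runs without `lsa`.

```r
library(parallel)

# Hypothetical variant of cosine.mc with a caller-chosen cap on forked workers.
cosine.mc.capped <- function(vecA, matB, max.workers = 2L) {
  nc <- detectCores()
  if (is.na(nc)) nc <- 1L                      # detectCores() can return NA
  nw <- max(1L, min(nc, max.workers))
  # mclapply cannot fork on Windows and only accepts mc.cores = 1 there
  if (.Platform$OS.type == "windows") nw <- 1L

  normA <- sqrt(sum(vecA^2))
  idx <- splitIndices(ncol(matB), nw)          # one block of columns per worker
  r <- unlist(mclapply(idx, function(j) {
    block <- matB[, j, drop = FALSE]
    as.vector(crossprod(block, vecA)) / (sqrt(colSums(block^2)) * normA)
  }, mc.cores = nw))
  matrix(r, ncol = 1)
}
```

Lowering `max.workers` trades speed for a smaller number of forked processes; whether that is enough to avoid the error depends on how much memory the machine has free.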
