Parallel cosine distance using clusterApply in R
Question
I need to calculate the cosine similarity between a vector and a large matrix (more than 1,000,000 vectors, stored as columns in the code below). The code works correctly, but I do not see 100% utilization of my 8-core machine (nothing else is running on it), and the overall speed-up over the serial version, "cosine(vecA, matB)", is quite low.
Is there a trick I am missing that would speed it up by at least 5-6 times, if not 8 times, using 8 cores? Thanks!
I have looked at other questions about parallel R but could not find an answer that explains what I am doing wrong.
library(parallel)
library(lsa)

cosine.par <- function(cl, vecA, matB) {
  Blist <- lapply(c(1:ncol(matB)), function(ii) as.vector(matB[, ii, drop = FALSE]))
  #print("Parallel Call")
  ans <- clusterApply(cl, Blist, cosine, vecA)
  do.call(rbind, ans)
}

k <- 500
vecA <- c(1:k)
matB <- matrix(rep(c(1:k), 1000000), ncol = 1000000)

nc <- detectCores()
cl <- makeCluster(rep("localhost", nc))

print(paste(format(Sys.time(), "%a %b %d %X %Y %Z")))
cosine.par(cl, vecA, matB)
print(paste(format(Sys.time(), "%a %b %d %X %Y %Z")))

stopCluster(cl)
Recommended answer
I think the problem is that you're executing a million tiny tasks, which can be extremely inefficient: clusterApply sends each list element to a worker as a separate task, so the communication overhead dominates the tiny computation. In this case, you can use the parApply function instead:
cosine.par <- function(cl, vecA, matB) {
  r <- parApply(cl, matB, 2, cosine, vecA)
  dim(r) <- c(length(r), 1)
  r
}
This runs much faster for me than your original code, but you will still run into problems when the matrix becomes too big for your machine's memory.
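To make the "fewer, larger tasks" idea concrete, here is a minimal sketch that still uses clusterApply but sends one block of columns per worker instead of one column per task. It is an illustration under assumptions, not the answer's own code: cos_sim is a hypothetical base-R stand-in for lsa's cosine (so the sketch needs only the parallel package), and block_cosine and cosine.chunked are names invented for this example.

```r
library(parallel)

# Hypothetical base-R stand-in for lsa::cosine(x, y); an assumption made so
# that the sketch depends only on the parallel package.
cos_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Worker-side helper (hypothetical name): cosine of every column of `block`
# against the vector `v`, using the similarity function `f`.
block_cosine <- function(block, v, f) apply(block, 2, f, v)

# Split the columns into one block per worker and send each block as a single
# task, rather than one task per column.
cosine.chunked <- function(cl, vecA, matB) {
  idx <- splitIndices(ncol(matB), length(cl))
  blocks <- lapply(idx, function(j) matB[, j, drop = FALSE])
  ans <- clusterApply(cl, blocks, block_cosine, vecA, cos_sim)
  matrix(unlist(ans), ncol = 1)
}

# Small usage example: 2 workers, a 5 x 8 matrix.
cl <- makeCluster(2)
vecA <- c(1, 2, 3, 4, 5)
matB <- matrix(seq_len(40), nrow = 5)
res <- cosine.chunked(cl, vecA, matB)
stopCluster(cl)
```

With this layout each worker receives its data once and does a substantial amount of work per message, which is the same effect parApply achieves by splitting the matrix for you.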
Since you're using a Mac, you could also try mclapply, which forks the worker processes instead of starting a socket cluster:
cosine.mc <- function(nc, vecA, matB) {
  r <- unlist(mclapply(1:nc, function(i) {
    n <- ceiling(ncol(matB) / nc)
    j <- (n * (i - 1)) + 1
    k <- min(n * i, ncol(matB))
    apply(matB[, seq(j, k), drop = FALSE], 2, cosine, vecA)
  }, mc.cores = nc))
  dim(r) <- c(length(r), 1)
  r
}
Although this is quite efficient, I have run into the following error when using mclapply on large matrices:
Error in mcfork() :
unable to fork, possible reason: Cannot allocate memory
If you get this error, you will either have to use less memory, use fewer workers, or add more memory to your computer.
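As one way to act on the "fewer workers" option, the sketch below caps the number of forked processes with a max.workers argument. It is a rough illustration under assumptions: cosine.mc.capped and max.workers are hypothetical names, cos_sim is a base-R stand-in for lsa's cosine, and whether a lower cap actually avoids the fork error depends on your machine. Note that mclapply relies on forking, so more than one core is not supported on Windows.

```r
library(parallel)

# Hypothetical base-R stand-in for lsa::cosine(x, y).
cos_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Like a forked, block-wise cosine, but with the number of forked workers
# capped by `max.workers` (a hypothetical parameter for this sketch).
cosine.mc.capped <- function(vecA, matB, max.workers = 2) {
  # detectCores() may return NA, hence na.rm = TRUE.
  nc <- max(1, min(max.workers, detectCores(), ncol(matB), na.rm = TRUE))
  idx <- splitIndices(ncol(matB), nc)
  r <- unlist(mclapply(idx, function(j) {
    apply(matB[, j, drop = FALSE], 2, cos_sim, vecA)
  }, mc.cores = nc))
  matrix(r, ncol = 1)
}

# Small usage example: at most 2 workers on a 5 x 8 matrix.
vecA <- c(1, 2, 3, 4, 5)
matB <- matrix(seq_len(40), nrow = 5)
res <- cosine.mc.capped(vecA, matB, max.workers = 2)
```

Lowering max.workers trades speed for a smaller number of simultaneous child processes; setting it to 1 falls back to an ordinary serial run.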