Which is the best way to parallelize cosine distance?
Question
My R session crashes after the timeout is exceeded when I try to compute the cosine distance on a large dataset (~600,000 rows).
For small datasets my code works. Here is an example:
library(lsa)
relevant.data <- as.matrix(mtcars)
cosine(t(relevant.data))
I've read some posts on this website about parallelizing the cosine function, but had no luck.
Is there a very efficient method?
Would you suggest Rcpp, as in this post: "Parallel cosine distance using clusterapply in R"?
If computing something like a correlation matrix is inefficient, what do you suggest instead?
Recommended answer
Coding it in Rcpp might buy you enough speed that you don't need the extra hassle of parallelizing. An example follows, though I don't know how it will do on your system or with a real-sized problem: a vector of length 1e8 (equivalent to a 10,000 by 10,000 matrix) takes 763 MB, so even storing the results for a problem 60^2 times larger (about 2.75 TB, if I've calculated correctly) might be difficult.
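The storage estimate can be checked with a few lines of R. This is a minimal sketch that assumes a dense matrix of 8-byte doubles; `dense_matrix_bytes` is a hypothetical helper name, not part of any package. Depending on whether binary or decimal units are used, the full 600,000 by 600,000 result comes out at roughly 2.6 TiB or 2.9 TB, the same order of magnitude as the estimate above.

```r
# Bytes needed to hold a dense n-by-n matrix of doubles (8 bytes per entry).
# dense_matrix_bytes is a hypothetical helper used only for this illustration.
dense_matrix_bytes <- function(n) n^2 * 8

mib <- 2^20   # bytes per mebibyte
tib <- 2^40   # bytes per tebibyte

dense_matrix_bytes(1e4) / mib                      # 10,000 x 10,000: about 763 MiB
dense_matrix_bytes(6e5) / tib                      # 600,000 x 600,000: about 2.6 TiB
dense_matrix_bytes(6e5) / dense_matrix_bytes(1e4)  # ratio of the two: 3600 = 60^2
```

Either way, the full similarity matrix will not fit in memory on a typical machine, regardless of how fast each entry is computed.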
x <- as.matrix(mtcars)
library(lsa)
The function from lsa:
cosine(as.matrix(mtcars))
Slightly stripped-down R code:
cosR <- function(x) {
    co <- array(0, c(ncol(x), ncol(x)))
    ## f <- colnames(x)
    ## dimnames(co) <- list(f, f)
    for (i in 2:ncol(x)) {
        for (j in 1:(i - 1)) {
            co[i,j] <- crossprod(x[,i], x[,j])/
                sqrt(crossprod(x[,i]) * crossprod(x[,j]))
        }
    }
    co <- co + t(co)
    diag(co) <- 1
    return(as.matrix(co))
}
Rcpp version, slightly modified from here:
library(Rcpp)
library(RcppArmadillo)
cppFunction(depends='RcppArmadillo',
            code="NumericMatrix cosCpp(NumericMatrix Xr) {
    int n = Xr.nrow(), k = Xr.ncol();
    arma::mat X(Xr.begin(), n, k, false); // reuses memory and avoids extra copy
    arma::mat Y = arma::trans(X) * X; // matrix product
    arma::mat res = Y / (arma::sqrt(arma::diagvec(Y)) * arma::trans(arma::sqrt(arma::diagvec(Y))));
    return Rcpp::wrap(res);
}")
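The formula inside cosCpp (form the cross-product matrix, then divide each entry by the product of the two column norms) can also be written in plain R without the double loop. The following is only a sketch for comparison; `cosVec` is a hypothetical name and was not part of the benchmark in this answer, so no claim is made here about how its speed compares.

```r
# Base-R sketch of the same formula as cosCpp(): Y = X'X, then divide each
# entry Y[i, j] by ||x_i|| * ||x_j||. cosVec is a hypothetical name.
cosVec <- function(x) {
  y <- crossprod(x)        # t(x) %*% x: all pairwise column dot products
  norms <- sqrt(diag(y))   # Euclidean norm of each column
  y / tcrossprod(norms)    # elementwise division by the outer product of norms
}

x <- as.matrix(mtcars)
res <- cosVec(x)
dim(res)                   # one row and one column per variable in mtcars
```

Like cosCpp, this still builds the full result matrix, so it helps with speed but not with the memory needed to store the output.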
Test for equality:
identical(cosR(x),unname(cosine(x)))
all.equal(cosCpp(x),cosR(x))
library(microbenchmark)
microbenchmark(cosine(x),cosR(x),cosCpp(x))
## Unit: nanoseconds
##       expr    min      lq       mean  median      uq      max neval cld
##  cosine(x) 460046 1181837 2069604.51 1530719 2528021  8757989   100   b
##    cosR(x) 542414 1096448 1915011.12 1331277 2321596 11740233   100   b
##  cosCpp(x)      7   12472   35827.76   17999   30556   644551   100  a
Comparing medians, the Rcpp version is about 1331277/17999 = 74 times faster, and might (?) get you around memory issues as well.
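If the memory problem remains, one option is to never build the full matrix at all. The sketch below is an assumption about what might be useful, not something from the original answer: it compares rows (as `cosine(t(relevant.data))` in the question does), processes them in blocks, and keeps only the indices of the `k` most similar rows for each row. The names `top_k_cosine`, `k`, and `block_size` are hypothetical, and the code assumes no row is all zeros.

```r
# Hypothetical helper: for each row of x, return the indices of the k other
# rows with the highest cosine similarity. Only a block_size-by-n slice of
# the similarity matrix exists at any one time. Assumes no all-zero rows.
top_k_cosine <- function(x, k = 5, block_size = 1000) {
  xn <- x / sqrt(rowSums(x^2))                 # scale every row to unit length
  n <- nrow(xn)
  starts <- seq(1, n, by = block_size)
  blocks <- lapply(starts, function(s) {
    idx <- s:min(s + block_size - 1, n)
    # dot products of unit-length rows are cosine similarities
    sims <- tcrossprod(xn[idx, , drop = FALSE], xn)
    sims[cbind(seq_along(idx), idx)] <- -Inf   # exclude each row's self-match
    top <- lapply(seq_along(idx), function(i) {
      order(sims[i, ], decreasing = TRUE)[seq_len(k)]
    })
    matrix(unlist(top), ncol = k, byrow = TRUE)
  })
  do.call(rbind, blocks)                       # n-by-k matrix of row indices
}

x <- as.matrix(mtcars)
nn <- top_k_cosine(x, k = 3, block_size = 10)
dim(nn)
```

Each block is independent of the others, so the `lapply` over `starts` is the natural place to parallelize, for example with `parallel::mclapply` on systems that support forking. Whether that pays off depends on the number of cores and on how much memory each worker needs for its slice.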