Fast computation of > 10^6 cosine vector similarities in R

Question

I have a document-term matrix of ~1600 documents x ~120 words. I would like to compute the cosine similarity between all these vectors, but that means ~1,300,000 comparisons [n * (n - 1) / 2].
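As a quick check of the figure quoted above, the number of unordered document pairs can be computed directly. This is only a small sketch; the variable names n and n_pairs are chosen here for illustration.

```r
n <- 1600                      # approximate number of documents from the question
n_pairs <- n * (n - 1) / 2     # unordered pairs, excluding self-comparisons
n_pairs                        # 1279200, i.e. roughly 1.3 million
choose(n, 2)                   # the same count via the binomial coefficient
```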

I used parallel::mclapply with 8 cores, but it still takes forever.

Which other solution do you suggest?

Thanks

Recommended answer

Here's my take.

If I define cosine similarity as

coss <- function(x) {crossprod(x)/(sqrt(tcrossprod(colSums(x^2))))}
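Note that crossprod(x) computes t(x) %*% x, so coss() returns the similarities between the columns of x. A documents x words matrix therefore has to be transposed first. The following is a minimal sanity check on toy data, assuming a naive pairwise helper (cos_pair) and a small made-up matrix (dtm), neither of which is part of the original answer.

```r
# coss() as defined above: cosine similarity between the columns of x
coss <- function(x) {crossprod(x)/(sqrt(tcrossprod(colSums(x^2))))}

# Hypothetical reference implementation for a single pair of vectors
cos_pair <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

set.seed(1)
# Toy document-term matrix: 6 documents (rows) x 4 words (columns);
# the +1 keeps every count positive so no document is an all-zero vector
dtm <- matrix(rpois(6 * 4, lambda = 2) + 1, nrow = 6)

# Documents must be columns, hence the transpose
sim <- coss(t(dtm))

# Naive pairwise computation over all document pairs, for comparison
idx <- seq_len(nrow(dtm))
ref <- outer(idx, idx, Vectorize(function(i, j) cos_pair(dtm[i, ], dtm[j, ])))

all.equal(sim, ref, check.attributes = FALSE)
```

The result is a 6 x 6 symmetric matrix with ones on the diagonal, one row and column per document.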

(I think that is about as fast as I can make it with base R functions and the often overlooked crossprod, which is a little gem.) If I compare it with an Rcpp function using RcppArmadillo (slightly updated as suggested by @f-privé)

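#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;

// Cosine similarity between the columns of x.
// [[Rcpp::export]]
NumericMatrix cosine_similarity(NumericMatrix x) {
  // Wrap R's memory in an Armadillo matrix without copying it
  arma::mat X(x.begin(), x.nrow(), x.ncol(), false);

  // Compute the crossprod: res(i, j) is the dot product of columns i and j
  arma::mat res = X.t() * X;
  int n = x.ncol();
  arma::vec diag(n);
  int i, j;

  // The column norms are the square roots of the diagonal entries
  for (i = 0; i < n; i++) {
    diag(i) = sqrt(res(i, i));
  }

  // Divide each dot product by the product of the two column norms
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      res(i, j) /= diag(i) * diag(j);

  return(wrap(res));
}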

(This could probably be optimised further with some of the specialised functions in the Armadillo library; I just wanted to get some timing measurements.)

Comparing these yields

> XX <- matrix(rnorm(120*1600), ncol=1600)
> microbenchmark::microbenchmark(cosine_similarity(XX), coss(XX), coss2(XX), times=50)
> microbenchmark::microbenchmark(coss(x), coss2(x), cosine_similarity(x), cosine_similarity2(x), coss3(x), times=50)
Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval cld
               coss(x) 173.0975 183.0606 192.8333 187.6082 193.2885 331.9206    50   a
              coss2(x) 162.4193 171.3178 183.7533 178.8296 184.9762 319.7934    50   b
 cosine_similarity2(x) 169.6075 175.5601 191.4402 181.3405 186.4769 319.8792    50   a

which is really not that bad. The gain from computing the cosine similarity in C++ is very small (with @f-privé's solution being fastest), so I'm guessing your timing issues come from whatever you are doing to convert the text from words to numbers, not from calculating the cosine similarity. Without knowing more about your specific code it is hard for us to help you.
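One way to see where the time actually goes is to time the similarity step in isolation on data of the same size. The sketch below assumes random data in place of the real document-term matrix; the seed and the names timing, S and pairwise are illustrative only. It also shows how the unique pairwise values can be pulled out of the full matrix.

```r
# coss() as defined in the answer: cosine similarity between the columns of x
coss <- function(x) {crossprod(x)/(sqrt(tcrossprod(colSums(x^2))))}

set.seed(42)                                   # arbitrary seed, for reproducibility only
XX <- matrix(rnorm(120 * 1600), ncol = 1600)   # 120 words x 1600 documents, as in the benchmark

# Time only the similarity computation
timing <- system.time(S <- coss(XX))
print(timing["elapsed"])

# Each unordered pair of documents appears exactly once above the diagonal
pairwise <- S[upper.tri(S)]
length(pairwise)                               # 1600 * 1599 / 2 values
```

If this step finishes quickly on your machine while the full pipeline does not, the bottleneck is most likely in the preprocessing that builds the matrix.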
