如何有效地计算文档流中文档之间的相似度 [英] How to efficiently compute similarity between documents in a stream of documents

查看:86
本文介绍了如何有效地计算文档流中文档之间的相似度的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我收集了Text文档(在Node.js中),其中一个文档i表示为单词列表. 考虑到新文档是作为一种文档流出现的,计算这些文档之间相似度的有效方法是什么?

I gather Text documents (in Node.js) where one document i is represented as a list of words. What is an efficient way to compute the similarity between these documents, taking into account that new documents are coming as a sort of stream of documents?

我目前在每个文档中单词的归一化频率上使用cos相似度.由于可伸缩性问题,我不使用TF-IDF(术语频率,反文档频率),因为我收到的文档越来越多.

I currently use cos-similarity on the Normalized Frequency of the words within each document. I don't use the TF-IDF (Term frequency, Inverse document frequency) because of the scalability issue since I get more and more documents.

我的第一个版本是从当前可用的文档开始,计算一个大的Term-Document矩阵A,然后计算S = A^T x A,以使S(i, j)为(在通过norm(doc(i))norm(doc(j))归一化之后)单词频率分别为doc(i)doc(j)的文档ij之间的cos相似性.

My first version was to start with the currently available documents, compute a big Term-Document matrix A, and then compute S = A^T x A so that S(i, j) is (after normalization by both norm(doc(i)) and norm(doc(j))) the cos-similarity between documents i and j whose word frequencies are respectively doc(i) and doc(j).

获得新文档doc(k)时该怎么办?好吧,我必须计算该文档与之前所有文档的相似度,而无需构建整个矩阵.我可以将doc(k) dot doc(j)的内积用于所有以前的j,结果得到S(k, j),这很棒.

What do I do when I get a new document doc(k)? Well, I have to compute the similarity of this document with all the previous ones, which doesn't require to build a whole matrix. I can just take the inner-product of doc(k) dot doc(j) for all previous j, and that result in S(k, j), which is great.

  1. 在Node.js中计算S真的很长.其实太久了!因此,我决定创建一个C ++模块,该模块可以更快地完成整个过程.确实如此!但是我等不及了,我应该能够使用中间结果.而我的意思是不等它"

  1. Computing S in Node.js is really long. Way too long in fact! So I decided to create a C++ module which would do the whole thing much faster. And it does! But I cannot wait for it, I should be able to use intermediate results. And what I mean by "not wait for it" is both

a.等待计算完成,而且
b.等待矩阵A生成(这是一个很大的).

a. wait for the computation to be done, but also
b. wait for the matrix A to be built (it's a big one).

计算新的S(k, j)可以利用以下事实:文档的单词比所有给定单词的集合(我用来构建整个矩阵A)的单词少.因此,在Node.js中执行此操作看起来更快,从而避免了占用大量额外资源来访问数据.

Computing new S(k, j) can take advantage of the fact that documents have way less words than the set of all the given words (which I use to build the whole matrix A). Thus, it looks faster to do it in Node.js, avoiding a lot of extra-resource to be taken to access the data.

但是还有什么更好的方法吗?

But is there any better way to do that?

注意:我开始计算S的原因是,我可以在可以访问所有数据的Node.js中轻松构建A,然后在C ++中进行矩阵乘法并将其放回Node.js中,这可以大大加快整个过程.但是,现在计算S变得不可行了,它看起来毫无用处.

Note : the reason I started computing S is that I can easily build A in Node.js where I have access to all the data, and then do the matrix multiplication in C++ and get it back in Node.js, which speeds the whole thing a lot. But now that computing S gets impracticable, it looks useless.

注释2 :是的,我不必计算整个S,我只可以计算右上角的元素(或左下角的元素),但这不是问题.时间计算问题不是这个顺序.

Note 2 : yep, I don't have to compute the whole S, I can just compute the upper-right elements (or the lower-left ones), but that's not the issue. The time computation issue is not of that order.

推荐答案

如果今天必须解决它,只需使用来自fasttext或word2vec的预训练单词向量

If one has to solve it today, just use pre-trained word vectors from fasttext or word2vec

这篇关于如何有效地计算文档流中文档之间的相似度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆