Re-calculate similarity matrix given new documents


Question

I'm running an experiment that involves text documents, and I need to calculate the (cosine) similarity matrix between all of them (to use in another calculation). For that I use sklearn's TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [doc1, doc2, doc3, doc4]
vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
tfidf = vect.fit_transform(corpus)            # l2-normalized tf vectors
similarities = tfidf * tfidf.T                # sparse cosine-similarity matrix
pairwise_similarity_matrix = similarities.A   # dense numpy array
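As a side note, here is a minimal sketch of one way such a dense matrix might be consumed (a hypothetical follow-up, not part of the original question): finding, for each document, the most similar other document.

import numpy as np

# Mask self-similarity (the diagonal is 1.0 for l2-normalized vectors),
# then take each row's argmax to get the index of the nearest other document.
sim = pairwise_similarity_matrix.copy()
np.fill_diagonal(sim, -1.0)
most_similar_doc_index = sim.argmax(axis=1)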

The problem is that with each iteration of my experiment I discover new documents that I need to add to my similarity matrix, and given the number of documents I'm working with (tens of thousands and more), this is very time consuming.

I want to find a way to calculate only the similarities between the new batch of documents and the existing ones, without recomputing everything over the entire data set.

Note that I'm using a term-frequency (tf) representation, without inverse document frequency (idf), so in theory I don't need to re-calculate the whole matrix each time.

Answer

OK, I got it. The idea is, as I said, to calculate the similarity only between the new batch of documents and the existing ones, whose pairwise similarities are unchanged. The tricky part is keeping the TfidfVectorizer's vocabulary updated with the newly seen terms.

The solution has two steps:

  1. Update the vocabulary and the tf matrix.
  2. Matrix multiplication and stacking.

Here's the whole script - we first have the original corpus and the fitted objects and matrices:

import numpy as np
from scipy.sparse import csr_matrix, hstack, vstack
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [doc1, doc2, doc3]
# Build for the first time:
vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
tf_matrix = vect.fit_transform(corpus)
similarities = tf_matrix * tf_matrix.T
similarities_matrix = similarities.A  # just for printing

Now, given the new documents:

new_docs_corpus = [docx, docy, docz]  # New documents
# Build a new vectorizer to parse the vocabulary of the new documents:
new_vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
new_vect.fit(new_docs_corpus)

# Merge the old and new vocabularies:
new_terms_count = 0
for k, v in new_vect.vocabulary_.items():
    if k in vect.vocabulary_:
        continue
    vect.vocabulary_[k] = np.int64(len(vect.vocabulary_))  # important not to assign a plain int
    new_terms_count = new_terms_count + 1
new_vect.vocabulary_ = vect.vocabulary_

# Build the new documents' representation using the merged vocabulary:
new_tf_matrix = new_vect.transform(new_docs_corpus)
new_similarities = new_tf_matrix * new_tf_matrix.T

# Pad the old tf-matrix to the same dimensions:
if new_terms_count:
    zero_matrix = csr_matrix((tf_matrix.shape[0], new_terms_count))
    tf_matrix = hstack([tf_matrix, zero_matrix])
# tf_matrix = vect.transform(corpus) # Instead, we just append 0's for the new terms and stack the old tf_matrix over the new one, to save time
cross_similarities = new_tf_matrix * tf_matrix.T  # Similarities between the new and the existing documents
tf_matrix = vstack([tf_matrix, new_tf_matrix])
# Stack it all together:
similarities = vstack([hstack([similarities, cross_similarities.T]),
                       hstack([cross_similarities, new_similarities])])
similarities_matrix = similarities.A

# Update the corpus with the new documents:
corpus = corpus + new_docs_corpus

We can check this by comparing the similarities_matrix we obtained with the one we get when we fit a TfidfVectorizer on the joint corpus: corpus + new_docs_corpus.
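A minimal sketch of that sanity check (assuming the variables from the script above, including the updated corpus, are still in scope):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Rebuild everything from scratch on the joint corpus (corpus already
# includes new_docs_corpus at this point) and compare the two results.
check_vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
check_tf = check_vect.fit_transform(corpus)
check_similarities_matrix = (check_tf * check_tf.T).A

assert np.allclose(check_similarities_matrix, similarities_matrix)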

As discussed in the comments, we can do all this only because we are not using the idf (inverse document frequency) component, which would change the representations of the existing documents whenever new documents are added.
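To illustrate that last point, here is a small sketch with a made-up toy corpus (the apple/banana/cherry/durian documents are purely hypothetical): with use_idf=True the idf weights are recomputed from the whole corpus, so the vectors of the old documents change as soon as new documents are added, while with use_idf=False each row depends only on its own document.

from sklearn.feature_extraction.text import TfidfVectorizer

old_docs = ["apple banana", "banana cherry"]   # hypothetical toy documents
all_docs = old_docs + ["apple durian"]         # the same documents plus one new one

# With idf enabled, the idf_ weights are a function of the whole corpus,
# so refitting after new documents arrive changes the vectors of the
# *old* documents too, invalidating any cached similarity matrix.
tfidf = TfidfVectorizer(use_idf=True)
tfidf.fit(old_docs)
print(tfidf.idf_)   # idf weights fitted on the old corpus
tfidf.fit(all_docs)
print(tfidf.idf_)   # different values (and more terms) on the joint corpus

# With use_idf=False, a document's row is just its l2-normalized term
# counts, which depend only on that document, so the old rows stay valid.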
