Python tf-idf: fast way to update the tf-idf matrix
Question
I have a dataset of several thousand rows of text. My goal is to calculate the tf-idf scores and then the cosine similarity between documents, which I did using gensim in Python, following this tutorial:
from gensim import corpora, models, similarities

dictionary = corpora.Dictionary(dat)                 # dat: iterable of tokenized documents
corpus = [dictionary.doc2bow(text) for text in dat]  # bag-of-words vectors
tfidf = models.TfidfModel(corpus)                    # fit idf weights on the corpus
corpus_tfidf = tfidf[corpus]                         # tf-idf-weighted corpus
index = similarities.MatrixSimilarity(corpus_tfidf)
Let's say we have the tf-idf matrix and the similarity index built. When a new document comes in, I want to query for its most similar document in our existing dataset.
Question: is there any way to update the tf-idf matrix so that I don't have to append the new text document to the original dataset and recalculate the whole thing again?
Answer
Since there are no other answers, I'll post my solution. Let's say we are in the following scenario:
import gensim
from gensim import models
from gensim import corpora
from gensim import similarities
from nltk.tokenize import word_tokenize
import pandas as pd
# routines:
text = "I work on natural language processing and I want to figure out how does gensim work"
text2 = "I love computer science and I code in Python"
dat = pd.Series([text,text2])
dat = dat.apply(lambda x: str(x).lower())
dat = dat.apply(lambda x: word_tokenize(x))
dictionary = corpora.Dictionary(dat)
corpus = [dictionary.doc2bow(doc) for doc in dat]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
#Query:
query_text = "I love icecream and gensim"
query_text = query_text.lower()
query_text = word_tokenize(query_text)
vec_bow = dictionary.doc2bow(query_text)
vec_tfidf = tfidf[vec_bow]
If we look at:
print(vec_bow)
[(0, 1), (7, 1), (12, 1), (15, 1)]
and:
print(tfidf[vec_bow])
[(12, 0.7071067811865475), (15, 0.7071067811865475)]
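Those two weights follow directly from gensim's defaults: with only two documents, words that occur in both ("i", "and") get idf = log2(2/2) = 0 and drop out of the vector, while "gensim" and "love" each get idf = log2(2/1) = 1; L2-normalizing the surviving [1, 1] vector then gives 1/√2 per term. A quick check of the arithmetic:

```python
import math

# idf with gensim's default global weighting: log2(total_docs / doc_freq)
idf_common = math.log2(2 / 2)  # "i", "and" -> 0, dropped from the vector
idf_rare = math.log2(2 / 1)    # "gensim", "love" -> 1

# L2-normalize the surviving [1, 1] vector
weight = idf_rare / math.sqrt(idf_rare**2 + idf_rare**2)
print(weight)  # ≈ 0.7071067811865475
```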
FYI, the ids and tokens in the dictionary:
print(dictionary.items())
[(0, u'and'),
(1, u'on'),
(8, u'processing'),
(3, u'natural'),
(4, u'figure'),
(5, u'language'),
(9, u'how'),
(7, u'i'),
(14, u'code'),
(19, u'in'),
(2, u'work'),
(16, u'python'),
(6, u'to'),
(10, u'does'),
(11, u'want'),
(17, u'science'),
(15, u'love'),
(18, u'computer'),
(12, u'gensim'),
(13, u'out')]
It looks like the query only picks up existing terms and uses the pre-calculated weights to give the tf-idf score, while unseen terms (like "icecream") are simply dropped. So my workaround is to rebuild the model weekly or daily, since it is fast to do.