Use sklearn to find string similarity between two texts with a large group of documents

Problem Description

Given a large set of documents (book titles, for example), how can two book titles that are not in the original set be compared, without recomputing the entire TF-IDF matrix?

For example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

book_titles = ["The blue eagle has landed",
         "I will fly the eagle to the moon",
         "This is not how You should fly",
         "Fly me to the moon and let me sing among the stars",
         "How can I fly like an eagle",
         "Fixing cars and repairing stuff",
         "And a bottle of rum"]

vectorizer = TfidfVectorizer(stop_words='english', norm='l2', sublinear_tf=True)
tfidf_matrix = vectorizer.fit_transform(book_titles) 

To check the similarity between the first and the second book titles, one would do

cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])

and so on. This takes into account that the TF-IDF is computed with respect to all entries in the matrix, so each token's weight reflects how often it appears across the whole corpus (rare tokens get a higher IDF weight than common ones).
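To make these corpus-level weights visible, one can inspect the fitted vectorizer's `idf_` and `vocabulary_` attributes (both are standard attributes of a fitted `TfidfVectorizer`). A small self-contained sketch using the same corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

book_titles = ["The blue eagle has landed",
               "I will fly the eagle to the moon",
               "This is not how You should fly",
               "Fly me to the moon and let me sing among the stars",
               "How can I fly like an eagle",
               "Fixing cars and repairing stuff",
               "And a bottle of rum"]

vectorizer = TfidfVectorizer(stop_words='english', norm='l2', sublinear_tf=True)
vectorizer.fit(book_titles)

vocab = vectorizer.vocabulary_   # token -> column index in the matrix
idf = vectorizer.idf_            # one IDF weight per token, corpus-wide
# "eagle" occurs in three titles but "rum" in only one,
# so "rum" carries the higher IDF weight
print(idf[vocab['rum']] > idf[vocab['eagle']])  # → True
```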

Let's say now that two titles, title1 and title2, should be compared, neither of which is in the original set of book titles. The two titles can be appended to the book_titles collection and compared afterwards, so that the word "rum", for example, is counted together with its occurrence in the previous corpus:

title1 = "The book of rum"
title2 = "Fly safely with a bottle of rum"
book_titles.extend([title1, title2])     # list.append() takes a single item
tfidf_matrix = vectorizer.fit_transform(book_titles)
index = tfidf_matrix.shape[0]            # shape is an attribute, not a method
# the two new titles are the last two rows of the matrix
cosine_similarity(tfidf_matrix[index-2:index-1], tfidf_matrix[index-1:index])

This is really impractical and very slow if the documents grow very large or need to be stored out of memory. What can be done in this case? If I compare only title1 and title2 directly, the previous corpus will not be used.

Solution

Why do you append them to the list and recompute everything? Just do

new_vectors = vectorizer.transform([title1, title2])
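Putting it together: fit the vectorizer once on the corpus, then `transform()` any new titles against the frozen vocabulary and IDF weights, with no refitting. A self-contained sketch (same corpus and titles as in the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

book_titles = ["The blue eagle has landed",
               "I will fly the eagle to the moon",
               "This is not how You should fly",
               "Fly me to the moon and let me sing among the stars",
               "How can I fly like an eagle",
               "Fixing cars and repairing stuff",
               "And a bottle of rum"]

vectorizer = TfidfVectorizer(stop_words='english', norm='l2', sublinear_tf=True)
vectorizer.fit(book_titles)  # fit once; the IDF weights are now fixed

title1 = "The book of rum"
title2 = "Fly safely with a bottle of rum"

# transform() reuses the fitted vocabulary and IDF weights -- no refit,
# no appending to the corpus; tokens unseen during fit are simply ignored
new_vectors = vectorizer.transform([title1, title2])
similarity = cosine_similarity(new_vectors[0:1], new_vectors[1:2])
print(similarity[0, 0])  # nonzero: both titles share the token "rum"
```

Note the trade-off: `transform()` drops tokens that were not in the fitted vocabulary (here "book" and "safely"), so if brand-new words should influence the weights, the vectorizer has to be refit on a larger corpus up front.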
