How to compute the similarity between two text documents?

Question

I am looking at working on an NLP project, in any programming language (though Python will be my preference).

I want to take two documents and determine how similar they are.

Recommended answer

The common way of doing this is to transform the documents into TF-IDF vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this. See especially Introduction to Information Retrieval, which is freely available online.
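To make the two steps concrete, here is a minimal standard-library sketch of the idea. It assumes whitespace tokenization and a plain tf * log(N / df) weighting, which is a simplification and not the exact formula used by scikit-learn or Gensim; the helper names tfidf_vectors and cosine_similarity are made up for this illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (plain tf * idf weighting)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a vector with no weight is treated as similar to nothing
    return dot / (norm_u * norm_v)


docs = ["the cat sat on the mat", "the cat ate the fish", "stock prices fell sharply"]
vecs = tfidf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))  # positive: the documents share terms
print(cosine_similarity(vecs[0], vecs[2]))  # zero: no terms in common
```

Terms that occur in every document get a weight of zero under this weighting, which is why library implementations usually smooth the IDF term.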

TF-IDF (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is as easy as

from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

# text_files is a list of paths; the vectorizer expects strings, so read each file
documents = [Path(f).read_text() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

or, if the documents are plain strings,

>>> corpus = ["I'd like an apple",
...           "An apple a day keeps the doctor away",
...           "Never compare an apple to an orange",
...           "I prefer scikit-learn to Orange",
...           "The scikit-learn docs are Orange and Blue"]
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")
>>> tfidf = vect.fit_transform(corpus)
>>> pairwise_similarity = tfidf * tfidf.T

though Gensim may have more options for this kind of task.

See also this question.

[Disclaimer: I was involved in the scikit-learn TF-IDF implementation.]

From above, pairwise_similarity is a SciPy sparse matrix that is square in shape, with the number of rows and columns equal to the number of documents in the corpus.

>>> pairwise_similarity
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

You can convert the sparse matrix to a NumPy array via .toarray() or .A:

>>> pairwise_similarity.toarray()
array([[1.        , 0.17668795, 0.27056873, 0.        , 0.        ],
       [0.17668795, 1.        , 0.15439436, 0.        , 0.        ],
       [0.27056873, 0.15439436, 1.        , 0.19635649, 0.16815247],
       [0.        , 0.        , 0.19635649, 1.        , 0.54499756],
       [0.        , 0.        , 0.16815247, 0.54499756, 1.        ]])
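The ones on the diagonal reflect the earlier comment that the vectorizer returns normalized TF-IDF rows. As a rough check, assuming the default norm="l2" setting of TfidfVectorizer, the following sketch confirms that each row has unit length and that the plain dot product agrees with scikit-learn's cosine_similarity helper:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

# Euclidean length of every row; expected to be 1 under L2 normalization
row_norms = np.sqrt(tfidf.multiply(tfidf).sum(axis=1))
rows_are_unit_length = np.allclose(row_norms, 1.0)

# Dot products of unit vectors should match an explicit cosine similarity
dot_products = (tfidf @ tfidf.T).toarray()
matches_cosine = np.allclose(dot_products, cosine_similarity(tfidf))

print(rows_are_unit_length, matches_cosine)
```

If the vectorizer is created with norm=None, the dot product is no longer a cosine similarity, and calling cosine_similarity explicitly is the safer route.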

Let's say we want to find the document most similar to the final document, "The scikit-learn docs are Orange and Blue". This document has index 4 in corpus. You can find the index of the most similar document by taking the argmax of that row, but first you'll need to mask the 1s on the diagonal, which represent the similarity of each document to itself. You can do the latter through np.fill_diagonal(), and the former through np.nanargmax():

>>> import numpy as np

>>> arr = pairwise_similarity.toarray()
>>> np.fill_diagonal(arr, np.nan)

>>> input_doc = "The scikit-learn docs are Orange and Blue"
>>> input_idx = corpus.index(input_doc)
>>> input_idx
4

>>> result_idx = np.nanargmax(arr[input_idx])
>>> corpus[result_idx]
'I prefer scikit-learn to Orange'

Note: the purpose of using a sparse matrix is to save a substantial amount of space for a large corpus and vocabulary. Instead of converting to a NumPy array, you could do:

>>> n, _ = pairwise_similarity.shape
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()
3
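If you want a ranked list instead of a single best match, one possible approach is sketched below. The most_similar helper and its top_k parameter are hypothetical names for this illustration, not part of scikit-learn; it computes only the one row of the similarity matrix that the query needs, masks the document itself, and sorts the remaining scores.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def most_similar(corpus, input_idx, top_k=2):
    """Return (index, score) pairs for the top_k documents closest to corpus[input_idx]."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
    # Only one row of the pairwise matrix is needed for a single query document
    scores = (tfidf[input_idx] @ tfidf.T).toarray().ravel()
    scores[input_idx] = -1.0  # exclude the query document itself
    ranked = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in ranked]


corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]
results = most_similar(corpus, input_idx=4)
for idx, score in results:
    print(f"{score:.3f}  {corpus[idx]}")
```

For a large corpus, fitting the vectorizer once and reusing the resulting matrix across queries would avoid repeating the fit on every call.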
