Using sklearn how do I calculate the tf-idf cosine similarity between documents and a query?


Problem Description

My goal is to input 3 queries and find out which query is most similar to a set of 5 documents.

So far I have calculated the tf-idf of the documents doing the following:

from sklearn.feature_extraction.text import TfidfVectorizer

def get_term_frequency_inverse_data_frequency(documents):
    allDocs = []
    for document in documents:
        allDocs.append(nlp.clean_tf_idf_text(document))
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(allDocs)
    return matrix

def get_tf_idf_query_similarity(documents, query):
    tfidf = get_term_frequency_inverse_data_frequency(documents)

The problem I am having is now that I have tf-idf of the documents what operations do I perform on the query so I can find the cosine similarity to the documents?

Recommended Answer

Here is my suggestion:

  • We don't have to fit the model twice; we can reuse the same vectorizer.
  • The text-cleaning function can be plugged directly into TfidfVectorizer via the preprocessor parameter.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer(preprocessor=nlp.clean_tf_idf_text)
docs_tfidf = vectorizer.fit_transform(allDocs)

def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    """
    vectorizer: TfIdfVectorizer model
    docs_tfidf: tfidf vectors for all docs
    query: query doc

    return: cosine similarity between query and all docs
    """
    query_tfidf = vectorizer.transform([query])
    cosineSimilarities = cosine_similarity(query_tfidf, docs_tfidf).flatten()
    return cosineSimilarities
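
For completeness, here is a minimal usage sketch of the function above, reusing get_tf_idf_query_similarity exactly as defined. The five documents and three queries are made-up placeholders, and str.lower stands in for nlp.clean_tf_idf_text, whose implementation is not shown in the question.

from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up placeholder corpus and queries -- substitute your own 5 documents and 3 queries.
docs = [
    "the cat sat on the mat",
    "dogs are loyal companions",
    "the stock market fell sharply today",
    "cats and dogs are common pets",
    "investors worry about market volatility",
]
queries = ["pets like cats and dogs", "market crash", "a cat sitting on a mat"]

# str.lower stands in for nlp.clean_tf_idf_text here, since that helper is not shown.
vectorizer = TfidfVectorizer(preprocessor=str.lower)
docs_tfidf = vectorizer.fit_transform(docs)

for query in queries:
    similarities = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query)
    best = similarities.argmax()
    print(f"{query!r} is most similar to document {best} (score {similarities[best]:.3f})")

Note that the query goes through transform rather than fit_transform, so it is scored against the vocabulary and idf weights learned from the documents; fitting a second vectorizer on the query alone would produce vectors that are not comparable to the document matrix.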

