TF*IDF for Search Queries

Question

Okay, so I have been following these two posts on TF*IDF but am a little confused: http://css.dzone.com/articles/machine-learning-text-feature

Basically, I want to create a search query that searches through multiple documents. I would like to use the scikit-learn toolkit as well as the NLTK library for Python.

The problem is that I don't see where the two TF*IDF vectors come from. I need one search query and multiple documents to search. I figured that I would calculate the TF*IDF score of each document against the query, find the cosine similarity between them, and then rank the documents by sorting the scores in descending order. However, the code doesn't seem to come up with the right vectors.
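(To make that plan concrete, here is a minimal sketch of the ranking step I have in mind; the tf-idf values are made up purely for illustration and are not produced by my actual code:)

import numpy as np

# Made-up tf-idf vectors: one row per document, plus a query vector
# over the same 4-term vocabulary (values are illustrative only).
doc_vectors = np.array([[0.0, 0.7, 0.7, 0.0],
                        [0.6, 0.0, 0.0, 0.8]])
query_vector = np.array([0.5, 0.5, 0.0, 0.7])

def cosine_sim(a, b):
    # Cosine similarity = dot product divided by the product of the L2 norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine_sim(doc, query_vector) for doc in doc_vectors]
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # document indices, best match first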

Whenever I reduce the query to only one search, it returns a huge list of 0's, which is really strange.

Here is the code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords

train_set = ("The sky is blue.", "The sun is bright.") #Documents
test_set = ("The sun in the sky is bright.") #Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
transformer = TfidfTransformer()

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray

transformer.fit(trainVectorizerArray)
print transformer.transform(trainVectorizerArray).toarray()

transformer.fit(testVectorizerArray)

tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()

Answer

You're defining train_set and test_set as tuples, but I think that they should be lists. (In fact, test_set = ("The sun in the sky is bright.") has no trailing comma, so it is just a string; the vectorizer then treats each character as a separate document, which is why you see that long list of zeros.)

train_set = ["The sky is blue.", "The sun is bright."] #Documents
test_set = ["The sun in the sky is bright."] #Query

Using this, the code seems to run fine.
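For completeness, a sketch of the full pipeline the question is after (ranking the documents by cosine similarity against the query) could look like the following. It swaps the CountVectorizer + TfidfTransformer pair for TfidfVectorizer and uses sklearn's cosine_similarity helper, so it is an equivalent shortcut rather than the original code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords

train_set = ["The sky is blue.", "The sun is bright."]   # documents
test_set = ["The sun in the sky is bright."]             # query

stop_words = stopwords.words('english')

# Fit the tf-idf model on the documents, then project the query into the
# same vector space so both matrices share one vocabulary.
vectorizer = TfidfVectorizer(stop_words=stop_words)
doc_matrix = vectorizer.fit_transform(train_set)
query_matrix = vectorizer.transform(test_set)

# Cosine similarity of the query against every document; higher = better match.
scores = cosine_similarity(query_matrix, doc_matrix).flatten()
ranking = scores.argsort()[::-1]
print(ranking, scores[ranking])

With these two example documents, the second one ("The sun is bright.") should come out as the closer match, since it shares two terms with the query after stopword removal.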
