TF*IDF for Search Queries

Question

Okay, so I have been following these two posts on TF*IDF but am a little confused: http://css.dzone.com/articles/machine-learning-text-feature

Basically, I want to create a search query that searches through multiple documents. I would like to use the scikit-learn toolkit as well as the NLTK library for Python.

The problem is that I don't see where the two TF*IDF vectors come from. I need one search query and multiple documents to search. I figured that I would calculate the TF*IDF score of each document against the query, find the cosine similarity between them, and then rank the documents by sorting the scores in descending order. However, the code doesn't seem to come up with the right vectors.
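(To make that plan concrete, here is a minimal sketch of the ranking step I have in mind; the tf-idf values are made up purely for illustration and are not produced by my actual code:)

import numpy as np

# Made-up tf-idf vectors: one row per document, plus a query vector
# over the same 4-term vocabulary (values are illustrative only).
doc_vectors = np.array([[0.0, 0.7, 0.7, 0.0],
                        [0.6, 0.0, 0.0, 0.8]])
query_vector = np.array([0.5, 0.5, 0.0, 0.7])

def cosine_sim(a, b):
    # Cosine similarity = dot product divided by the product of the L2 norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine_sim(doc, query_vector) for doc in doc_vectors]
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # document indices, best match first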

Whenever I reduce the query to only one search, it returns a huge list of 0's, which is really strange.

Here is the code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords

train_set = ("The sky is blue.", "The sun is bright.") #Documents
test_set = ("The sun in the sky is bright.") #Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
transformer = TfidfTransformer()

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray

transformer.fit(trainVectorizerArray)
print transformer.transform(trainVectorizerArray).toarray()

transformer.fit(testVectorizerArray)

tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()

Answer

You're defining train_set and test_set as tuples, but I think that they should be lists. (In fact, test_set = ("The sun in the sky is bright.") has no trailing comma, so it is just a string; the vectorizer then treats each character as a separate document, which is why you see that long list of zeros.)

train_set = ["The sky is blue.", "The sun is bright."] #Documents
test_set = ["The sun in the sky is bright."] #Query

Using this, the code seems to run fine.
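For completeness, a sketch of the full pipeline the question is after (ranking the documents by cosine similarity against the query) could look like the following. It swaps the CountVectorizer + TfidfTransformer pair for TfidfVectorizer and uses sklearn's cosine_similarity helper, so it is an equivalent shortcut rather than the original code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords

train_set = ["The sky is blue.", "The sun is bright."]   # documents
test_set = ["The sun in the sky is bright."]             # query

stop_words = stopwords.words('english')

# Fit the tf-idf model on the documents, then project the query into the
# same vector space so both matrices share one vocabulary.
vectorizer = TfidfVectorizer(stop_words=stop_words)
doc_matrix = vectorizer.fit_transform(train_set)
query_matrix = vectorizer.transform(test_set)

# Cosine similarity of the query against every document; higher = better match.
scores = cosine_similarity(query_matrix, doc_matrix).flatten()
ranking = scores.argsort()[::-1]
print(ranking, scores[ranking])

With these two example documents, the second one ("The sun is bright.") should come out as the closer match, since it shares two terms with the query after stopword removal.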
