TF*IDF for Search Queries

Question

Okay, so I have been following these two posts on TF*IDF but am a little confused: http://css.dzone.com/articles/machine-learning-text-feature

Basically, I want to create a search query that searches through multiple documents. I would like to use the scikit-learn toolkit as well as the NLTK library for Python.

The problem is that I don't see where the two TF*IDF vectors come from. I need one search query and multiple documents to search. I figured that I would calculate the TF*IDF scores of each document against each query, find the cosine similarity between them, and then rank the documents by sorting the scores in descending order. However, the code doesn't seem to come up with the right vectors.

Whenever I reduce the query to only one search, it returns a huge list of 0's, which is really strange.

Here is the code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords

train_set = ("The sky is blue.", "The sun is bright.") #Documents
test_set = ("The sun in the sky is bright.") #Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
transformer = TfidfTransformer()

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray

transformer.fit(trainVectorizerArray)
print transformer.transform(trainVectorizerArray).toarray()

transformer.fit(testVectorizerArray)

tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()

Recommended answer

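You're defining train_set and test_set as tuples, but I think that they should be lists. (Strictly speaking, test_set is not even a tuple: parentheses around a single string with no trailing comma leave it a plain string, so anything that iterates over it sees individual characters rather than one document. That would be consistent with the long list of 0's you describe.) Define both as lists instead: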

train_set = ["The sky is blue.", "The sun is bright."] #Documents
test_set = ["The sun in the sky is bright."] #Query
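To illustrate why the brackets matter, here is a minimal sketch in plain Python (no scikit-learn involved). The variable names as_string, as_tuple and as_list are made up for this illustration and do not come from the question's code.

```python
# Parentheses alone do not create a tuple; the trailing comma does.
as_string = ("The sun in the sky is bright.")    # still a plain str
as_tuple = ("The sun in the sky is bright.",)    # a tuple holding one document
as_list = ["The sun in the sky is bright."]      # a list holding one document

print(type(as_string).__name__)  # str
print(type(as_tuple).__name__)   # tuple
print(type(as_list).__name__)    # list

# Iterating over the string yields single characters, not one document,
# while the tuple and the list each yield exactly one item.
items_from_string = list(as_string)
print(items_from_string[:3])
print(len(as_tuple), len(as_list))
```

Code that expects an iterable of documents and is handed the plain string may end up treating every character as its own "document". As far as I know, recent scikit-learn releases reject a bare string with an error instead, so the exact symptom depends on the installed version.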

With lists for both train_set and test_set, the code seems to run fine.
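For the ranking step described in the question (TF*IDF vectors for the documents and the query, cosine similarity, sort in descending order), one possible sketch is shown below. It is not the asker's code: it swaps CountVectorizer plus TfidfTransformer for scikit-learn's TfidfVectorizer, uses scikit-learn's built-in English stop word list instead of NLTK's, and uses Python 3 print syntax. The names documents, query, scores and ranking are chosen for this example only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["The sky is blue.", "The sun is bright."]  # corpus to search
query = ["The sun in the sky is bright."]               # a list with one query

# Learn the vocabulary and IDF weights from the documents only.
vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(documents)

# Project the query into the same vector space; do not refit on the query.
query_vector = vectorizer.transform(query)

# One similarity score per document, flattened to a 1-D array.
scores = cosine_similarity(query_vector, doc_vectors).ravel()

# Rank documents by score, highest first.
ranking = sorted(zip(scores, documents), reverse=True)
for score, doc in ranking:
    print(round(float(score), 3), doc)
```

The point of the design is that the vectorizer is fitted once on the documents and only applied to the query, so both sets of vectors share the same vocabulary and IDF weights. That is where the "two TF*IDF vectors" being compared come from.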
