TF*IDF for Search Queries
Question
Okay, so I have been following these two posts on TF*IDF but am a little confused: http://css.dzone.com/articles/machine-learning-text-feature
Basically, I want to run a search query against multiple documents. I would like to use the scikit-learn toolkit as well as the NLTK library for Python.
The problem is that I don't see where the two TF*IDF vectors come from. I need one search query and multiple documents to search. I figured that I should calculate the TF*IDF scores of each document and of the query, find the cosine similarity between them, and then rank the documents by sorting the scores in descending order. However, the code doesn't seem to come up with the right vectors.
Whenever I reduce the query to only one search, it returns a huge list of 0's, which is really strange.
Here is the code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
train_set = ("The sky is blue.", "The sun is bright.") #Documents
test_set = ("The sun in the sky is bright.") #Query
stopWords = stopwords.words('english')
vectorizer = CountVectorizer(stop_words = stopWords)
transformer = TfidfTransformer()
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray
transformer.fit(trainVectorizerArray)
print transformer.transform(trainVectorizerArray).toarray()
transformer.fit(testVectorizerArray)
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()
Recommended answer
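You're defining train_set and test_set with parentheses, but I think that they should be lists. Note that test_set has no trailing comma inside its parentheses, so it is a plain string rather than a one-element tuple, and the vectorizer does not receive it as a single document: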
train_set = ["The sky is blue.", "The sun is bright."] #Documents
test_set = ["The sun in the sky is bright."] #Query
With this change, the code seems to run fine.
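The ranking step described in the question (TF*IDF vectors for the documents and the query, cosine similarity, sort in descending order) could look roughly like the sketch below. It is only one way to do it and rests on a few assumptions that are not in the original code: it uses TfidfVectorizer, which combines CountVectorizer and TfidfTransformer in one step; it uses scikit-learn's built-in English stop-word list instead of NLTK's so that no corpus download is needed; and it learns the IDF weights from the documents only, then reuses them for the query instead of fitting the transformer a second time on the query.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["The sky is blue.", "The sun is bright."]
query = ["The sun in the sky is bright."]  # a list, even for a single query

# Assumption: scikit-learn's built-in English stop words stand in for NLTK's list.
vectorizer = TfidfVectorizer(stop_words="english")

# Learn the vocabulary and IDF weights from the documents only.
doc_vectors = vectorizer.fit_transform(documents)

# Project the query into the same vector space without refitting.
query_vector = vectorizer.transform(query)

# One similarity score per document, for the single query row.
scores = cosine_similarity(query_vector, doc_vectors)[0]

# Pair each document index with its score and sort, best match first.
ranking = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

for index, score in ranking:
    print(round(float(score), 3), documents[index])
```

Because the query is transformed with the vocabulary learned from the documents, both sets of vectors share the same columns, which is what makes the cosine similarity between them meaningful.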