Computing separate tfidf scores for two different columns using sklearn
Problem description
I'm trying to compute the similarity between a set of queries and a set of results for each query. I would like to do this using tfidf scores and cosine similarity. The issue I'm having is that I can't figure out how to generate a tfidf matrix using two columns (in a pandas dataframe). I have concatenated the two columns and it works fine, but it's awkward to use since it needs to keep track of which query belongs to which result. How would I go about calculating a tfidf matrix for two columns at once? I'm using pandas and sklearn.
Here is the relevant code:
tf = TfidfVectorizer(analyzer='word', min_df = 0)
tfidf_matrix = tf.fit_transform(df_all['search_term'] + df_all['product_title']) # This line is the issue
feature_names = tf.get_feature_names()
I'm trying to pass df_all['search_term'] and df_all['product_title'] as arguments into tf.fit_transform. This clearly does not work since it just concatenates the strings together which does not allow me to compare the search_term to the product_title. Also, is there maybe a better way of going about this?
Recommended answer
You've made a good start by just putting all the words together; often a simple pipeline such as this will be enough to produce good results. You can build more complex feature processing pipelines using pipeline and preprocessing. Here's how it would work for your data:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

df_all = pd.DataFrame({'search_term': ['hat', 'cat'],
                       'product_title': ['hat stand', 'cat in hat']})

transformer = FeatureUnion([
    ('search_term_tfidf',
     Pipeline([('extract_field',
                FunctionTransformer(lambda x: x['search_term'],
                                    validate=False)),
               ('tfidf',
                TfidfVectorizer())])),
    ('product_title_tfidf',
     Pipeline([('extract_field',
                FunctionTransformer(lambda x: x['product_title'],
                                    validate=False)),
               ('tfidf',
                TfidfVectorizer())]))])

transformer.fit(df_all)

# On scikit-learn >= 1.0, get_feature_names() is deprecated (and removed
# in 1.2); use get_feature_names_out() there instead.
search_vocab = transformer.transformer_list[0][1].steps[1][1].get_feature_names()
product_vocab = transformer.transformer_list[1][1].steps[1][1].get_feature_names()
vocab = search_vocab + product_vocab

print(vocab)
print(transformer.transform(df_all).toarray())
['cat', 'hat', 'cat', 'hat', 'in', 'stand']
[[ 0. 1. 0. 0.57973867 0. 0.81480247]
[ 1. 0. 0.6316672 0.44943642 0.6316672 0. ]]
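Note that the FeatureUnion above fits a separate vocabulary per column, so the query block and the title block of each row live in different subspaces and can't be compared to each other directly. Since the original goal was cosine similarity between each query and its result, here is a minimal sketch (my own illustration, not part of the answer above) that instead fits a single TfidfVectorizer on the text of both columns, so the two matrices share one vocabulary and row i of each can be compared:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df_all = pd.DataFrame({'search_term': ['hat', 'cat'],
                       'product_title': ['hat stand', 'cat in hat']})

# Fit one vectorizer on the text of BOTH columns so the query and
# title matrices share a vocabulary and are directly comparable.
tf = TfidfVectorizer(analyzer='word')
tf.fit(pd.concat([df_all['search_term'], df_all['product_title']]))

query_vecs = tf.transform(df_all['search_term'])
title_vecs = tf.transform(df_all['product_title'])

# Row i of each matrix corresponds to row i of the dataframe, so the
# diagonal of the pairwise matrix is the similarity of each query to
# its own result.
sims = cosine_similarity(query_vecs, title_vecs).diagonal()
print(sims)
```

This keeps the query-to-result pairing implicit in the row order, which avoids the bookkeeping the question found awkward.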