Computing separate tfidf scores for two different columns using sklearn


Question

I'm trying to compute the similarity between a set of queries and a set of results for each query. I would like to do this using tfidf scores and cosine similarity. The issue I'm having is that I can't figure out how to generate a tfidf matrix from two columns of a pandas DataFrame. I have concatenated the two columns and that works, but it's awkward to use, since I then have to keep track of which query belongs to which result. How would I go about calculating a tfidf matrix for two columns at once? I'm using pandas and sklearn.

Here is the relevant code:

from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(analyzer='word', min_df=1)
tfidf_matrix = tf.fit_transform(df_all['search_term'] + df_all['product_title'])  # This line is the issue
feature_names = tf.get_feature_names_out()

I'm trying to pass df_all['search_term'] and df_all['product_title'] as arguments to tf.fit_transform. This clearly does not work, since it just concatenates the strings, which does not allow me to compare the search_term to the product_title. Is there perhaps a better way of going about this?

Recommended answer

You've made a good start by just putting all the words together; often a simple pipeline such as this is enough to produce good results. You can build more complex feature-processing pipelines using sklearn's pipeline and preprocessing modules. Here's how it would work for your data:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

df_all = pd.DataFrame({'search_term':['hat','cat'], 
                       'product_title':['hat stand','cat in hat']})

transformer = FeatureUnion([
                ('search_term_tfidf', 
                  Pipeline([('extract_field',
                              FunctionTransformer(lambda x: x['search_term'], 
                                                  validate=False)),
                            ('tfidf', 
                              TfidfVectorizer())])),
                ('product_title_tfidf', 
                  Pipeline([('extract_field', 
                              FunctionTransformer(lambda x: x['product_title'], 
                                                  validate=False)),
                            ('tfidf', 
                              TfidfVectorizer())]))]) 

transformer.fit(df_all)

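# Vocabulary learned by each column's vectorizer. get_feature_names_out()
# returns a NumPy array, so convert to lists before concatenating.
search_vocab = transformer.transformer_list[0][1].steps[1][1].get_feature_names_out()
product_vocab = transformer.transformer_list[1][1].steps[1][1].get_feature_names_out()
vocab = list(search_vocab) + list(product_vocab)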

print(vocab)
print(transformer.transform(df_all).toarray())

['cat', 'hat', 'cat', 'hat', 'in', 'stand']

[[ 0.          1.          0.          0.57973867  0.          0.81480247]
 [ 1.          0.          0.6316672   0.44943642  0.6316672   0.        ]]
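
Note that the two blocks above come from separately fitted vectorizers, so their columns do not line up and cannot be compared directly. If the goal is the cosine similarity between each search_term and its own product_title, one possible approach (a different technique from the FeatureUnion above, sketched here under the assumption that a single vocabulary shared by both columns is acceptable) is to fit one TfidfVectorizer on the text of both columns and then transform each column separately. The `similarity` column name is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df_all = pd.DataFrame({'search_term': ['hat', 'cat'],
                       'product_title': ['hat stand', 'cat in hat']})

# Assumption of this sketch: fit ONE vectorizer on the text of both columns,
# so that both share the same vocabulary and idf weights.
tf = TfidfVectorizer()
tf.fit(pd.concat([df_all['search_term'], df_all['product_title']]))

# Transform each column separately; row i of each matrix belongs to row i of df_all,
# so there is no need to track which query goes with which result.
search_tfidf = tf.transform(df_all['search_term'])
title_tfidf = tf.transform(df_all['product_title'])

# TfidfVectorizer L2-normalizes each row by default (norm='l2'), so the
# row-wise dot product is the cosine similarity of each query with its title.
row_dot = search_tfidf.multiply(title_tfidf).sum(axis=1)
df_all['similarity'] = np.asarray(row_dot).ravel()

print(df_all)
```

Because every row is compared only with its own counterpart, this avoids building the full query-by-title similarity matrix that sklearn.metrics.pairwise.cosine_similarity would return.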
