Difference between vocabulary and get_features() of TfidfVectorizer?


Question

I have

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Train the vectorizer
text="this is a simple example"
singleTFIDF = TfidfVectorizer(ngram_range=(1,2)).fit([text])
singleTFIDF.vocabulary_ # show the word-matrix position pairs

# Analyse the training string - text
single=singleTFIDF.transform([text])
single.toarray()  

I would like to associate each value in single with its corresponding feature. What is the structure of single? How can I map the position of a value in single to its feature?

How should I interpret the indices of vocabulary and get_features()? Are they related? According to the documentation, both associate features with indices, which I find confusing.

Recommended answer

The attribute vocabulary_ is a dictionary whose keys are the n-grams and whose values are the column positions of each n-gram (feature) in the tfidf matrix. The method get_feature_names() returns a list in which the n-grams are ordered by the column position of each feature. The two are therefore inverse views of the same mapping, and you can use either one to determine which tfidf column corresponds to which feature. In the example below, the tfidf matrix is converted to a pandas data frame by using the output of get_feature_names() to name the columns. Also note that all values have been given an equal weight (each n-gram occurs exactly once in the single document) and that the sum of the squares of all weights is equal to one.
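The equal weights and the unit sum of squares can be checked by hand. The following is a sketch that assumes the TfidfVectorizer defaults (smooth_idf=True, sublinear_tf=False, norm='l2'), with n = 1 document and 7 n-grams (4 unigrams and 3 bigrams, since the default tokenizer drops the one-letter token "a"), each occurring once:

```latex
\[
\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1
               = \ln\frac{1+1}{1+1} + 1 = 1,
\qquad
\mathrm{tf}(t) = 1
\]
\[
w_t = \frac{\mathrm{tf}(t)\,\mathrm{idf}(t)}
           {\sqrt{\sum_{t'} \bigl(\mathrm{tf}(t')\,\mathrm{idf}(t')\bigr)^2}}
    = \frac{1}{\sqrt{7}} \approx 0.377964,
\qquad
\sum_t w_t^2 = 7 \cdot \frac{1}{7} = 1
\]
```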

singleTFIDF.vocabulary_
Out[41]: 
{'this': 5,
 'is': 1,
 'simple': 3,
 'example': 0,
 'this is': 6,
 'is simple': 2,
 'simple example': 4}

singleTFIDF.get_feature_names()
Out[42]: ['example', 'is', 'is simple', 'simple', 'simple example', 'this', 'this is']

import pandas as pd
df = pd.DataFrame(single.toarray(), columns=singleTFIDF.get_feature_names())

df
Out[48]: 
    example        is  is simple    simple  simple example      this   this is
0  0.377964  0.377964   0.377964  0.377964        0.377964  0.377964  0.377964
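The relationship between the two views can also be checked programmatically. The sketch below is not part of the original answer. It rebuilds the vectorizer from the question, confirms that vocabulary_ and the feature-name list are inverse mappings, and pairs each stored value of the sparse row with its n-gram. The variable names (vectorizer, row, names, weights) are illustrative. In recent scikit-learn releases get_feature_names() has been replaced by get_feature_names_out(), so the code tries the newer name first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = "this is a simple example"
vectorizer = TfidfVectorizer(ngram_range=(1, 2)).fit([text])
row = vectorizer.transform([text])  # sparse matrix of shape (1, n_features)

# Newer scikit-learn exposes get_feature_names_out(); older releases only
# have get_feature_names(). Use whichever is available.
if hasattr(vectorizer, "get_feature_names_out"):
    names = [str(n) for n in vectorizer.get_feature_names_out()]
else:
    names = [str(n) for n in vectorizer.get_feature_names()]

# vocabulary_ maps n-gram -> column index; names maps column index -> n-gram.
for ngram, col in vectorizer.vocabulary_.items():
    assert names[col] == ngram

# The list can be rebuilt from the dictionary by sorting on the column index.
names_from_vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Pair every stored value in the sparse row with its feature name.
coo = row.tocoo()
weights = {names[j]: float(v) for j, v in zip(coo.col, coo.data)}

for ngram, weight in weights.items():
    print(f"{ngram!r}: {weight:.6f}")
```

Going through the COO form avoids densifying the matrix with toarray(), which matters once the vocabulary is large and most entries in a row are zero.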
