How to view the tf-idf score of each word

Question

I am trying to get the tf-idf score of each word in my document. However, the transformer only returns the values as a sparse matrix, whereas I have seen tf-idf scores presented as a list of words with a score against each one.

I have run the code below on my processed text and it works, but I want to change the way the result is presented:

Code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# text_process is the asker's own tokenizing function (not shown here).
bow_transformer = CountVectorizer(analyzer=text_process).fit(df["comments"])
print(len(bow_transformer.vocabulary_))

# Bag-of-words counts for every comment.
message_bow = bow_transformer.transform(df["comments"])

# Fit the tf-idf weights on the counts, then transform the counts.
tfidf_transformer = TfidfTransformer().fit(message_bow)
message_tfidf = tfidf_transformer.transform(message_bow)

I get results like (39028,01),(1393,1672). However, I expect the results to look like this:

features    tfidf
fruit       0.00344
excellent   0.00289

Recommended answer

You can get the result above with the following code:

def extract_topn_from_vector(feature_names, sorted_items, topn=5):
    """
      get the feature names and tf-idf score of top n items in the doc,                 
      in descending order of scores. 
    """

    # use only top n items from vector.
    sorted_items = sorted_items[:topn]

    results= {} 
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        results[feature_names[idx]] = round(score, 3)

    # return a sorted list of tuples with feature name and tf-idf score as its element(in descending order of tf-idf scores).
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

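# count_vect is the fitted CountVectorizer (bow_transformer in the question).
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
feature_names = count_vect.get_feature_names_out()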
coo_matrix = message_tfidf.tocoo()
tuples = zip(coo_matrix.col, coo_matrix.data)
sorted_items = sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

# extract only the top n elements.
# Here, n is 10.
word_tfidf = extract_topn_from_vector(feature_names, sorted_items, 10)

print("{}  {}".format("features", "tfidf"))  
for k in word_tfidf:
    print("{} - {}".format(k[0], k[1])) 

The full code below shows the snippet above in context, applied to a single sample document.

Full code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import re
import string
import nltk
import pandas as pd

# Requires the NLTK "stopwords" and "wordnet" data packages (see nltk.download).

data = pd.read_csv('yourfile.csv')

stops = set(stopwords.words("english"))
wl = nltk.WordNetLemmatizer()

def clean_text(text):
    """
      - Remove Punctuations
      - Tokenization
      - Remove Stopwords
      - stemming/lemmatizing
    """
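    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    # split the punctuation-free text (not the raw text) into tokens
    tokens = re.split(r"\W+", text_nopunct)
    # drop empty tokens and stopwords
    text = [word for word in tokens if word and word not in stops]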
    text = [wl.lemmatize(word) for word in text]
    return text

def extract_topn_from_vector(feature_names, sorted_items, topn=5):
    """
      get the feature names and tf-idf score of top n items in the doc,                 
      in descending order of scores. 
    """

    # use only top n items from vector.
    sorted_items = sorted_items[:topn]

    results= {} 
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        results[feature_names[idx]] = round(score, 3)

    # return a sorted list of tuples with feature name and tf-idf score as its element(in descending order of tf-idf scores).
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

count_vect = CountVectorizer(analyzer=clean_text, tokenizer = None, preprocessor = None, stop_words = None, max_features = 5000)                                        
freq_term_matrix = count_vect.fit_transform(data['text_body'])

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)  

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
feature_names = count_vect.get_feature_names_out()

# sample document
doc = 'watched horrid thing TV. Needless say one movies watch see much worse get.'

tf_idf_vector = tfidf.transform(count_vect.transform([doc]))

coo_matrix = tf_idf_vector.tocoo()
tuples = zip(coo_matrix.col, coo_matrix.data)
sorted_items = sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

# extract only the top n elements.
# Here, n is 10.
word_tfidf = extract_topn_from_vector(feature_names,sorted_items,10)

print("{}  {}".format("features", "tfidf"))  
for k in word_tfidf:
    print("{} - {}".format(k[0], k[1])) 

Sample output:

features  tfidf
Needless - 0.515
horrid - 0.501
worse - 0.312
watched - 0.275
TV - 0.272
say - 0.202
watch - 0.199
thing - 0.189
much - 0.177
see - 0.164
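
To see where numbers like these come from, here is a minimal pure-Python sketch of the computation that TfidfTransformer performs with its default settings (smoothed idf, raw term counts, l2 normalization per document). The function name tfidf_scores and the three toy documents are made up for illustration; they are not part of the answer or of scikit-learn, and the sketch ignores max_features and the sparse-matrix machinery.

```python
import math
from collections import Counter


def tfidf_scores(tokenized_docs, doc_index):
    """Return {term: tf-idf} for one document of a tokenized corpus.

    Hypothetical helper that follows TfidfTransformer's defaults:
    idf(t) = ln((1 + n) / (1 + df(t))) + 1, then l2 normalization.
    """
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Term frequency: raw counts in the chosen document.
    tf = Counter(tokenized_docs[doc_index])
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
        for term, count in tf.items()
    }
    # l2-normalize so the document's score vector has unit length.
    norm = math.sqrt(sum(value * value for value in raw.values()))
    if norm == 0:
        return {}
    return {term: value / norm for term, value in raw.items()}


# Toy corpus (an assumption, chosen only to keep the arithmetic small).
docs = [
    ["fruit", "excellent", "fruit"],
    ["fruit", "bad"],
    ["excellent", "service"],
]

scores = tfidf_scores(docs, 0)

print("{}  {}".format("features", "tfidf"))
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print("{} - {}".format(term, round(score, 3)))
```

In this toy corpus "fruit" and "excellent" each appear in two of the three documents, so they share the same idf, and "fruit" scores twice as high in the first document only because it occurs twice there.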
