Scikit Learn TfidfVectorizer: How to get top n terms with highest tf-idf score


Question

I am working on a keyword extraction problem. Consider the very general case:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')

t = """Two Travellers, walking in the noonday sun, sought the shade of a widespreading tree to rest. As they lay looking up among the pleasant leaves, they saw that it was a Plane Tree.

"How useless is the Plane!" said one of them. "It bears no fruit whatever, and only serves to litter the ground with leaves."

"Ungrateful creatures!" said a voice from the Plane Tree. "You lie here in my cooling shade, and yet you say I am useless! Thus ungratefully, O Jupiter, do men receive their blessings!"

Our best blessings are often the least appreciated."""

tfs = tfidf.fit_transform(t.split(" "))
str = 'tree cat travellers fruit jupiter'
response = tfidf.transform([str])
feature_names = tfidf.get_feature_names()

for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])

This gives me

  (0, 28)   0.443509712811
  (0, 27)   0.517461475101
  (0, 8)    0.517461475101
  (0, 6)    0.517461475101
tree  -  0.443509712811
travellers  -  0.517461475101
jupiter  -  0.517461475101
fruit  -  0.517461475101

which is good. For any new document that comes in, is there a way to get the top n terms with the highest tf-idf score?

Recommended answer

You have to do a little bit of a song and dance to get the matrices as numpy arrays instead, but this should do what you're looking for:

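import numpy as np

# Feature names as an array, so they can be indexed with an array of positions.
# On scikit-learn 1.0+ use tfidf.get_feature_names_out(); get_feature_names() was removed in 1.2.
feature_array = np.array(tfidf.get_feature_names())
# Column indices ordered from highest to lowest tf-idf score (single document only)
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]

n = 3
top_n = feature_array[tfidf_sorting][:n]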

This gives me:

array([u'fruit', u'travellers', u'jupiter'], 
  dtype='<U13')

The argsort call is really the useful one (see the NumPy documentation for it). We have to do [::-1] because argsort only supports sorting small to large. We call flatten to reduce the dimensions to 1d so that the sorted indices can be used to index the 1d feature array. Note that including the call to flatten will only work if you're testing one document at a time.
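If several documents are scored at once, flatten no longer applies, but the same argsort idea can be used row by row. The following is only a sketch under stated assumptions: top_n_terms is a hypothetical helper, and the small scores array is made-up data standing in for something like tfidf.transform(docs).toarray().

```python
import numpy as np

def top_n_terms(scores, feature_names, n=3):
    """Return the n highest-scoring (term, score) pairs for each row of `scores`."""
    scores = np.asarray(scores)
    feature_names = np.asarray(feature_names)
    # argsort sorts ascending within each row; reverse the columns to get
    # descending order, then keep the first n column indices per row.
    order = np.argsort(scores, axis=1)[:, ::-1][:, :n]
    return [
        # Skip terms that do not occur in the document (score of 0).
        [(str(feature_names[j]), float(row[j])) for j in idx if row[j] > 0]
        for row, idx in zip(scores, order)
    ]

# Hypothetical dense scores: one row per document, one column per term.
feature_names = ["fruit", "jupiter", "shade", "travellers", "tree"]
scores = np.array([
    [0.0, 0.2, 0.0, 0.7, 0.5],
    [0.9, 0.0, 0.3, 0.0, 0.1],
])

result = top_n_terms(scores, feature_names, n=2)
for doc_index, terms in enumerate(result):
    print(doc_index, terms)
```

Calling toarray() on a large sparse matrix can use a lot of memory, so for big vocabularies it may be preferable to process one row at a time.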

Also, on another note, did you mean something like tfs = tfidf.fit_transform(t.split("\n\n"))? Otherwise, each term in the multiline string is being treated as a "document". Using \n\n instead means that we are actually looking at 4 documents (one for each paragraph), which makes more sense when you think about tf-idf.
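To make the difference concrete, here is a small sketch comparing the two ways of splitting. It assumes the alternative being suggested is a split on blank lines ("\n\n"), and it uses a shortened stand-in for the fable with the same four-paragraph structure.

```python
# Shortened stand-in for the fable in the question:
# four paragraphs separated by blank lines.
t = """Two Travellers sought the shade of a widespreading tree to rest.

"How useless is the Plane!" said one of them.

"Ungrateful creatures!" said a voice from the Plane Tree.

Our best blessings are often the least appreciated."""

# Splitting on a single space turns every space-delimited chunk into a "document".
word_docs = t.split(" ")
# Splitting on blank lines gives one document per paragraph.
paragraph_docs = t.split("\n\n")

print(len(word_docs), len(paragraph_docs))
```

Passing paragraph_docs to fit_transform means the document frequencies are counted over whole paragraphs rather than over near single-word fragments.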
