Sklearn how to get the 10 words from each topic
Question
I want to get the top 10 words by weight for each topic. After applying TfidfTransformer, I get the following output, whose type is scipy.sparse.csr.csr_matrix:
But I don't know how to get the highest ten from each row. In the data, entries like (0, ****) belong to row 0, and so on up to (5170, *****) for row 5170.
I've tried converting it to a NumPy array, but it fails.
(0, 19016) 0.024214182003181053
(0, 28002) 0.03661443306612277
(0, 6710) 0.02292100371816788
(0, 27683) 0.013973969726506812
(0, 27104) 0.02236713272585597
(0, 6889) 0.0403281034949193
.
.
.
(5169, 3236) 0.014432449220428715
(5169, 19134) 0.014346823328868169
(5169, 32915) 0.002047199186262409
(5170, 35899) 0.49931779368675605
(5170, 36444) 0.3479717717856863
(5170, 15014) 0.5608169649159123
Answer
You can use the TfidfVectorizer to expose the get_feature_names method. The transformer doesn't have this method, but the docs clearly state that the Vectorizer is equivalent to CountVectorizer followed by the transformer. If you don't want to use this, then I think you're going to be stuck building a lookup before you vectorize.
TfidfVectorizer in the docs: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
To sort and slice the output of fit_transform from the TfidfVectorizer, normal sparse matrix operations should work.