Find the most common term in Scikit-learn classifier
Problem description
I'm following the example in the scikit-learn docs where CountVectorizer
is used on some dataset.
Question: count_vect.vocabulary_.viewitems()
lists all the terms and their frequencies. How do you sort them by the number of occurrences?
sorted( count_vect.vocabulary_.viewitems() )
does not seem to work.
Recommended answer
vocabulary_.viewitems()
does not in fact list the terms and their frequencies; it is a mapping from terms to their column indexes. The per-document frequencies are returned by the fit_transform method, which returns a sparse (coo) matrix whose rows are documents and whose columns are words (column indexes map to words via vocabulary_). You can get the total frequencies, for example, with:
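import numpy as np
matrix = count_vect.fit_transform(doc_list)
# sum(axis=0) gives a 1 x n_terms matrix; flatten it to a 1-D array of totals
totals = np.asarray(matrix.sum(axis=0)).ravel()
# scikit-learn >= 1.0; older releases use get_feature_names()
freqs = zip(count_vect.get_feature_names_out(), totals)
# sort from largest to smallest
print(sorted(freqs, key=lambda x: -x[1]))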
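For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. The toy `doc_list` and the names `totals`, `term_counts` and `ranked` are made up for illustration and are not part of the question. It looks counts up through `vocabulary_` directly, which shows that the mapping holds column indexes rather than frequencies, and it sidesteps the `get_feature_names()` / `get_feature_names_out()` difference between scikit-learn versions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the question's dataset.
doc_list = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]

count_vect = CountVectorizer()
matrix = count_vect.fit_transform(doc_list)  # sparse, shape (n_docs, n_terms)

# Column sums give the corpus-wide count of each term; flatten to a 1-D array.
totals = np.asarray(matrix.sum(axis=0)).ravel()

# vocabulary_ maps term -> column index, so use the index to look up the count.
term_counts = {
    term: int(totals[idx]) for term, idx in count_vect.vocabulary_.items()
}

# Sort by count, largest first; ties are broken alphabetically.
ranked = sorted(term_counts.items(), key=lambda kv: (-kv[1], kv[0]))
for term, count in ranked[:3]:
    print(term, count)
```

On this toy corpus `the` comes out on top with five occurrences. For a large vocabulary, sorting `totals` with `np.argsort` would probably be faster than sorting a list of tuples, but the dictionary version keeps the term-to-index lookup explicit.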