Find the most common term in a Scikit-learn classifier


Question

I'm following the example in the Scikit-learn docs where CountVectorizer is used on some dataset.

Question: count_vect.vocabulary_.viewitems() lists all the terms and their frequencies. How do you sort them by the number of occurrences?

sorted(count_vect.vocabulary_.viewitems()) does not seem to work.

Recommended answer

vocabulary_.viewitems() does not in fact list the terms and their frequencies; it is a mapping from terms to their column indexes. The per-document frequencies are returned by the fit_transform method, which returns a sparse matrix (COO in the release this answer was written for; recent releases return CSR), where the rows are documents and the columns are words (column indexes are mapped to words via vocabulary_). You can get the total frequencies, for example, by summing each column:

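import numpy as np
matrix = count_vect.fit_transform(doc_list)
# flatten the 1 x n_terms column sums into a 1-D array of per-term totals
totals = np.asarray(matrix.sum(axis=0)).ravel()
# on scikit-learn < 1.0, use get_feature_names() instead
freqs = zip(count_vect.get_feature_names_out(), totals)
# sort from largest to smallest
print(sorted(freqs, key=lambda x: -x[1]))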
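
To make the difference between vocabulary_ and the actual counts concrete, here is a minimal self-contained sketch. The toy corpus in doc_list and the names totals, term_counts and ranked are assumptions made for illustration only; they do not come from the question. The sketch looks terms up through vocabulary_ directly, so it does not depend on which feature-name method the installed scikit-learn version provides.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
doc_list = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]

count_vect = CountVectorizer()
matrix = count_vect.fit_transform(doc_list)  # sparse document-term matrix

# vocabulary_ maps each term to its column index, not to a count.
vocab = count_vect.vocabulary_

# Sum every column over all documents, then flatten the 1 x n_terms
# result into a plain 1-D array of per-term totals.
totals = np.asarray(matrix.sum(axis=0)).ravel()

# Pair each term with its total by looking up its column index.
term_counts = {term: int(totals[idx]) for term, idx in vocab.items()}

# Sort by count (descending), breaking ties alphabetically.
ranked = sorted(term_counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked[:3])
```

With the default tokenizer settings, "the" should come out on top with a total of 5, followed by "cat" with 2. Sorting vocabulary_ itself would only order the terms by name or column index, which is why the attempt in the question did not give the expected result.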
