Identifying the most useful words in differentiating between classes


Question

Is it possible to use TF-IDF (TfidfVectorizer in Python) to figure out which words are most important when trying to distinguish between two text classes (e.g., positive vs. negative sentiment)? For example, which words are most important for identifying the positive class and, separately, which are most useful for identifying the negative class?

Recommended answer

You can let scikit-learn do the heavy lifting: train a random forest on your two-class TF-IDF data, extract the classifier's feature importance ranking, and use it to get the most important words:

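import numpy as np
from sklearn.ensemble import RandomForestClassifier

# data: TF-IDF matrix produced by a fitted TfidfVectorizer (vectorizer)
# labels: one class label per row of data
clf = RandomForestClassifier()
clf.fit(data, labels)

# Feature indices sorted from most to least important
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

# get_feature_names_out() replaces get_feature_names(),
# which was removed in scikit-learn 1.2
feature_names = vectorizer.get_feature_names_out()

# Keep the top-ranked words (at most 100)
top_words = [feature_names[i] for i in indices[:100]]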
Note that this only tells you which words are most important overall, not which class each word points to. To find out what each word says about each class, you can classify the individual words (for example, by passing each word through the vectorizer as a one-word document) and see which class the classifier assigns.

Another option is to take all positive (or negative) data samples, remove the word you are trying to understand from them, and see how this affects the classification of the samples.
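The word-removal idea can be sketched as follows, again on a made-up corpus. The helper `ablation_shift` is a hypothetical name: it deletes one word from every sample of a given class and reports the average change in the predicted probability of that class. The corpus, labels, and the choice of averaging the probability shift are assumptions for illustration only.

```python
import re

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; 1 = positive, 0 = negative.
texts = [
    "great movie loved it",
    "wonderful acting great plot",
    "loved the wonderful soundtrack",
    "terrible movie hated it",
    "awful acting terrible plot",
    "hated the awful soundtrack",
]
labels = np.array([1, 1, 1, 0, 0, 0])

vectorizer = TfidfVectorizer()
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(vectorizer.fit_transform(texts), labels)


def ablation_shift(word, target_class):
    """Mean change in P(target_class) over that class's samples
    after removing every occurrence of `word` from them."""
    samples = [t for t, y in zip(texts, labels) if y == target_class]
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    ablated = [pattern.sub(" ", t) for t in samples]
    column = list(clf.classes_).index(target_class)
    before = clf.predict_proba(vectorizer.transform(samples))[:, column]
    after = clf.predict_proba(vectorizer.transform(ablated))[:, column]
    return float(np.mean(after - before))


# A negative shift means the word was supporting that class.
shift_great_pos = ablation_shift("great", 1)
shift_great_neg = ablation_shift("great", 0)
print("positive samples:", shift_great_pos)
print("negative samples:", shift_great_neg)
```

Here "great" never appears in the negative samples, so removing it leaves them unchanged and the shift for the negative class is zero. Any shift for the positive class reflects how much the classifier relied on that word.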
