How to get most informative features for scikit-learn classifiers?
Question
The classifiers in machine learning packages like liblinear and nltk offer a method show_most_informative_features(), which is really helpful for debugging features:
viagra = None ok : spam = 4.5 : 1.0
hello = True ok : spam = 4.5 : 1.0
hello = None spam : ok = 3.3 : 1.0
viagra = True spam : ok = 3.3 : 1.0
casino = True spam : ok = 2.0 : 1.0
casino = None ok : spam = 1.5 : 1.0
My question is whether something similar is implemented for the classifiers in scikit-learn. I searched the documentation but couldn't find anything of the sort.
If there is no such function yet, does somebody know a workaround to get at those values?
Answer
With the help of larsmans' code, I came up with this code for the binary case:
def show_most_informative_features(vectorizer, clf, n=20):
    # On scikit-learn >= 1.0, use vectorizer.get_feature_names_out() instead
    feature_names = vectorizer.get_feature_names()
    # Pair each coefficient with its feature name, sorted ascending by weight
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    # Walk both ends of the sorted list in parallel: most negative vs. most positive
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("%.4f %-15s %.4f %-15s" % (coef_1, fn_1, coef_2, fn_2))