Determine the most important features per class

Problem description

Imagine a machine learning problem with 20 classes and about 7000 sparse boolean features.

I want to figure out the 20 most distinctive features for each class. In other words, features that occur often in one specific class but rarely or never in the other classes.

What would be a good feature selection algorithm or heuristic for this?

Recommended answer

When you train a multi-class logistic regression classifier, the trained model is a num_class x num_feature matrix whose [i,j] entry is the weight of feature j in class i. The feature indices are the same as the column indices of your input feature matrix.
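To make the shape of that weight matrix concrete, here is a minimal sketch on synthetic data. The toy sizes, the random data, and the choice of scikit-learn's LogisticRegression are assumptions made for illustration and are not part of the original answer. Feature k is deliberately tied to class k, so the largest weight in row k should land on column k.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 300, 12, 3  # toy sizes, not the 20 x 7000 from the question

# Random sparse boolean features and random class labels (illustrative data only)
X = (rng.random((n_samples, n_features)) < 0.1).astype(int)
y = rng.integers(0, n_classes, size=n_samples)

# Make feature k always present in class k, so it is distinctive for that class
for k in range(n_classes):
    X[y == k, k] = 1

model = LogisticRegression(max_iter=1000).fit(X, y)

# One row of weights per class, one column per feature
print(model.coef_.shape)

# Column index of the largest weight in each class row
top_feature_per_class = model.coef_.argmax(axis=1)
print(top_feature_per_class)
```

Note that for a two-class problem scikit-learn stores a single row of weights, so the per-class reading of the matrix applies to problems with three or more classes.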

In scikit-learn you can access the parameters of a fitted model. If you use one of scikit-learn's linear classifiers, which expose the weights as coef_, you can find the most important features per class like this:

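import numpy as np
from sklearn.linear_model import SGDClassifier

# regul is your regularization strength; scikit-learn before 1.1 uses loss='log', and very old releases use n_iter instead of max_iter
clf = SGDClassifier(loss='log_loss', alpha=regul, penalty='l1', l1_ratio=0.9, learning_rate='optimal', max_iter=10, shuffle=False, n_jobs=3, fit_intercept=True)
clf.fit(X_train, Y_train)
for i in range(clf.coef_.shape[0]):
    # indices of the 20 largest weights for class i, in ascending order of weight
    top20_indices = np.argsort(clf.coef_[i])[-20:]
    print(top20_indices)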

clf.coef_ is the matrix containing the weight of each feature in each class, so clf.coef_[0][2] is the weight of the third feature in the first class. If, when you build your feature matrix, you keep track of each feature's index in a dictionary where dic[id] = feature_name, you can use that dictionary to retrieve the names of the top features.
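The index-to-name lookup described above could be sketched as follows. The feature names, the small weight matrix standing in for clf.coef_, and the helper top_feature_names are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical index -> name mapping, built while constructing the feature matrix
dic = {0: "has_link", 1: "has_image", 2: "mentions_price", 3: "is_reply"}

# Hypothetical weights standing in for clf.coef_ (2 classes x 4 features)
coef = np.array([[0.1, 2.0, -0.5, 1.2],
                 [1.5, -0.3, 0.9, 0.0]])

def top_feature_names(coef, index_to_name, k):
    """Return, for each class, the k feature names with the largest weights, best first."""
    result = []
    for class_weights in coef:
        # argsort is ascending, so take the last k indices and reverse them
        top = np.argsort(class_weights)[-k:][::-1]
        result.append([index_to_name[int(j)] for j in top])
    return result

top_names = top_feature_names(coef, dic, k=2)
print(top_names)
```

If the feature matrix comes from a scikit-learn vectorizer such as CountVectorizer, its get_feature_names_out() method (available in recent releases) returns the names in column order and could take the place of the hand-built dictionary.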

For more information, refer to the scikit-learn text classification example.
