Multi-label feature selection using sklearn


Question

I'm looking to perform feature selection with a multi-label dataset using sklearn. I want to get the final set of features across labels, which I will then use in another machine learning package. I was planning to use the method I saw here, which selects relevant features for each label separately.

from sklearn.svm import LinearSVC
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline  # missing from the original snippet

clf = Pipeline([('chi2', SelectKBest(chi2, k=1000)),
                ('svm', LinearSVC())])
multi_clf = OneVsRestClassifier(clf)

I then plan to extract the indices of the included features, per label, using this:

selected_features = []
for i in multi_clf.estimators_:  # estimators_ exists only after multi_clf.fit(...)
    selected_features += list(i.named_steps["chi2"].get_support(indices=True))
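For reference, the per-label extraction above only works on a fitted classifier. A minimal self-contained sketch, using synthetic data from `make_multilabel_classification` (toy sizes, and `k` reduced to fit the toy feature count — these values are illustrative, not from the original question):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic multi-label data; the generated count features are
# non-negative, which chi2 requires.
X, Y = make_multilabel_classification(n_samples=200, n_features=50,
                                      n_classes=4, random_state=0)

clf = Pipeline([('chi2', SelectKBest(chi2, k=10)),
                ('svm', LinearSVC())])
multi_clf = OneVsRestClassifier(clf)
multi_clf.fit(X, Y)  # estimators_ is populated only after fit

selected_features = []
for est in multi_clf.estimators_:
    selected_features += list(est.named_steps['chi2'].get_support(indices=True))

print(len(selected_features))  # 4 labels * 10 selected features each = 40
```

Note that the resulting list can contain the same feature index several times, once per label that selected it.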

Now, my question is, how do I choose which selected features to include in my final model? I could use every unique feature (which would include features that were only relevant for one label), or I could do something to select features that were relevant for more labels.

My initial idea is to create a histogram of the number of labels a given feature was selected for, and to identify a threshold based on visual inspection. My concern is that this method is subjective. Is there a more principled way of performing feature selection for multilabel datasets using sklearn?
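The histogram idea can be made concrete by tallying, for each feature, how many labels selected it, and keeping features above a vote threshold. A hedged sketch (the index list and the threshold of 2 are made-up illustrations, not values from the question):

```python
from collections import Counter

# Hypothetical flat list of per-label selected indices,
# e.g. as built by the get_support loop above.
selected_features = [0, 2, 5, 0, 2, 7, 2, 5, 9]

votes = Counter(selected_features)  # feature index -> number of labels selecting it
threshold = 2                       # keep features chosen by at least 2 labels
final_features = sorted(f for f, n in votes.items() if n >= threshold)
print(final_features)  # [0, 2, 5]
```

The threshold remains a subjective choice here, which is exactly the concern raised above; the accepted answer below replaces voting with an aggregate chi-squared score.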

Answer

According to this:

[...] rank features according to the average or the maximum Chi-squared score across all labels, led to most of the best classifiers while using less features.

Then, in order to select a good subset of features you just need to do (something like) this:

import numpy as np
from sklearn.feature_selection import chi2, SelectKBest

scores = []
for label in labels:
    selector = SelectKBest(chi2, k='all')
    selector.fit(X, Y[label])
    scores.append(list(selector.scores_))

# Aggregate the per-label scores with ONE of the two criteria
# (running both in sequence would apply max to a boolean array):
selected_features = np.mean(scores, axis=0) > threshold  # MeanCS
selected_features = np.max(scores, axis=0) > threshold   # MaxCS

Note: in the code above I'm assuming that X is the output of some text vectorizer (the vectorized version of the texts) and Y is a pandas dataframe with one column per label (so I can select the column Y[label]). Also, there is a threshold variable that should be fixed beforehand.
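As a runnable illustration of the MeanCS/MaxCS idea with synthetic data (a NumPy label matrix is used instead of the pandas dataframe assumed above, and the median of the aggregated scores stands in for the threshold — both are assumptions for the sketch):

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import SelectKBest, chi2

# Toy data: the generated count features are non-negative, as chi2 requires.
X, Y = make_multilabel_classification(n_samples=200, n_features=30,
                                      n_classes=3, random_state=0)

scores = []
for j in range(Y.shape[1]):              # one chi2 pass per label column
    selector = SelectKBest(chi2, k='all')
    selector.fit(X, Y[:, j])
    scores.append(selector.scores_)
scores = np.asarray(scores)              # shape: (n_labels, n_features)

# Illustrative threshold: the median of the aggregated scores.
mean_scores = scores.mean(axis=0)
mean_mask = mean_scores > np.median(mean_scores)  # MeanCS
max_scores = scores.max(axis=0)
max_mask = max_scores > np.median(max_scores)     # MaxCS

X_reduced = X[:, mean_mask]              # keep only the MeanCS-selected columns
```

The boolean mask can then be applied to the feature matrix directly, or converted to indices with `np.flatnonzero(mean_mask)` for use in another package.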
