How to get feature names selected by feature elimination in sklearn pipeline?

Views: 74 | Published: 2020/5/4 9:35:22 | Tags: python machine-learning scikit-learn


Problem description

I am using recursive feature elimination (RFE) in my sklearn pipeline. The pipeline looks something like this:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# classifier
LinearSVC1 = LinearSVC(tol=1e-4,  C = 0.10000000000000001)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)

pipeline = Pipeline([
    ('features', FeatureUnion([
       ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features= 4000)), 
       ('custom_features', CustomFeatures())])),
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1),
    ])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

How can I get the names of the features selected by the RFE? RFE should select the best 500 features, but I really need to see which features have been selected.

I have a complex Pipeline that consists of multiple pipelines and feature unions, percentile feature selection and, at the end, Recursive Feature Elimination:

fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=90)
fs_vect = feature_selection.SelectPercentile(feature_selection.chi2, percentile=80)
f5 = feature_selection.RFE(estimator=svc, n_features_to_select=600, step=3)

countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=2000, analyzer=u'word', sublinear_tf=True, use_idf = True, min_df=2, max_df=0.85, lowercase = True)
countVecWord_tags = TfidfVectorizer(ngram_range=(1, 4), max_features= 1000, analyzer=u'word', min_df=2, max_df=0.85, sublinear_tf=True, use_idf = True, lowercase = False)

pipeline = Pipeline([
        ('union', FeatureUnion(
                transformer_list=[

                ('vectorized_pipeline', Pipeline([
                    ('union_vectorizer', FeatureUnion([

                        ('stem_text', Pipeline([
                            ('selector', ItemSelector(key='stem_text')),
                            ('stem_tfidf', countVecWord)
                        ])),

                        ('pos_text', Pipeline([
                            ('selector', ItemSelector(key='pos_text')),
                            ('pos_tfidf', countVecWord_tags)
                        ])),

                    ])),
                        ('percentile_feature_selection', fs_vect)
                    ])),


                ('custom_pipeline', Pipeline([
                    ('custom_features', FeatureUnion([

                        ('pos_cluster', Pipeline([
                            ('selector', ItemSelector(key='pos_text')),
                            ('pos_cluster_inner', pos_cluster)
                        ])),

                        ('stylistic_features', Pipeline([
                            ('selector', ItemSelector(key='raw_text')),
                            ('stylistic_features_inner', stylistic_features)
                        ])),


                    ])),
                        ('percentile_feature_selection', fs),
                        ('inner_scale', inner_scaler)
                ])),

                ],

                # weight components in FeatureUnion
                # n_jobs=6,

                transformer_weights={
                    'vectorized_pipeline': 0.8,  # 0.8,
                    'custom_pipeline': 1.0  # 1.0
                },
        )),

        ('rfe_feature_selection', f5),
        ('clf', classifier),
        ])

I'll try to explain the steps. The first Pipeline consists of vectorizers and is called "vectorized_pipeline"; all of these have a get_feature_names function. The second Pipeline consists of my own features, which I have also implemented with fit, transform and get_feature_names functions. When I use @Kevin's suggestion, I get an error saying that 'union' (the name of the top element in my pipeline) does not have a get_feature_names function:

support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['union'].get_feature_names()
print(np.array(feature_names)[support])

Also, when I try to get feature names from the individual FeatureUnions, like this:

support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline_age.named_steps['union_vectorizer'].get_feature_names()
print(np.array(feature_names)[support])

I get a KeyError:

feature_names = pipeline.named_steps['union_vectorizer'].get_feature_names()
KeyError: 'union_vectorizer'
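
The KeyError is probably explained by the fact that named_steps only exposes the top-level steps of a Pipeline ('union', 'rfe_feature_selection' and 'clf' in the pipeline above), while 'union_vectorizer' sits two levels further down. Nested components have to be reached one level at a time. The following is a minimal sketch on a reduced stand-in for the pipeline in the question; the two vectorizers and their step names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

# Reduced stand-in: a Pipeline nested in a FeatureUnion nested in a Pipeline.
inner_union = FeatureUnion([
    ('word_tfidf', TfidfVectorizer()),
    ('char_counts', CountVectorizer(analyzer='char')),
])
pipeline = Pipeline([
    ('union', FeatureUnion([
        ('vectorized_pipeline', Pipeline([('union_vectorizer', inner_union)])),
    ])),
])
pipeline.fit(['I am a sentence', 'an example'])

# Only the outermost step names are keys of named_steps.
top_level = list(pipeline.named_steps)

# Drill down: Pipeline -> FeatureUnion -> nested Pipeline -> nested FeatureUnion.
outer_union = pipeline.named_steps['union']
nested_pipeline = dict(outer_union.transformer_list)['vectorized_pipeline']
nested_union = nested_pipeline.named_steps['union_vectorizer']
print(top_level, type(nested_union).__name__)
```

FeatureUnion stores its members in transformer_list as (name, transformer) pairs, which is why the sketch converts it to a dict before looking up a member by name.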

Recommended answer

EDIT

Per my comment above, here is your example with the CustomFeatures() function removed:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import numpy as np

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# classifier
LinearSVC1 = LinearSVC(tol=1e-4,  C = 0.10000000000000001)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)

pipeline = Pipeline([
    ('features', FeatureUnion([
       ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features= 4000))])), 
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1),
    ])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['features'].get_feature_names()
np.array(feature_names)[support]
