How to get feature names selected by feature elimination in sklearn pipeline?


Question

I am using recursive feature elimination in my sklearn pipeline. The pipeline looks something like this:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# classifier
LinearSVC1 = LinearSVC(tol=1e-4,  C = 0.10000000000000001)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)

pipeline = Pipeline([
    ('features', FeatureUnion([
       ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features= 4000)), 
       ('custom_features', CustomFeatures())])),
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1),
    ])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

How can I get the names of the features selected by the RFE? RFE should select the best 500 features, but I really need to see which features have been selected.

I have a complex Pipeline which consists of multiple pipelines and feature unions, percentile feature selection and, at the end, Recursive Feature Elimination:

fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=90)
fs_vect = feature_selection.SelectPercentile(feature_selection.chi2, percentile=80)
f5 = feature_selection.RFE(estimator=svc, n_features_to_select=600, step=3)

countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=2000, analyzer=u'word', sublinear_tf=True, use_idf = True, min_df=2, max_df=0.85, lowercase = True)
countVecWord_tags = TfidfVectorizer(ngram_range=(1, 4), max_features= 1000, analyzer=u'word', min_df=2, max_df=0.85, sublinear_tf=True, use_idf = True, lowercase = False)

pipeline = Pipeline([
        ('union', FeatureUnion(
                transformer_list=[

                ('vectorized_pipeline', Pipeline([
                    ('union_vectorizer', FeatureUnion([

                        ('stem_text', Pipeline([
                            ('selector', ItemSelector(key='stem_text')),
                            ('stem_tfidf', countVecWord)
                        ])),

                        ('pos_text', Pipeline([
                            ('selector', ItemSelector(key='pos_text')),
                            ('pos_tfidf', countVecWord_tags)
                        ])),

                    ])),
                        ('percentile_feature_selection', fs_vect)
                    ])),


                ('custom_pipeline', Pipeline([
                    ('custom_features', FeatureUnion([

                        ('pos_cluster', Pipeline([
                            ('selector', ItemSelector(key='pos_text')),
                            ('pos_cluster_inner', pos_cluster)
                        ])),

                        ('stylistic_features', Pipeline([
                            ('selector', ItemSelector(key='raw_text')),
                            ('stylistic_features_inner', stylistic_features)
                        ])),


                    ])),
                        ('percentile_feature_selection', fs),
                        ('inner_scale', inner_scaler)
                ])),

                ],

                # weight components in FeatureUnion
                # n_jobs=6,

                transformer_weights={
                    'vectorized_pipeline': 0.8,  # 0.8,
                    'custom_pipeline': 1.0  # 1.0
                },
        )),

        ('rfe_feature_selection', f5),
        ('clf', classifier),
        ])

I'll try to explain the steps. The first Pipeline, called "vectorized_pipeline", consists of vectorizers, all of which have a get_feature_names function. The second Pipeline consists of my own features, which I have implemented with fit, transform and get_feature_names functions as well. When I use @Kevin's suggestion, I get an error that 'union' (the name of the top element in my pipeline) does not have a get_feature_names function:

support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['union'].get_feature_names()
print(np.array(feature_names)[support])

Also, when I try to get feature names from the individual FeatureUnions, like this:

support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline_age.named_steps['union_vectorizer'].get_feature_names()
print(np.array(feature_names)[support])

I get a KeyError:

feature_names = pipeline.named_steps['union_vectorizer'].get_feature_names()
KeyError: 'union_vectorizer'

Recommended answer

You can access each step of the Pipeline with the attribute named_steps. Here's an example on the iris dataset that selects only 2 features, but the solution will scale.

from sklearn import datasets
from sklearn import feature_selection
from sklearn.pipeline import Pipeline  # needed for Pipeline([...]) below
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris.data
y = iris.target

# classifier
LinearSVC1 = LinearSVC(tol=1e-4,  C = 0.10000000000000001)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=2, step=1)

pipeline = Pipeline([
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1)
    ])

pipeline.fit(X, y)

With named_steps you can access the attributes and methods of any transformer in the pipeline. The RFE attribute support_ (or the method get_support()) returns a boolean mask of the selected features:

support = pipeline.named_steps['rfe_feature_selection'].support_
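
To make the mask concrete, here is a minimal sketch on the same iris setup. It assumes a reasonably recent scikit-learn; the max_iter value and the variable names rfe, mask and indices are choices made for this illustration, not part of the original answer. It checks that support_ and get_support() agree and shows the integer-index form of the same selection.

```python
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

pipeline = Pipeline([
    ("rfe_feature_selection",
     RFE(estimator=LinearSVC(C=0.1, max_iter=10000),
         n_features_to_select=2, step=1)),
    ("clf", LinearSVC(C=0.1, max_iter=10000)),
])
pipeline.fit(X, y)

# Fetch the fitted RFE step by the name it was given in the pipeline.
rfe = pipeline.named_steps["rfe_feature_selection"]

# Boolean mask: one entry per input column, True where the column was kept.
mask = rfe.support_
assert np.array_equal(mask, rfe.get_support())

# The same selection expressed as integer column indices.
indices = rfe.get_support(indices=True)

# ranking_ is 1 for every selected feature; larger values were eliminated earlier.
print(mask, indices, rfe.ranking_)
```

The mask has the same length as the number of columns entering the RFE step, which is why it can be applied directly to an array of feature names in the same order.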

Now support is a boolean array that you can use to efficiently extract the names of your selected features (columns). Make sure your feature names are in a numpy array, not a Python list.

import numpy as np
feature_names = np.array(iris.feature_names) # transformed list to array

feature_names[support]

array(['sepal width (cm)', 'petal width (cm)'], 
      dtype='|S17')
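
The reason for the numpy array requirement can be sketched as follows. The mask here is written by hand to mirror the result shown above rather than computed by RFE, and names_list, selected and selected_list are names introduced only for this illustration. A plain list rejects a boolean mask, while an array accepts it; itertools.compress is one alternative if you prefer to stay with lists.

```python
from itertools import compress

import numpy as np

names_list = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]
# Hand-written stand-in for RFE.support_ (assumption for this sketch).
support = np.array([False, True, False, True])

# A Python list cannot be indexed with a boolean mask.
try:
    names_list[support]
except TypeError as exc:
    print("list indexing failed:", exc)

# Converting to a numpy array enables boolean-mask indexing.
selected = np.array(names_list)[support]

# Pure-Python alternative that keeps the result as a list.
selected_list = list(compress(names_list, support))

print(selected)
print(selected_list)
```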

EDIT

Per my comment above, here is your example with the CustomFeatures() function removed:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import numpy as np

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# classifier
LinearSVC1 = LinearSVC(tol=1e-4,  C = 0.10000000000000001)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)

pipeline = Pipeline([
    ('features', FeatureUnion([
       ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features= 4000))])), 
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1),
    ])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['features'].get_feature_names()
np.array(feature_names)[support]
