Getting feature names from within a FeatureUnion + Pipeline

Question

I am using a FeatureUnion to join features extracted from the title and description of events:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

# TextSelector is a custom transformer; its definition is not shown here.
union = FeatureUnion(
    transformer_list=[
        # Pipeline for pulling features from the event's title
        ('title', Pipeline([
            ('selector', TextSelector(key='title')),
            ('count', CountVectorizer(stop_words='english')),
        ])),

        # Pipeline for standard bag-of-words model for description
        ('description', Pipeline([
            ('selector', TextSelector(key='description_snippet')),
            ('count', TfidfVectorizer(stop_words='english')),
        ])),
    ],

    transformer_weights={
        'title': 1.0,
        'description': 0.2,
    },
)

However, calling union.get_feature_names() gives me an error: "Transformer title (type Pipeline) does not provide get_feature_names." I'd like to see some of the features generated by my different vectorizers. How do I do this?

Recommended answer

It's because you are using a custom transformer called TextSelector. Did you implement get_feature_names in TextSelector?

You will have to implement this method in your custom transformer if you want this to work.

Here is a concrete example:

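from sklearn.datasets import load_boston  # removed in scikit-learn 1.2
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.base import TransformerMixin
import pandas as pd

dat = load_boston()
X = pd.DataFrame(dat['data'], columns=dat['feature_names'])
y = dat['target']

# define first custom transformer
class first_transform(TransformerMixin):
    def fit(self, df, y=None):
        # remember the column names so get_feature_names can report them
        self.columns_ = df.columns.tolist()
        return self

    def transform(self, df):
        return df

    def get_feature_names(self):
        return self.columns_


# define second custom transformer
class second_transform(TransformerMixin):
    def fit(self, df, y=None):
        self.columns_ = df.columns.tolist()
        return self

    def transform(self, df):
        return df

    def get_feature_names(self):
        return self.columns_


pipe = Pipeline([
    ('features', FeatureUnion([
        ('custom_transform_first', first_transform()),
        ('custom_transform_second', second_transform()),
    ])),
])

# fit first so that each transformer has seen the DataFrame's columns
pipe.fit(X, y)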

>>> pipe.named_steps['features'].get_feature_names()
['custom_transform_first__CRIM',
 'custom_transform_first__ZN',
 'custom_transform_first__INDUS',
 'custom_transform_first__CHAS',
 'custom_transform_first__NOX',
 'custom_transform_first__RM',
 'custom_transform_first__AGE',
 'custom_transform_first__DIS',
 'custom_transform_first__RAD',
 'custom_transform_first__TAX',
 'custom_transform_first__PTRATIO',
 'custom_transform_first__B',
 'custom_transform_first__LSTAT',
 'custom_transform_second__CRIM',
 'custom_transform_second__ZN',
 'custom_transform_second__INDUS',
 'custom_transform_second__CHAS',
 'custom_transform_second__NOX',
 'custom_transform_second__RM',
 'custom_transform_second__AGE',
 'custom_transform_second__DIS',
 'custom_transform_second__RAD',
 'custom_transform_second__TAX',
 'custom_transform_second__PTRATIO',
 'custom_transform_second__B',
 'custom_transform_second__LSTAT']

Keep in mind that FeatureUnion concatenates the lists returned by get_feature_names from each of your transformers, prefixing every name with the transformer's name. This is why you get an error when one or more of your transformers does not have this method.
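The concatenation described above can be modelled in a few lines of plain Python. This is only a simplified sketch of the behaviour, not scikit-learn's source code; StubTransformer and combined_feature_names are hypothetical names invented for the illustration.

```python
class StubTransformer:
    """Hypothetical transformer that only knows its feature names."""

    def __init__(self, names):
        self._names = names

    def get_feature_names(self):
        return list(self._names)


def combined_feature_names(transformer_list):
    """Simplified model of how the union builds its names (an assumption)."""
    combined = []
    for name, transformer in transformer_list:
        # A single transformer without the method makes the whole call fail.
        if not hasattr(transformer, "get_feature_names"):
            raise AttributeError(
                f"Transformer {name} (type {type(transformer).__name__}) "
                "does not provide get_feature_names."
            )
        combined.extend(
            f"{name}__{feature}" for feature in transformer.get_feature_names()
        )
    return combined


names = combined_feature_names([
    ("first", StubTransformer(["CRIM", "ZN"])),
    ("second", StubTransformer(["CRIM", "ZN"])),
])
# names == ['first__CRIM', 'first__ZN', 'second__CRIM', 'second__ZN']
```

Passing an object without get_feature_names in place of one of the stubs raises an AttributeError with the same kind of message as in the question.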

However, this alone will not fix your problem, because Pipeline objects don't have a get_feature_names method and you have nested pipelines (pipelines within a FeatureUnion). So you have two options:

  1. Subclass Pipeline and add a get_feature_names method yourself, which gets the feature names from the last transformer in the chain.

  2. Extract the feature names yourself from each of the transformers, which means grabbing those transformers out of the pipeline and calling get_feature_names on them.
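Both options might look roughly like the sketch below. Several things here are assumptions: TextSelector is a hypothetical stand-in (the question does not show the real class) that picks one key from a list of dicts, and NamedPipeline, names_of, union_feature_names and the sample events are made up for the illustration. Since scikit-learn 1.0 the method is called get_feature_names_out, and get_feature_names was removed in 1.2, so the sketch checks for both.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class TextSelector(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the question's TextSelector."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Assumes X is a list of dicts; the original class is not shown.
        return [record[self.key] for record in X]


def names_of(step):
    """Feature names of a fitted step, on old or new scikit-learn."""
    if hasattr(step, "get_feature_names_out"):  # scikit-learn >= 1.0
        return list(step.get_feature_names_out())
    return list(step.get_feature_names())  # older releases


class NamedPipeline(Pipeline):
    """Option 1: a Pipeline that reports the names of its last step."""

    def get_feature_names(self):  # looked up by older FeatureUnion versions
        return names_of(self.steps[-1][1])

    def get_feature_names_out(self, input_features=None):  # newer versions
        return names_of(self.steps[-1][1])


def union_feature_names(union):
    """Option 2: pull the last step out of each branch by hand."""
    names = []
    for branch, pipeline in union.transformer_list:
        last_step = pipeline.steps[-1][1]
        names.extend(f"{branch}__{name}" for name in names_of(last_step))
    return names


# Made-up sample data for the illustration.
events = [
    {"title": "Jazz night", "description_snippet": "Live jazz downtown"},
    {"title": "Food fair", "description_snippet": "Street food and live music"},
]

union = FeatureUnion([
    ("title", NamedPipeline([
        ("selector", TextSelector(key="title")),
        ("count", CountVectorizer(stop_words="english")),
    ])),
    ("description", NamedPipeline([
        ("selector", TextSelector(key="description_snippet")),
        ("count", TfidfVectorizer(stop_words="english")),
    ])),
])
union.fit(events)

# Option 2 also works when the branches are plain Pipeline objects.
manual_names = union_feature_names(union)

# Option 1 lets the union answer for itself.
if hasattr(union, "get_feature_names_out"):
    union_names = list(union.get_feature_names_out())
else:
    union_names = list(union.get_feature_names())
```

Option 2 leaves the pipeline classes untouched, which may be preferable if the union is built somewhere you cannot easily change; option 1 keeps the usual call on the union working.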

Also, keep in mind that many built-in sklearn transformers don't operate on DataFrames but pass numpy arrays around, so watch out for that if you are going to chain lots of transformers together. But I think this gives you enough information to understand what is happening.
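A small sketch of that caveat, assuming default scikit-learn output settings: StandardScaler accepts a DataFrame but hands back a plain numpy array, so the column labels have to be reattached by hand if a later step needs them. The column names and values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data for the illustration.
df = pd.DataFrame({"rooms": [4.0, 6.0, 8.0], "age": [10.0, 20.0, 60.0]})

scaled = StandardScaler().fit_transform(df)

# With default settings the result is a plain array, so the labels are gone.
lost_labels = not hasattr(scaled, "columns")

# Reattach the labels by hand if a later step needs them.
relabelled = pd.DataFrame(scaled, columns=df.columns, index=df.index)
```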

One more thing: have a look at sklearn-pandas. I haven't used it myself, but it might provide a solution for you.
