Getting feature names from within a FeatureUnion + Pipeline


Question

I am using a FeatureUnion to join features extracted from the title and description of events:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

# TextSelector is a custom transformer (its definition is not shown here)
union = FeatureUnion(
    transformer_list=[
        # Pipeline for pulling features from the event's title
        ('title', Pipeline([
            ('selector', TextSelector(key='title')),
            ('count', CountVectorizer(stop_words='english')),
        ])),

        # Pipeline for standard bag-of-words model for description
        ('description', Pipeline([
            ('selector', TextSelector(key='description_snippet')),
            ('count', TfidfVectorizer(stop_words='english')),
        ])),
    ],

    transformer_weights={
        'title': 1.0,
        'description': 0.2,
    },
)

However, calling union.get_feature_names() gives me an error: "Transformer title (type Pipeline) does not provide get_feature_names." I'd like to see some of the features generated by my different vectorizers. How do I do this?

Recommended answer

It's because you are using a custom transformer called TextSelector. Did you implement get_feature_names in TextSelector?

You will have to implement this method in your custom transformer if you want this to work.

Here is a concrete example:

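# Requires scikit-learn < 1.2: load_boston and get_feature_names were both removed in 1.2
from sklearn.datasets import load_boston
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

dat = load_boston()
X = pd.DataFrame(dat['data'], columns=dat['feature_names'])
y = dat['target']

# define first custom transformer
class first_transform(TransformerMixin, BaseEstimator):
    def fit(self, df, y=None):
        # remember the column names so get_feature_names can report them later
        self.columns_ = df.columns.tolist()
        return self

    def transform(self, df):
        return df

    def get_feature_names(self):
        return self.columns_


# define second custom transformer (identical, to show how the names are combined)
class second_transform(TransformerMixin, BaseEstimator):
    def fit(self, df, y=None):
        self.columns_ = df.columns.tolist()
        return self

    def transform(self, df):
        return df

    def get_feature_names(self):
        return self.columns_


pipe = Pipeline([
    ('features', FeatureUnion([
        ('custom_transform_first', first_transform()),
        ('custom_transform_second', second_transform()),
    ])),
])

# fit first, so that each transformer has seen the column names
pipe.fit(X, y)

>>> pipe.named_steps['features'].get_feature_names()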
['custom_transform_first__CRIM',
 'custom_transform_first__ZN',
 'custom_transform_first__INDUS',
 'custom_transform_first__CHAS',
 'custom_transform_first__NOX',
 'custom_transform_first__RM',
 'custom_transform_first__AGE',
 'custom_transform_first__DIS',
 'custom_transform_first__RAD',
 'custom_transform_first__TAX',
 'custom_transform_first__PTRATIO',
 'custom_transform_first__B',
 'custom_transform_first__LSTAT',
 'custom_transform_second__CRIM',
 'custom_transform_second__ZN',
 'custom_transform_second__INDUS',
 'custom_transform_second__CHAS',
 'custom_transform_second__NOX',
 'custom_transform_second__RM',
 'custom_transform_second__AGE',
 'custom_transform_second__DIS',
 'custom_transform_second__RAD',
 'custom_transform_second__TAX',
 'custom_transform_second__PTRATIO',
 'custom_transform_second__B',
 'custom_transform_second__LSTAT']

Keep in mind that FeatureUnion concatenates the lists returned by get_feature_names from each of your transformers, prefixing each name with the name of its transformer. This is why you get an error when one or more of your transformers does not have this method.
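
To make that concatenation concrete, here is a simplified model of it in plain Python. It is only a sketch of the behaviour described above, not scikit-learn's actual source: StubTransformer and union_feature_names are hypothetical names made up for this illustration, and the error text is copied from the message quoted in the question.

```python
class StubTransformer:
    """Hypothetical stand-in for a transformer that can report its feature names."""

    def __init__(self, names):
        self._names = list(names)

    def get_feature_names(self):
        return self._names


def union_feature_names(transformer_list):
    """Simplified model: prefix every feature name with the name of its transformer."""
    feature_names = []
    for name, trans in transformer_list:
        if not hasattr(trans, "get_feature_names"):
            # Same failure mode as in the question: one member cannot report names
            raise AttributeError(
                "Transformer %s (type %s) does not provide get_feature_names."
                % (name, type(trans).__name__)
            )
        feature_names.extend(name + "__" + f for f in trans.get_feature_names())
    return feature_names


names = union_feature_names([
    ("first", StubTransformer(["CRIM", "ZN"])),
    ("second", StubTransformer(["CRIM", "ZN"])),
])
print(names)  # ['first__CRIM', 'first__ZN', 'second__CRIM', 'second__ZN']
```

Passing in any member without a get_feature_names method makes the whole call fail, which matches the error you are seeing.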

However, this alone will not fix your problem: Pipeline objects do not have a get_feature_names method, and you have nested pipelines (pipelines inside a FeatureUnion). So you have two options:

  1. Subclass Pipeline and add a get_feature_names method yourself, which gets the feature names from the last transformer in the chain.

  2. Extract the feature names yourself from each of the transformers, which requires you to grab those transformers out of the pipeline and call get_feature_names on them.
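
Both options can be sketched roughly as follows. This is a minimal illustration under several assumptions, not a definitive implementation: the TextSelector below is a guess at what the question's selector does (picking one key out of each record), NamedPipeline and last_step_names are hypothetical names, and the two sample events are made up. Because newer scikit-learn releases renamed the vectorizers' method to get_feature_names_out, the helper tries that spelling first.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class TextSelector(TransformerMixin, BaseEstimator):
    """Assumed stand-in for the question's selector: pulls one field out of each record."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [record[self.key] for record in X]


def last_step_names(pipeline):
    """Ask the final step of a fitted pipeline for its feature names."""
    last = pipeline.steps[-1][1]
    # Newer scikit-learn releases spell this method get_feature_names_out
    if hasattr(last, "get_feature_names_out"):
        return list(last.get_feature_names_out())
    return list(last.get_feature_names())


class NamedPipeline(Pipeline):
    """Option 1: a Pipeline that reports the feature names of its last step."""

    def get_feature_names(self):
        return last_step_names(self)


union = FeatureUnion([
    ("title", NamedPipeline([
        ("selector", TextSelector(key="title")),
        ("count", CountVectorizer(stop_words="english")),
    ])),
    ("description", NamedPipeline([
        ("selector", TextSelector(key="description_snippet")),
        ("count", TfidfVectorizer(stop_words="english")),
    ])),
])

# Made-up sample events, only here so that the vectorizers have something to fit
events = [
    {"title": "jazz concert", "description_snippet": "live music downtown"},
    {"title": "art fair", "description_snippet": "local art and music"},
]
union.fit(events)

# Option 1 pays off on releases where FeatureUnion still has get_feature_names
if hasattr(union, "get_feature_names"):
    print(union.get_feature_names())

# Option 2: walk the union and build the prefixed names by hand
manual_names = [
    name + "__" + feature
    for name, pipeline in union.transformer_list
    for feature in last_step_names(pipeline)
]
print(manual_names)
```

Option 2 does not depend on the union-level method at all, so it is probably the more portable of the two if you need to support several scikit-learn versions.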

Also, keep in mind that many of sklearn's built-in transformers do not operate on DataFrames but pass numpy arrays around, so watch out for that if you are going to chain lots of transformers together. But I think this gives you enough information to understand what is happening.
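
A small sketch of that DataFrame caveat, assuming scikit-learn's default output settings: StandardScaler is used here only as a convenient built-in example, and the two columns hold made-up values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values, just enough rows to scale
df = pd.DataFrame({"RM": [6.5, 6.4, 7.1], "AGE": [65.2, 78.9, 61.1]})

# With default settings the scaler hands back a plain array, so the column labels are gone
scaled = StandardScaler().fit_transform(df)
print(type(scaled).__name__)

# Re-attach the labels by hand before passing the data on to DataFrame-based transformers
restored = pd.DataFrame(scaled, columns=df.columns, index=df.index)
print(restored.columns.tolist())
```

A custom transformer like the ones above, which calls df.columns, would fail if it were placed after such a step without the labels being restored first.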

One more thing: have a look at sklearn-pandas. I haven't used it myself, but it might provide a solution for you.
