Getting feature names from within a FeatureUnion + Pipeline
Question
I am using a FeatureUnion to join features extracted from the title and description of events:
union = FeatureUnion(
    transformer_list=[
        # Pipeline for pulling features from the event's title
        ('title', Pipeline([
            ('selector', TextSelector(key='title')),
            ('count', CountVectorizer(stop_words='english')),
        ])),
        # Pipeline for standard bag-of-words model for description
        ('description', Pipeline([
            ('selector', TextSelector(key='description_snippet')),
            ('count', TfidfVectorizer(stop_words='english')),
        ])),
    ],
    transformer_weights={
        'title': 1.0,
        'description': 0.2
    },
)
However, calling union.get_feature_names() gives me an error: "Transformer title (type Pipeline) does not provide get_feature_names." I'd like to see some of the features that are generated by my different vectorizers. How do I do this?
Recommended answer
It's because you are using a custom transformer called TextSelector. Did you implement get_feature_names in TextSelector?
You will have to implement this method in your custom transformer if you want this to work.
Here is a concrete example:
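# Note: load_boston and get_feature_names were removed in scikit-learn 1.2,
# so this example needs an older release.
from sklearn.datasets import load_boston
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

dat = load_boston()
X = pd.DataFrame(dat['data'], columns=dat['feature_names'])
y = dat['target']

# First custom transformer: passes the DataFrame through unchanged
class first_transform(TransformerMixin, BaseEstimator):
    def fit(self, df, y=None):
        # Remember the column names so get_feature_names can report them
        self.columns_ = df.columns.tolist()
        return self

    def transform(self, df):
        return df

    def get_feature_names(self):
        return self.columns_

# Second custom transformer, identical to the first
class second_transform(TransformerMixin, BaseEstimator):
    def fit(self, df, y=None):
        self.columns_ = df.columns.tolist()
        return self

    def transform(self, df):
        return df

    def get_feature_names(self):
        return self.columns_

pipe = Pipeline([
    ('features', FeatureUnion([
        ('custom_transform_first', first_transform()),
        ('custom_transform_second', second_transform())
    ]))
])
pipe.fit(X, y)  # the column names are only known after fitting

>>> pipe.named_steps['features'].get_feature_names()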
['custom_transform_first__CRIM',
'custom_transform_first__ZN',
'custom_transform_first__INDUS',
'custom_transform_first__CHAS',
'custom_transform_first__NOX',
'custom_transform_first__RM',
'custom_transform_first__AGE',
'custom_transform_first__DIS',
'custom_transform_first__RAD',
'custom_transform_first__TAX',
'custom_transform_first__PTRATIO',
'custom_transform_first__B',
'custom_transform_first__LSTAT',
'custom_transform_second__CRIM',
'custom_transform_second__ZN',
'custom_transform_second__INDUS',
'custom_transform_second__CHAS',
'custom_transform_second__NOX',
'custom_transform_second__RM',
'custom_transform_second__AGE',
'custom_transform_second__DIS',
'custom_transform_second__RAD',
'custom_transform_second__TAX',
'custom_transform_second__PTRATIO',
'custom_transform_second__B',
'custom_transform_second__LSTAT']
Keep in mind that FeatureUnion concatenates the lists returned by get_feature_names from each of your transformers, prefixing every name with the transformer's name. This is why you get an error when one or more of your transformers do not have this method.
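To make that behaviour concrete, here is a simplified imitation of the concatenation in plain Python. It is not the library's actual source: FakeVectorizer, FakePipeline and union_feature_names are hypothetical names invented for this sketch, and the error text is copied from the message quoted in the question.

```python
class FakeVectorizer:
    """Hypothetical transformer that can report its feature names."""

    def __init__(self, names):
        self.names = names

    def get_feature_names(self):
        return self.names


class FakePipeline:
    """Hypothetical stand-in for a Pipeline: it has no get_feature_names."""


def union_feature_names(transformer_list):
    """Concatenate prefixed feature names from (name, transformer) pairs."""
    feature_names = []
    for name, trans in transformer_list:
        if not hasattr(trans, "get_feature_names"):
            # One transformer without the method is enough to fail
            raise AttributeError(
                "Transformer %s (type %s) does not provide get_feature_names."
                % (name, type(trans).__name__)
            )
        feature_names.extend(name + "__" + f for f in trans.get_feature_names())
    return feature_names


ok = union_feature_names([
    ("title", FakeVectorizer(["jazz", "fair"])),
    ("description", FakeVectorizer(["live", "rare"])),
])
print(ok)

try:
    union_feature_names([("title", FakePipeline())])
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

The second call fails for the same reason as the union in the question: the object registered under 'title' is a pipeline-like wrapper, not the vectorizer inside it.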
However, this alone will not fix your problem, because Pipeline objects don't have a get_feature_names method, and you have nested pipelines (pipelines within a FeatureUnion). So you have two options:
- Subclass Pipeline and add a get_feature_names method yourself that returns the feature names from the last transformer in the chain.
- Extract the feature names yourself from each transformer, which means pulling those transformers out of the pipeline and calling get_feature_names on them.
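Both options can be sketched roughly as follows. This is only an illustration under several assumptions: the TextSelector here is a hypothetical stand-in that picks one field from a list of dicts (the question does not show its implementation), NamedPipeline, last_step_names and collect_feature_names are names made up for the sketch, and the sample events are invented. Newer scikit-learn releases replaced get_feature_names with get_feature_names_out, so the helper tries that first.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class TextSelector(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for the question's TextSelector."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # X is assumed to be a list of dicts; return one text field per row
        return [row[self.key] for row in X]


def last_step_names(pipeline):
    """Feature names reported by the final step of a fitted pipeline."""
    last_step = pipeline.steps[-1][1]
    # Newer releases expose get_feature_names_out instead of get_feature_names
    if hasattr(last_step, "get_feature_names_out"):
        return list(last_step.get_feature_names_out())
    return list(last_step.get_feature_names())


class NamedPipeline(Pipeline):
    """Option 1: a Pipeline subclass that delegates to its last step."""

    def get_feature_names(self):
        return last_step_names(self)


def collect_feature_names(union):
    """Option 2: pull each sub-pipeline out of the union and ask it directly."""
    names = []
    for name, pipeline in union.transformer_list:
        names.extend(name + "__" + f for f in last_step_names(pipeline))
    return names


events = [
    {"title": "Jazz concert", "description_snippet": "Live jazz downtown"},
    {"title": "Book fair", "description_snippet": "Rare books and maps"},
]

union = FeatureUnion(
    transformer_list=[
        ("title", NamedPipeline([
            ("selector", TextSelector(key="title")),
            ("count", CountVectorizer(stop_words="english")),
        ])),
        ("description", NamedPipeline([
            ("selector", TextSelector(key="description_snippet")),
            ("count", TfidfVectorizer(stop_words="english")),
        ])),
    ],
    transformer_weights={"title": 1.0, "description": 0.2},
)
features = union.fit_transform(events)

title_names = union.transformer_list[0][1].get_feature_names()  # option 1
names = collect_feature_names(union)                            # option 2
print(names)
```

On scikit-learn releases that still have FeatureUnion.get_feature_names, option 1 should be enough for union.get_feature_names() to work directly, since every sub-pipeline then provides the method. Option 2 does not depend on the union offering that method at all.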
Also, keep in mind that many built-in sklearn transformers don't operate on DataFrames but pass numpy arrays around, so watch out for that if you are going to chain lots of transformers together. Still, I think this gives you enough information to understand what is happening.
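As a small illustration of that caveat, the sketch below runs StandardScaler on a DataFrame with default settings. The column names "a" and "b" and the values are made up; the point is only that the result comes back as a plain numpy array, so the names have to be carried along separately if you want them.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# With default settings the scaler returns a numpy array, not a DataFrame
scaled = StandardScaler().fit_transform(df)
print(type(scaled))

# Keep the column names yourself and reattach them afterwards
restored = pd.DataFrame(scaled, columns=df.columns, index=df.index)
print(restored.columns.tolist())
```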
One more thing: have a look at sklearn-pandas. I haven't used it myself, but it might provide a solution for you.