How to FeatureUnion numerical and text features in Python sklearn properly

Problem description

I'm trying to use FeatureUnion for the first time in an sklearn pipeline to combine numerical features (2 columns) and text features (1 column) for multi-class classification.

from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer  # needed for CountVectorizer below

get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[['num1','num2']], validate=False)

process_and_join_features = FeatureUnion(
    [
        ('numeric_features', Pipeline([
            ('selector', get_numeric_data),
            ('clf', OneVsRestClassifier(LogisticRegression()))
        ])),
        ('text_features', Pipeline([
            ('selector', get_text_data),
            ('vec', CountVectorizer()),
            ('clf', OneVsRestClassifier(LogisticRegression()))
        ]))
    ]
)

In this code, 'text' is the text column and 'num1' and 'num2' are the two numeric columns.

The error message is:

TypeError: All estimators should implement fit and transform. 'Pipeline(memory=None,
 steps=[('selector', FunctionTransformer(accept_sparse=False,
      func=<function <lambda> at 0x7fefa8efd840>, inv_kw_args=None,
      inverse_func=None, kw_args=None, pass_y='deprecated',
      validate=False)), ('clf', OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weigh...=None, solver='liblinear', tol=0.0001,
      verbose=0, warm_start=False),
      n_jobs=1))])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't

What step am I missing?

Recommended answer

A FeatureUnion should be used as a step inside the pipeline, not as a wrapper around pipelines that end in a classifier. The error occurs because each inner pipeline has a classifier as its final step: the union calls fit and transform on every transformer it contains, and a classifier does not have a transform method.
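To make the cause of the error concrete, here is a minimal sketch that checks which objects expose a transform method. The variable names (ends_in_classifier, transformers_only, can_transform) are made up for illustration, and the behaviour shown assumes a reasonably recent scikit-learn release.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

clf = OneVsRestClassifier(LogisticRegression())

# A pipeline whose last step is a classifier, as in the question.
ends_in_classifier = Pipeline([("vec", CountVectorizer()), ("clf", clf)])

# A pipeline that contains transformers only.
transformers_only = Pipeline([("vec", CountVectorizer())])

# FeatureUnion needs every member to offer both fit and transform.
can_transform = {
    "classifier": hasattr(clf, "transform"),
    "pipeline ending in classifier": hasattr(ends_in_classifier, "transform"),
    "pipeline of transformers only": hasattr(transformers_only, "transform"),
}
print(can_transform)
```

A pipeline only offers transform when its final step does, which is why removing the classifier from the inner pipelines is enough to satisfy the union.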

Simply rework it into an outer pipeline with the classifier as the final step:

process_and_join_features = Pipeline([
    ('features', FeatureUnion([
        ('numeric_features', Pipeline([
            ('selector', get_numeric_data)
        ])),
        ('text_features', Pipeline([
            ('selector', get_text_data),
            ('vec', CountVectorizer())
        ]))
    ])),
    ('clf', OneVsRestClassifier(LogisticRegression()))
])
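For anyone who wants to try the reworked pipeline end to end, the following is a small self-contained sketch. The DataFrame contents and labels are invented toy data (only the column names 'text', 'num1' and 'num2' come from the question), it assumes pandas is installed, and the name model is just an illustrative stand-in for process_and_join_features.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    "text": ["red apple", "green apple", "fast car",
             "slow car", "blue sky", "grey sky"],
    "num1": [1.0, 1.2, 5.0, 5.5, 9.0, 9.3],
    "num2": [0.1, 0.2, 3.0, 3.1, 7.0, 7.2],
})
y = ["fruit", "fruit", "vehicle", "vehicle", "weather", "weather"]

# Column selectors, as defined in the question.
get_text_data = FunctionTransformer(lambda x: x["text"], validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[["num1", "num2"]], validate=False)

model = Pipeline([
    # The union holds transformers only and stacks their outputs side by side.
    ("features", FeatureUnion([
        ("numeric_features", Pipeline([("selector", get_numeric_data)])),
        ("text_features", Pipeline([
            ("selector", get_text_data),
            ("vec", CountVectorizer()),
        ])),
    ])),
    # The single classifier comes last, after all features are joined.
    ("clf", OneVsRestClassifier(LogisticRegression())),
])

model.fit(df, y)

# Inspect the joined feature matrix produced by the fitted union.
features = model.named_steps["features"].transform(df)
predictions = model.predict(df)
print(features.shape)
```

With this toy data the joined matrix has 6 rows and 11 columns: the 2 numeric columns plus 9 distinct tokens found by CountVectorizer. In practice the numeric branch would often also get a scaling step before the classifier, but that is left out here to stay close to the answer's code.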

The scikit-learn website also has a good example of this sort of thing.
