How to properly use FeatureUnion on numerical and text features in Python sklearn
Problem description
I'm trying to use FeatureUnion for the first time in an sklearn pipeline to combine numerical features (2 columns) and a text feature (1 column) for multi-class classification.
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer  # missing from the original snippet

# Column selectors: pull the text column and the two numeric columns out of the DataFrame
get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[['num1', 'num2']], validate=False)

# Each sub-pipeline ends in a classifier; this is the arrangement that raises the TypeError
process_and_join_features = FeatureUnion(
    [
        ('numeric_features', Pipeline([
            ('selector', get_numeric_data),
            ('clf', OneVsRestClassifier(LogisticRegression()))
        ])),
        ('text_features', Pipeline([
            ('selector', get_text_data),
            ('vec', CountVectorizer()),
            ('clf', OneVsRestClassifier(LogisticRegression()))
        ]))
    ]
)
In this code, 'text' is the text column and 'num1' and 'num2' are the two numeric columns.
The error message is:
TypeError: All estimators should implement fit and transform. 'Pipeline(memory=None,
steps=[('selector', FunctionTransformer(accept_sparse=False,
func=<function <lambda> at 0x7fefa8efd840>, inv_kw_args=None,
inverse_func=None, kw_args=None, pass_y='deprecated',
validate=False)), ('clf', OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weigh...=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False),
n_jobs=1))])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't
What step am I missing?
Recommended answer
A FeatureUnion should be used as a step in the pipeline, not around the pipeline. The error occurs because you have a classifier that is not the final step: the union tries to call fit and transform on all of its transformers, and a classifier does not have a transform method.
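A quick way to see this is to check which objects expose a transform method. The sketch below is illustrative only and assumes a recent scikit-learn release; the variable names are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

vectorizer = CountVectorizer()
classifier = OneVsRestClassifier(LogisticRegression())

# A vectorizer is a transformer, so it exposes transform
vectorizer_has_transform = hasattr(vectorizer, "transform")

# A classifier exposes fit and predict, but not transform
classifier_has_transform = hasattr(classifier, "transform")

# A Pipeline only exposes transform when its final step does,
# so a pipeline ending in a classifier cannot sit inside a FeatureUnion
pipeline_ending_in_classifier = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])
pipeline_has_transform = hasattr(pipeline_ending_in_classifier, "transform")

print(vectorizer_has_transform, classifier_has_transform, pipeline_has_transform)
```

This is why each sub-pipeline passed to FeatureUnion has to end in a transformer rather than a classifier.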
Simply rework the code to have an outer pipeline with the classifier as the final step:
process_and_join_features = Pipeline([
    ('features', FeatureUnion([
        ('numeric_features', Pipeline([
            ('selector', get_numeric_data)
        ])),
        ('text_features', Pipeline([
            ('selector', get_text_data),
            ('vec', CountVectorizer())
        ]))
    ])),
    ('clf', OneVsRestClassifier(LogisticRegression()))
])
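For completeness, here is a minimal end-to-end sketch of the reworked pipeline being fitted. The DataFrame contents, the labels, and the helper names select_text and select_numeric are hypothetical and only serve to make the example self-contained; the column names 'text', 'num1' and 'num2' follow the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data: one text column, two numeric columns, three classes
df = pd.DataFrame({
    "text": ["red apple", "green apple", "fast car",
             "slow car", "blue sky", "grey sky"],
    "num1": [1.0, 1.2, 5.0, 5.5, 9.0, 9.3],
    "num2": [0.1, 0.2, 0.9, 0.8, 0.4, 0.5],
})
y = ["fruit", "fruit", "vehicle", "vehicle", "weather", "weather"]


def select_text(frame):
    """Return the text column as a 1-D sequence of strings."""
    return frame["text"]


def select_numeric(frame):
    """Return the two numeric columns as a 2-D table."""
    return frame[["num1", "num2"]]


get_text_data = FunctionTransformer(select_text, validate=False)
get_numeric_data = FunctionTransformer(select_numeric, validate=False)

# The FeatureUnion is a step inside the outer pipeline; the classifier comes last
model = Pipeline([
    ("features", FeatureUnion([
        ("numeric_features", Pipeline([("selector", get_numeric_data)])),
        ("text_features", Pipeline([
            ("selector", get_text_data),
            ("vec", CountVectorizer()),
        ])),
    ])),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])

model.fit(df, y)
predictions = model.predict(df)

# The union stacks the 2 numeric columns next to the bag-of-words columns
features = model.named_steps["features"].transform(df)
print(features.shape)
```

Named functions are used here instead of lambdas only so that the fitted pipeline can be pickled; the lambdas from the question behave the same way during fit and predict.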
The scikit-learn website also has a good example of this sort of thing.