Sklearn custom transformers with pipeline: all the input array dimensions for the concatenation axis must match exactly
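Problem description
I am learning about sklearn custom transformers and have read about the two core ways of creating them:
- by writing a custom class that inherits from BaseEstimator and TransformerMixin, or
- by writing a transformation function and passing it to FunctionTransformer.
I wanted to compare the two approaches by implementing a "meta-vectorizer": a vectorizer that supports either CountVectorizer or TfidfVectorizer and transforms the input data according to the specified vectorizer type.
However, I can't get either of them to work when I pass them to sklearn.pipeline.Pipeline. I get the following error message in the fit_transform() step: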
ValueError: all the input array dimensions for the concatenation axis must match
exactly, but along dimension 0, the array at index 0 has size 6 and the array
at index 1 has size 1
My code for option 1 (using a custom class):
class Vectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vectorizer:Callable=CountVectorizer(), ngram_range:tuple=(1,1)) -> None:
        super().__init__()
        self.vectorizer = vectorizer
        self.ngram_range = ngram_range

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_vect_ = self.vectorizer.fit_transform(X.copy())
        return X_vect_.toarray()

pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(), ['Type']),
        ('comment_text_vectorizer', Vectorizer(), ['Text'])],
        remainder='drop')),
    ('model', LogisticRegression())])

param_dict = {'column_transformer__comment_text_vectorizer__vectorizer':
    [CountVectorizer(), TfidfVectorizer()]
}
randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1',).fit(X_train, y_train)
And my code for option 2 (creating the custom transformer from a function with FunctionTransformer):
def vectorize_text(X, vectorizer: Callable):
    X_vect_ = vectorizer.fit_transform(X)
    return X_vect_.toarray()

vectorizer_transformer = FunctionTransformer(vectorize_text, kw_args={'vectorizer': TfidfVectorizer()})

pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(), ['Type']),
        ('comment_text_vectorizer', vectorizer_transformer, ['Text'])],
        remainder='drop')),
    ('model', LogisticRegression())])

param_dict = {'column_transformer__comment_text_vectorizer__kw_args':
    [{'vectorizer':CountVectorizer()}, {'vectorizer': TfidfVectorizer()}]
}
randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1').fit(X_train, y_train)
Imports and sample data:
import pandas as pd
from typing import Callable
import sklearn
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV

df = pd.DataFrame([
    ['A99', 'hi i love python very much', 'c', 1],
    ['B07', 'which programming language should i learn', 'b', 0],
    ['A12', 'what is the difference between python django flask', 'b', 1],
    ['A21', 'i want to be a programmer one day', 'c', 0],
    ['B11', 'should i learn java or python', 'b', 1],
    ['C01', 'how much can i earn as a programmer with python', 'a', 0]
], columns=['Src', 'Text', 'Type', 'Target'])
Notes:
- As suggested in this question, I convert all sparse matrices to dense arrays after vectorization, as you can see in both cases: X_vect_.toarray().
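Recommended answer
The problem is that both CountVectorizer and TfidfVectorizer require their input to be 1-D (not 2-D). For such cases, the ColumnTransformer documentation states that the columns element of each tuple in transformers should be passed as a string rather than as a list: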
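columns : str, array-like of str, int, array-like of int, array-like of bool, slice or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.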
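The effect of the scalar-versus-list distinction can be seen in a small sketch. It reuses three rows of the question's sample data and calls CountVectorizer directly on a pandas Series and on a one-column DataFrame, which is roughly what ColumnTransformer hands to the transformer in each case. The variable names (as_series, as_frame, shape_1d, shape_2d) are illustrative and do not come from the original post.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({
    'Text': ['hi i love python very much',
             'should i learn java or python',
             'i want to be a programmer one day'],
    'Type': ['c', 'b', 'c'],
})

# A scalar column name selects a 1-D Series: one document per row.
as_series = df['Text']
# A list of names selects a 2-D DataFrame, even with a single column.
as_frame = df[['Text']]

# Iterating a Series yields its values, i.e. the documents themselves.
shape_1d = CountVectorizer().fit_transform(as_series).shape

# Iterating a DataFrame yields its column labels, so the vectorizer
# sees only one "document": the column name 'Text'.
shape_2d = CountVectorizer().fit_transform(as_frame).shape

print("rows from Series:   ", shape_1d[0])
print("rows from DataFrame:", shape_2d[0])
```

With the 2-D input the vectorizer returns a single row no matter how many rows the frame has. That is why the error message reports an array of size 1 next to the one-hot block of size 6: ColumnTransformer cannot stack the two outputs side by side.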
Therefore, the following works for your case (i.e. changing ['Text'] to 'Text'):
class Vectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vectorizer:Callable=CountVectorizer(), ngram_range:tuple=(1,1)) -> None:
        super().__init__()
        self.vectorizer = vectorizer
        self.ngram_range = ngram_range

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_vect_ = self.vectorizer.fit_transform(X.copy())
        return X_vect_.toarray()

pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(handle_unknown='ignore'), ['Type']),
        ('comment_text_vectorizer', Vectorizer(), 'Text')], remainder='drop')),
    ('model', LogisticRegression())])

param_dict = {'column_transformer__comment_text_vectorizer__vectorizer': [CountVectorizer(), TfidfVectorizer()]
}
randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1',).fit(X_train, y_train)
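One caveat about the class above: transform() calls fit_transform() on the wrapped vectorizer, so the vocabulary is re-learned every time data passes through, including on the validation folds, and ngram_range is stored but never used. The sketch below is a possible variant, not part of the original answer, that learns the vocabulary in fit() and only applies it in transform(). The class name FittedVectorizer and the train/test Series are hypothetical, introduced here for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


class FittedVectorizer(BaseEstimator, TransformerMixin):
    """Hypothetical variant: learn the vocabulary in fit(), reuse it in transform()."""

    def __init__(self, vectorizer=None, ngram_range=(1, 1)):
        # Store constructor arguments unchanged, as sklearn estimators are expected to.
        self.vectorizer = vectorizer
        self.ngram_range = ngram_range

    def fit(self, X, y=None):
        base = CountVectorizer() if self.vectorizer is None else self.vectorizer
        # Clone so the object supplied by the caller is never mutated,
        # and forward ngram_range to the wrapped vectorizer.
        self.vectorizer_ = clone(base).set_params(ngram_range=self.ngram_range)
        self.vectorizer_.fit(X)
        return self

    def transform(self, X):
        # Only apply the vocabulary learned in fit(); do not re-fit here.
        return self.vectorizer_.transform(X).toarray()


train = pd.Series(['hi i love python very much', 'should i learn java or python'])
test = pd.Series(['i want to be a programmer one day'])

vec = FittedVectorizer(TfidfVectorizer()).fit(train)
train_features = vec.transform(train)
test_features = vec.transform(test)

# Both arrays have the same number of columns because they share one vocabulary.
print(train_features.shape[1] == test_features.shape[1])
```

Used in place of Vectorizer() in the pipeline, with 'Text' as the column selector, this variant should give the model the same number of features at fit and predict time. The grid-search parameter name for the vectorizer stays the same.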
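The FunctionTransformer version from the question can be adapted in the same way (pass 'Text' instead of ['Text']). Finally, note that I had to pass handle_unknown='ignore' to OneHotEncoder to prevent errors when a category that was not seen during training shows up in the test phase of cross-validation.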
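The handle_unknown behaviour can be checked in isolation with a small sketch. The train and test frames below are made up for illustration and mimic a cross-validation split in which the category 'a' only occurs in the held-out fold, as can happen with the question's sample data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Type': ['b', 'c', 'b']})
test = pd.DataFrame({'Type': ['a']})  # 'a' never appears in train

# Default behaviour (handle_unknown='error'): unseen categories raise at transform time.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    strict_failed = False
except ValueError:
    strict_failed = True

# With handle_unknown='ignore', an unseen category is encoded as an all-zero row.
lenient = OneHotEncoder(handle_unknown='ignore').fit(train)
encoded = lenient.transform(test).toarray()

print("default encoder raised:", strict_failed)
print("ignored encoding:", encoded.tolist())
```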