Sklearn custom transformers with pipeline: all the input array dimensions for the concatenation axis must match exactly

Problem description

I am learning about sklearn custom transformers and have read about the two core ways to create them:

  1. by setting up a custom class that inherits from BaseEstimator and TransformerMixin, or
  2. by writing a transformation function and passing it to FunctionTransformer.
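
As a minimal sketch of how the two approaches compare, the following applies the same trivial transformation both ways. The names TextLength and text_length are hypothetical and only serve this illustration; they are not part of the code in question.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer


class TextLength(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: maps each string to its character count."""

    def fit(self, X, y=None):
        # Nothing to learn here, so fit simply returns self.
        return self

    def transform(self, X):
        return np.array([[len(text)] for text in X])


def text_length(X):
    # The same logic as a plain function, to be wrapped by FunctionTransformer.
    return np.array([[len(text)] for text in X])


docs = ["hi i love python", "java or python"]

class_based = TextLength().fit_transform(docs)
function_based = FunctionTransformer(text_length).fit_transform(docs)

# Both approaches produce the same (n_samples, 1) array.
print(np.array_equal(class_based, function_based))
```

The class-based form is the natural choice when the transformer has to learn something in fit(); FunctionTransformer is convenient for stateless functions.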

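I wanted to compare the two approaches by implementing a "meta-vectorizer": a vectorizer that supports either CountVectorizer or TfidfVectorizer and transforms the input data according to the specified vectorizer type.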

However, I can't get either of the two to work when I pass them to a sklearn.pipeline.Pipeline. I get the following error message in the fit_transform() step:

ValueError: all the input array dimensions for the concatenation axis must match 
exactly, but along dimension 0, the array at index 0 has size 6 and the array 
at index 1 has size 1

My code for option 1 (using a custom class):

class Vectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vectorizer:Callable=CountVectorizer(), ngram_range:tuple=(1,1)) -> None:
        super().__init__()
        self.vectorizer = vectorizer
        self.ngram_range = ngram_range
    def fit(self, X, y=None):
        return self 
    def transform(self, X, y=None):
        X_vect_ = self.vectorizer.fit_transform(X.copy())
        return X_vect_.toarray()

pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(), ['Type']),
        ('comment_text_vectorizer', Vectorizer(), ['Text'])],
        remainder='drop')),
    ('model', LogisticRegression())])

param_dict = {'column_transformer__comment_text_vectorizer__vectorizer': 
[CountVectorizer(), TfidfVectorizer()]
}

randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1',).fit(X_train, y_train)

And my code for option 2 (creating a custom transformer from a function using FunctionTransformer):

def vectorize_text(X, vectorizer: Callable):
    X_vect_ = vectorizer.fit_transform(X)
    return X_vect_.toarray()

vectorizer_transformer = FunctionTransformer(vectorize_text, kw_args={'vectorizer': TfidfVectorizer()})

pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(), ['Type']),
        ('comment_text_vectorizer', vectorizer_transformer, ['Text'])],
        remainder='drop')),
    ('model', LogisticRegression())])

param_dict = {'column_transformer__comment_text_vectorizer__kw_args': 
    [{'vectorizer':CountVectorizer()}, {'vectorizer': TfidfVectorizer()}]
}

randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1').fit(X_train, y_train)

Imports and sample data:

import pandas as pd 
from typing import Callable
import sklearn
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV

df = pd.DataFrame([
    ['A99', 'hi i love python very much', 'c', 1],
    ['B07', 'which programming language should i learn', 'b', 0],
    ['A12', 'what is the difference between python django flask', 'b', 1],
    ['A21', 'i want to be a programmer one day', 'c', 0],
    ['B11', 'should i learn java or python', 'b', 1],
    ['C01', 'how much can i earn as a programmer with python', 'a', 0]
], columns=['Src', 'Text', 'Type', 'Target'])

Notes:

  • As suggested in this question, I convert all sparse matrices to dense arrays after vectorization, as you can see in both cases: X_vect_.toarray()

Recommended answer

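The issue is that both CountVectorizer and TfidfVectorizer require their input to be 1D (not 2D). For such cases, the ColumnTransformer documentation states that the columns entry of each tuple in transformers should be passed as a string rather than as a list.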

columns : str, array-like of str, int, array-like of int, array-like of bool, slice or callable

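Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.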
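
To make the 1D versus 2D distinction concrete, here is a small sketch using the sample data from the question. It calls CountVectorizer directly on a Series (what a scalar column name selects) and on a one-column DataFrame (what a list of column names selects). The variable names rows_1d and rows_2d are introduced only for this illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({
    "Text": [
        "hi i love python very much",
        "which programming language should i learn",
        "what is the difference between python django flask",
        "i want to be a programmer one day",
        "should i learn java or python",
        "how much can i earn as a programmer with python",
    ],
})

# 1-D selection: the vectorizer sees one document per row.
rows_1d = CountVectorizer().fit_transform(df["Text"]).shape[0]

# 2-D selection: iterating over a DataFrame yields its column labels,
# so the vectorizer sees a single "document", the string "Text".
rows_2d = CountVectorizer().fit_transform(df[["Text"]]).shape[0]

print(rows_1d, rows_2d)  # 6 1
```

The one-hot encoded block has 6 rows while the vectorized block has only 1, which is the size mismatch reported in the error message when ColumnTransformer tries to stack them side by side.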

Therefore, the following will work in your case (i.e. changing ['Text'] to 'Text').

class Vectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vectorizer:Callable=CountVectorizer(), ngram_range:tuple=(1,1)) -> None:
        super().__init__()
        self.vectorizer = vectorizer
        self.ngram_range = ngram_range
    def fit(self, X, y=None):
        return self 
    def transform(self, X, y=None):
        X_vect_ = self.vectorizer.fit_transform(X.copy())
        return X_vect_.toarray()

pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(handle_unknown='ignore'), ['Type']),
        ('comment_text_vectorizer', Vectorizer(), 'Text')], remainder='drop')),
    ('model', LogisticRegression())])

param_dict = {'column_transformer__comment_text_vectorizer__vectorizer': [CountVectorizer(), TfidfVectorizer()]
}

randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1',).fit(X_train, y_train)
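
One thing the code above leaves unchanged is that transform() calls fit_transform() on the wrapped vectorizer, so the vocabulary is relearned on whatever data is passed in, including validation folds. A possible variation, which is not part of the code above and uses the hypothetical name FittedVectorizer, learns the vocabulary in fit() and only applies it in transform():

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import CountVectorizer


class FittedVectorizer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: fits in fit(), only transforms in transform()."""

    def __init__(self, vectorizer=None):
        self.vectorizer = vectorizer

    def fit(self, X, y=None):
        base = CountVectorizer() if self.vectorizer is None else self.vectorizer
        # Fit a clone so the object passed to __init__ is left untouched.
        self.vectorizer_ = clone(base).fit(X)
        return self

    def transform(self, X):
        # Reuse the vocabulary learned in fit(); do not refit here.
        return self.vectorizer_.transform(X).toarray()


train = pd.Series(["hi i love python very much", "should i learn java or python"])
test = pd.Series(["i want to be a programmer one day"])

vec = FittedVectorizer().fit(train)
train_features = vec.transform(train)
test_features = vec.transform(test)

# Both arrays share the same number of columns (the training vocabulary).
print(train_features.shape[1] == test_features.shape[1])
```

With this design the number of feature columns stays the same between training and prediction, which a downstream model such as LogisticRegression relies on.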

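The FunctionTransformer example can be adapted in the same way, by passing 'Text' instead of ['Text']. As a final remark, note that I had to pass handle_unknown='ignore' to OneHotEncoder to prevent errors when a category that was not seen during the training phase of cross-validation appears during the test phase.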
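
The effect of handle_unknown can be sketched in isolation. In the sample data the category 'a' occurs in a single row, so with cv=2 there is one split in which it appears only in the validation fold. The tiny train and test frames below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Type": ["c", "b", "b"]})
test = pd.DataFrame({"Type": ["a"]})  # category never seen during fit

strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    strict_failed = False
except ValueError:
    # The default handle_unknown='error' rejects the unseen category.
    strict_failed = True

lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
# The unseen category is encoded as an all-zero row.
encoded = lenient.transform(test).toarray()

print(strict_failed, encoded.tolist())
```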
