Apply multiple preprocessing steps to a column in sklearn pipeline


Problem description

I am trying an sklearn pipeline for the first time, using the Titanic dataset. I want to first impute the missing values in Embarked and then one-hot encode it. For the Sex attribute, I only want one-hot encoding. So I have the steps below, two of which apply to Embarked. But it does not work as expected: the Embarked column remains in the output (the column containing 'S') in addition to its one-hot encoding.

If I do the imputation and the one-hot encoding for Embarked in a single step, it works as expected.

What is the reason behind this, or am I doing something wrong? I could not find any information about it.

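import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Step for Embarked only: fill missing values with the constant 'S'.
categorical_cols_impute = ['Embarked']
categorical_impute = Pipeline([
    ("mode_impute", SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='S')),
#     ("one_hot", OneHotEncoder(sparse_output=False))
])
# Step for Embarked and Sex: one-hot encode.
# The original post used sparse=False, which scikit-learn 1.2 renamed to sparse_output=False.
categorical_cols = ['Embarked', 'Sex']
categorical_one_hot = Pipeline([
    ("one_hot", OneHotEncoder(sparse_output=False))
])
# Both entries receive the original columns; neither sees the other's output.
preprocesor = ColumnTransformer([
    ("cat_impute", categorical_impute, categorical_cols_impute),
    ("cat_one_hot", categorical_one_hot, categorical_cols)
], remainder="passthrough")
pipe = Pipeline([
    ("preprocessor", preprocesor),
#     ("model", RandomForestClassifier(random_state=0))
])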

Solution

The transformers in a ColumnTransformer are applied in parallel, not sequentially. So in your example, Embarked ends up in the transformed data twice: once from the first transformer, imputed but still a string column, and again from the second transformer, this time one-hot encoded but not imputed first, because the second transformer receives the original column rather than the first transformer's output.
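To make the parallel behaviour concrete, here is a minimal sketch on a made-up four-row frame. The toy data and the variable names (`df`, `parallel_ct`, `out`) are assumptions for illustration, not the asker's Titanic data. It also assumes a reasonably recent scikit-learn, in which OneHotEncoder treats NaN as a category of its own, and it passes `sparse_threshold=0` so that the ColumnTransformer returns a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the Titanic columns; one Embarked value is missing.
df = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "Q"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
}).astype({"Embarked": object, "Sex": object})  # keep plain object columns

# Two entries that both list Embarked: each one gets the ORIGINAL column.
parallel_ct = ColumnTransformer(
    [
        ("cat_impute", SimpleImputer(strategy="constant", fill_value="S"), ["Embarked"]),
        ("cat_one_hot", OneHotEncoder(), ["Embarked", "Sex"]),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

out = parallel_ct.fit_transform(df)

# Column 0 is the imputed, still string-valued Embarked from the first entry.
# The following columns are the one-hot encoding from the second entry, which
# was fitted on the un-imputed column and so includes a category for NaN.
print(out.shape)
print(out[:, 0])
```

The outputs of the two entries are simply stacked side by side, which is why the string column survives next to its one-hot encoding.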

So just uncomment the second step in the Embarked pipeline (categorical_impute) and remove Embarked from categorical_cols.
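A sketch of that fix follows, again on a made-up four-row frame rather than the real Titanic data. The names `embarked_pipe`, `fixed_ct` and `fixed` are hypothetical, and `sparse_threshold=0` is used here only to get a dense result without depending on the name of the encoder's sparse flag, which differs between scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the Titanic columns; one Embarked value is missing.
df = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "Q"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
}).astype({"Embarked": object, "Sex": object})  # keep plain object columns

# Sequential steps for one column belong in a single Pipeline:
# impute first, then one-hot encode the imputed values.
embarked_pipe = Pipeline([
    ("mode_impute", SimpleImputer(strategy="constant", fill_value="S")),
    ("one_hot", OneHotEncoder()),
])

# Each column now appears in exactly one ColumnTransformer entry.
fixed_ct = ColumnTransformer(
    [
        ("embarked", embarked_pipe, ["Embarked"]),
        ("sex", OneHotEncoder(), ["Sex"]),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

fixed = fixed_ct.fit_transform(df)

# Columns: Embarked_C, Embarked_Q, Embarked_S, Sex_female, Sex_male, Age.
print(fixed.shape)
print(fixed[2])  # the row whose Embarked was missing is encoded as 'S'
```

Putting both steps in one Pipeline is what makes them sequential; the ColumnTransformer itself only decides which columns go to which entry.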

See also the related question "Consistent ColumnTransformer for intersecting lists of columns" (though I don't think it is quite a duplicate).
