Apply multiple preprocessing steps to a column in sklearn pipeline
Problem description
I am trying sklearn pipelines for the first time, using the Titanic dataset. I want to first impute the missing values in Embarked and then one-hot encode it. For the Sex attribute, I only want one-hot encoding. So I have the steps below, two of which apply to Embarked. But this is not working as expected: the original Embarked column remains in the output in addition to its one-hot encoding (the column containing 'S').
If I do the imputation and the one-hot encoding of Embarked in a single step, it works as expected.
What is the reason behind this, or am I doing something wrong? I couldn't find any information related to this.
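import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# First step for Embarked: fill missing values with 'S'
categorical_cols_impute = ['Embarked']
categorical_impute = Pipeline([
    ("mode_impute", SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='S')),
    # ("one_hot", OneHotEncoder(sparse=False))
])

# Second step: one-hot encode Embarked and Sex
# (note: sparse=False was renamed to sparse_output=False in scikit-learn 1.2)
categorical_cols = ['Embarked', 'Sex']
categorical_one_hot = Pipeline([
    ("one_hot", OneHotEncoder(sparse=False))
])

preprocessor = ColumnTransformer([
    ("cat_impute", categorical_impute, categorical_cols_impute),
    ("cat_one_hot", categorical_one_hot, categorical_cols)
], remainder="passthrough")

pipe = Pipeline([
    ("preprocessor", preprocessor),
    # ("model", RandomForestClassifier(random_state=0))
])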
ColumnTransformer transformers are applied in parallel, not sequentially: each one receives the original input columns, and their outputs are concatenated side by side. So in your example, Embarked ends up in your transformed data twice: once from the first transformer, imputed but keeping its string type, and again from the second transformer, this time one-hot encoded, but computed from the original column and so not imputed first.
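To make the parallel behaviour concrete, here is a minimal sketch on a toy frame rather than the real Titanic data. The four-row DataFrame and the variable names are made up for illustration, and sparse_threshold=0 is only there to force a dense result so the columns are easy to inspect. It assumes a scikit-learn version in which OneHotEncoder accepts NaN as a category (0.24 or later, as far as I know).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the Titanic columns (hypothetical data)
df = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "Q"],
    "Sex": ["male", "female", "female", "male"],
})

# Both transformers see the ORIGINAL columns; nothing is chained
ct = ColumnTransformer(
    [
        ("cat_impute", SimpleImputer(strategy="constant", fill_value="S"), ["Embarked"]),
        ("cat_one_hot", OneHotEncoder(), ["Embarked", "Sex"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
out = ct.fit_transform(df)

# First output column: the imputed, still string-typed Embarked
print(out[:, 0])
# Remaining columns: one-hot encoding of the un-imputed Embarked, plus Sex
print(out.shape)
```

The first column is the imputed string column from cat_impute, and the rest come from cat_one_hot, which never saw the imputed values.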
So just uncomment the second step in the Embarked pipeline (categorical_impute), and remove Embarked from categorical_cols.
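A sketch of what that fix might look like, again on made-up toy data. The names embarked_pipe and the Age column are assumptions for illustration, and the encoder is left at its default sparse output with sparse_threshold=0 on the ColumnTransformer so the snippet does not depend on the sparse / sparse_output keyword.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the Titanic columns (hypothetical data)
df = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "Q"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Sequential steps for one column belong inside a single Pipeline
embarked_pipe = Pipeline([
    ("mode_impute", SimpleImputer(strategy="constant", fill_value="S")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])

# Each column appears in exactly one transformer
preprocessor = ColumnTransformer(
    [
        ("embarked", embarked_pipe, ["Embarked"]),
        ("sex", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
out = preprocessor.fit_transform(df)

# 3 Embarked categories (C, Q, S) + 2 Sex categories + Age
print(out.shape)
# The row with the missing Embarked is encoded as 'S'
print(out[2])
```

Here the imputer runs before the encoder because both live in the same Pipeline, and no raw Embarked column is left over in the output.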
See also Consistent ColumnTransformer for intersecting lists of columns (but I don't think it's quite a duplicate).