AttributeError when using ColumnTransformer in a pipeline

Views: 313 | Published: 2020/5/28 0:44:49 | Tags: python, pandas, scikit-learn, pipeline, transformer

Problem description

This is my first machine learning project and the first time that I use ColumnTransformer. My aim is to perform two steps of data preprocessing, and use ColumnTransformer for each of them.

In the first step, I want to replace the missing values in my dataframe with the string 'missing_value' for some features, and with the most frequent value for the remaining features. I therefore combine these two operations using ColumnTransformer, passing it the corresponding columns of my dataframe.

In the second step, I want to take the data that has just been preprocessed and apply OrdinalEncoder or OneHotEncoder, depending on the feature. For that I use ColumnTransformer again.

I then combine the two steps into a single pipeline.

I am using the Kaggle House Prices dataset with scikit-learn version 0.20. This is a simplified version of my code:

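from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

cat_columns_fill_miss = ['PoolQC', 'Alley']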
cat_columns_fill_freq = ['Street', 'MSZoning', 'LandContour']
cat_columns_ord = ['Street', 'Alley', 'PoolQC']
ord_mapping = [['Pave', 'Grvl'],                          # Street
               ['missing_value', 'Pave', 'Grvl'],         # Alley
               ['missing_value', 'Fa', 'TA', 'Gd', 'Ex']  # PoolQC
]
cat_columns_onehot = ['MSZoning', 'LandContour']


imputer_cat_pipeline = ColumnTransformer([
        ('imp_miss', SimpleImputer(strategy='constant'), cat_columns_fill_miss),  # fill_value='missing_value' by default
        ('imp_freq', SimpleImputer(strategy='most_frequent'), cat_columns_fill_freq),
])

encoder_cat_pipeline = ColumnTransformer([
        ('ordinal', OrdinalEncoder(categories=ord_mapping), cat_columns_ord),
        ('onehot', OneHotEncoder(), cat_columns_onehot),
])

cat_pipeline = Pipeline([
        ('imp_cat', imputer_cat_pipeline),
        ('cat_encoder', encoder_cat_pipeline),
])

Unfortunately, when I apply it to housing_cat, the subset of my dataframe including only categorical features,

cat_pipeline.fit_transform(housing_cat)

I get the error:

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

During handling of the above exception, another exception occurred:

...

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

I have tried this simplified pipeline and it works properly:

new_cat_pipeline = Pipeline([
        ('imp_cat', imputer_cat_pipeline),
        ('onehot', OneHotEncoder()),
])

However, if I try:

enc_one = ColumnTransformer([
        ('onehot', OneHotEncoder(), cat_columns_onehot),
        ('pass_ord', 'passthrough', cat_columns_ord)
])

new_cat_pipeline = Pipeline([
        ('imp_cat', imputer_cat_pipeline),
        ('onehot_encoder', enc_one),
])

I start to get the same error.

I therefore suspect that this error is related to the use of ColumnTransformer in the second step, but I do not understand where it comes from. I identify the columns in the second step the same way as in the first step, so it is unclear to me why I get the AttributeError only in the second step.

Recommended answer

Step 2 - one-hot encoding and categorical variables

pandas provides get_dummies, which returns a pandas DataFrame, unlike ColumnTransformer, which returns a NumPy array or sparse matrix. The code for this would be:

encoded = pd.get_dummies(dataframe[['MSZoning', 'LandContour']], drop_first=True)
# drop the original columns before joining their one-hot encoded versions
dataframe.drop(['MSZoning', 'LandContour'], axis='columns', inplace=True)
dataframe = dataframe.join(encoded)

For ordinal variables and their encoding, I would suggest looking at this SO answer (unfortunately, some manual mapping would be needed in this case).
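One possible form of such a manual mapping is sketched below. It is not taken from the SO answer mentioned above; it only assumes the PoolQC ordering from the question's ord_mapping, and the toy dataframe and the PoolQC_ord column name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); NaN marks a missing pool quality.
dataframe = pd.DataFrame({"PoolQC": [np.nan, "Gd", "Ex", np.nan, "Fa"]})

# Category order copied from the question's ord_mapping for PoolQC.
poolqc_order = ["missing_value", "Fa", "TA", "Gd", "Ex"]
poolqc_codes = {label: code for code, label in enumerate(poolqc_order)}

# Fill missing values first, then map each label to its position in the order.
dataframe["PoolQC_ord"] = (
    dataframe["PoolQC"].fillna("missing_value").map(poolqc_codes)
)
print(dataframe)
```

Because the result stays a DataFrame, the column names remain available to later steps.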

Get an np.array from the dataframe using the values attribute, pass it through the pipeline, and recreate the columns and index from the resulting array like this:

pd.DataFrame(data=your_array, index=np.arange(len(your_array)), columns=["A", "B"])
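The error in the question arises because the first ColumnTransformer hands a NumPy array to the second one, and an array has no column names to select by string. A minimal sketch of one way around this, assuming you are willing to address columns by integer position in the second step, is shown below. The toy housing_cat values are made up, and the positions follow the order in which the first ColumnTransformer lists its columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy stand-in for housing_cat (values are made up for illustration).
housing_cat = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan, "Ex"],
    "Alley": ["Grvl", np.nan, np.nan, "Pave"],
    "Street": ["Pave", "Pave", np.nan, "Grvl"],
    "MSZoning": ["RL", "RM", "RL", np.nan],
    "LandContour": ["Lvl", np.nan, "Lvl", "HLS"],
}, dtype=object)

# Step 1 can select by name because its input is still a DataFrame.
imputer = ColumnTransformer([
    ("imp_miss", SimpleImputer(strategy="constant"), ["PoolQC", "Alley"]),
    ("imp_freq", SimpleImputer(strategy="most_frequent"),
     ["Street", "MSZoning", "LandContour"]),
])

# Step 1 returns an ndarray whose columns follow the transformer order:
# 0=PoolQC, 1=Alley, 2=Street, 3=MSZoning, 4=LandContour
ord_mapping = [
    ["Pave", "Grvl"],                           # Street
    ["missing_value", "Pave", "Grvl"],          # Alley
    ["missing_value", "Fa", "TA", "Gd", "Ex"],  # PoolQC
]
encoder = ColumnTransformer([
    ("ordinal", OrdinalEncoder(categories=ord_mapping), [2, 1, 0]),
    ("onehot", OneHotEncoder(), [3, 4]),
], sparse_threshold=0)  # sparse_threshold=0 forces a dense result

cat_pipeline = Pipeline([
    ("imp_cat", imputer),
    ("cat_encoder", encoder),
])

encoded = cat_pipeline.fit_transform(housing_cat)
print(type(encoded).__name__, encoded.shape)
```

The integer positions are tied to the order of the lists in the first step, so they need to be updated whenever those lists change.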

There is one caveat to this approach, though: you will not know the names of the newly created one-hot-encoded columns (the pipeline will not do this for you).

Additionally, you could get the column names from sklearn's fitted transformer objects (e.g. using the categories_ attribute), but I think it would break the pipeline (someone correct me if I'm wrong).
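As a rough illustration of the categories_ idea, the sketch below fits a standalone OneHotEncoder outside any pipeline and builds column names from its categories. The toy data and the "column_category" naming scheme are assumptions made for this example, not something scikit-learn prescribes.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values) with the two one-hot columns from the question.
onehot_cols = ["MSZoning", "LandContour"]
df = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "LandContour": ["Lvl", "HLS", "Lvl"],
}, dtype=object)

encoder = OneHotEncoder()
dense = encoder.fit_transform(df[onehot_cols]).toarray()

# categories_ holds one array of category labels per input column,
# in the same order as the output columns.
names = [
    f"{col}_{cat}"
    for col, cats in zip(onehot_cols, encoder.categories_)
    for cat in cats
]

onehot_df = pd.DataFrame(dense, index=df.index, columns=names)
print(onehot_df)
```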
