How to do OneHotEncoding in a Sklearn Pipeline

Question

I am trying to one-hot encode the categorical variables of my Pandas dataframe, which contains both categorical and continuous variables. I realise this can be done easily with the pandas .get_dummies() function, but I need to use a pipeline so that I can generate a PMML file later on.

This is the code to create a mapper. The names of the categorical variables I would like to encode are stored in a list called 'dummies'.

from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

mapper = DataFrameMapper(
    [(d, LabelEncoder()) for d in dummies] +
    [(d, OneHotEncoder()) for d in dummies]
)

And this is the code to create a pipeline, including the mapper and a linear regression.

from sklearn2pmml import PMMLPipeline
from sklearn.linear_model import LinearRegression

lm = PMMLPipeline([("mapper", mapper),
                   ("regressor", LinearRegression())])

When I now try to fit the pipeline (with 'features' being a dataframe and 'targets' a series), it raises the error 'could not convert string to float'.

lm.fit(features, targets)

Can anyone help me out? I am desperate for a working pipeline that includes the preprocessing of the data... Thanks in advance!

Recommended answer

OneHotEncoder doesn't support string features (this holds for scikit-learn releases before 0.20; later releases accept strings), and with [(d, OneHotEncoder()) for d in dummies] you are applying it directly to the raw string values of every column in dummies, which is what triggers the error. Use LabelBinarizer instead:

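from sklearn.preprocessing import LabelBinarizer

# One LabelBinarizer per categorical column; each accepts string values directly.
mapper = DataFrameMapper(
    [(d, LabelBinarizer()) for d in dummies]
)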
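
To see what each mapper entry produces, here is a minimal sketch that applies LabelBinarizer to a single string column, outside of DataFrameMapper. The column values are made up for illustration and are not from the question.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical string-valued column, standing in for one entry of `dummies`.
colour = ["red", "green", "blue", "green"]

binarizer = LabelBinarizer()
encoded = binarizer.fit_transform(colour)

# The learned classes are sorted, and each row gets a single 1
# in the column that corresponds to its class.
print(binarizer.classes_.tolist())
print(encoded.tolist())
```

Note that for a column with only two distinct values, LabelBinarizer returns a single 0/1 column rather than two columns.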

An alternative would be to use LabelEncoder in the mapper and add OneHotEncoder as a second step in the pipeline.

mapper = DataFrameMapper(
    [(d, LabelEncoder()) for d in dummies]
)

lm = PMMLPipeline([("mapper", mapper),
                   ("onehot", OneHotEncoder()),
                   ("regressor", LinearRegression())])
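
The two-step variant can be illustrated on a single column, again as a sketch with made-up values rather than the asker's data: LabelEncoder first turns the strings into integer codes, and OneHotEncoder then expands those codes into indicator columns.

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical string-valued column, standing in for one entry of `dummies`.
colour = ["red", "green", "blue", "green"]

# Step 1: map each string to an integer code (classes are sorted first).
codes = LabelEncoder().fit_transform(colour)

# Step 2: one-hot encode the integer codes. OneHotEncoder expects a 2-D
# input and returns a sparse matrix by default, hence reshape and toarray.
onehot = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()

print(codes.tolist())
print(onehot.tolist())
```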
