How to handle missing values (NaN) in categorical data when using scikit-learn OneHotEncoder?


Problem description

I have recently started learning Python to develop a predictive model for a research project using machine learning methods. I have a large dataset comprising both numerical and categorical data, and it has lots of missing values. I am currently trying to encode the categorical features using OneHotEncoder. When I read about OneHotEncoder, my understanding was that for a missing value (NaN), OneHotEncoder would assign 0s to all of the feature's categories, like this:

0     Male 
1     Female
2     NaN

After applying OneHotEncoder:

0     10 
1     01
2     00

However, when running the following code:

    # Encoding categorical data
    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    # obj_df is the DataFrame holding the dataset (not shown here);
    # column 1 is the categorical column that contains NaN
    ct = ColumnTransformer([('encoder', OneHotEncoder(handle_unknown='ignore'), [1])],
                           remainder='passthrough')
    obj_df = np.array(ct.fit_transform(obj_df))
    print(obj_df)

I get the error: ValueError: Input contains NaN

So I am guessing my previous understanding of how OneHotEncoder handles missing values is wrong. Is there a way for me to get the functionality described above? I know that imputing the missing values before encoding will resolve this issue, but I am reluctant to do so because I am dealing with medical data and fear that imputation may decrease the predictive accuracy of my model.

I found this question, which is similar, but its answer doesn't offer a detailed enough solution for dealing with the NaN values.

Let me know what your thoughts are, thanks.

Recommended answer

You will need to impute the missing values before encoding. You can define a Pipeline with an imputation step ahead of the one-hot encoding, for instance a SimpleImputer with the most_frequent strategy:

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    # Fill NaN with the most frequent category, then one-hot encode
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))])

    # Apply the categorical pipeline to column 0
    preprocessor = ColumnTransformer(
        transformers=[
            ('cat', categorical_transformer, [0])
        ])

    df = pd.DataFrame(['Male', 'Female', np.nan])
    preprocessor.fit_transform(df)
    # Columns are (Female, Male); the NaN row is imputed as 'Female'
    # array([[0., 1.],
    #        [1., 0.],
    #        [1., 0.]])
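
If replacing NaN with an existing category is a concern, as the question suggests for medical data, one alternative is to fill NaN with a placeholder label so that missingness becomes a category of its own. The following is a minimal sketch under that assumption, not a definitive recipe: the column name 'sex' and the label 'missing' are hypothetical choices made for illustration. Dropping the placeholder column afterwards yields the all-zero row the question asked for.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy column mirroring the question's example; 'sex' is a hypothetical name.
df = pd.DataFrame({'sex': ['Male', 'Female', np.nan]}, dtype=object)

# 'missing' is an arbitrary placeholder label chosen for this sketch.
pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
])

# OneHotEncoder returns a sparse matrix by default, so densify it for display.
encoded = pipeline.fit_transform(df).toarray()
categories = [str(c) for c in pipeline.named_steps['encoder'].categories_[0]]
print(categories)  # ['Female', 'Male', 'missing']
print(encoded)     # one column per category, including the placeholder

# Dropping the placeholder column leaves an all-zero row wherever the value was NaN.
keep = [i for i, name in enumerate(categories) if name != 'missing']
zero_for_nan = encoded[:, keep]
print(zero_for_nan)
```

Whether to keep the extra 'missing' column or drop it is a modelling choice: keeping it lets the model learn from the fact that a value was absent, while dropping it reproduces the encoding sketched in the question.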
