One-hot encoding that preserves the NAs for imputation

Question

I am trying to use KNN to impute categorical variables in Python.

To do so, a typical approach is to one-hot encode the variables first. However, sklearn's OneHotEncoder() doesn't handle NAs, so you need to rename them to some placeholder value, which creates a separate variable.

Small reproducible example:

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

# Create a small DataFrame with categorical columns to impute
data0 = pd.DataFrame(columns=["1","2"],data = [["A",np.nan],["B","A"],[np.nan,"A"],["A","B"]])

Original DataFrame:

data0
     1    2
0    A  NaN
1    B    A
2  NaN    A
3    A    B

Proceed with one-hot encoding:

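# Replace NaNs with a placeholder string so that sklearn's OHE accepts them
enc_missing = SimpleImputer(strategy="constant", fill_value="missing")
data1 = enc_missing.fit_transform(data0)
# Perform OHE (the `sparse` argument was renamed to `sparse_output` in scikit-learn 1.2)
OHE = OneHotEncoder(sparse_output=False)
data_OHE = OHE.fit_transform(data1)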

data_OHE is now one-hot encoded:

data_OHE
array([[1., 0., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0., 0.],
       [0., 0., 1., 1., 0., 0.],
       [1., 0., 0., 0., 1., 0.]])

But because of the separate "missing" category, I no longer have any NaNs to impute.

My desired one-hot encoded output:

array([[1,        0,      np.nan, np.nan],
       [0,        1,        1,       0   ],
       [np.nan, np.nan,     1,       0   ], 
       [1,        0,        0,       1   ]
       ])

This way I keep the NaNs for later imputation.

Do you know of any way to do this?

From my understanding, this has been discussed in the scikit-learn GitHub repo here and here, i.e. making OneHotEncoder handle this automatically with a handle_missing argument, but I am unsure of the status of that work.

Recommended answer

Handling of missing values in OneHotEncoder was eventually merged in PR17317, but it works by simply treating the missing values as a new category (with no option for other treatments, if I understand correctly).

One manual approach is described in this answer. The first step isn't strictly necessary now because of the above PR, but filling with custom text may make it easier to find the resulting column.
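As a rough illustration of what such a manual approach might look like (this is a sketch, not necessarily what the linked answer does): fill the NaNs with a placeholder, one-hot encode, then for each original column locate the placeholder's indicator column, set the affected rows to NaN, and drop that indicator. The helper name `one_hot_keep_nan` and the `placeholder` parameter are hypothetical, and the sketch assumes the placeholder string does not occur as a real category.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder


def one_hot_keep_nan(df, placeholder="missing"):
    """One-hot encode df, leaving NaN in every dummy column of a feature
    whose original value was missing (hypothetical helper)."""
    # Step 1: replace NaNs with a recognizable placeholder string.
    filled = SimpleImputer(strategy="constant",
                           fill_value=placeholder).fit_transform(df)

    # Step 2: ordinary one-hot encoding; .toarray() densifies the default
    # sparse output without depending on the sparse/sparse_output argument.
    enc = OneHotEncoder()
    dense = enc.fit_transform(filled).toarray()

    # Step 3: walk over the block of dummy columns belonging to each feature.
    blocks = []
    start = 0
    for cats in enc.categories_:
        width = len(cats)
        block = dense[:, start:start + width]
        start += width

        hits = np.flatnonzero(cats == placeholder)
        if hits.size:
            idx = hits[0]
            was_missing = block[:, idx] == 1
            # Drop the placeholder indicator, then blank out the rows
            # that were originally missing for this feature.
            block = np.delete(block, idx, axis=1)
            block[was_missing, :] = np.nan
        blocks.append(block)

    return np.hstack(blocks)


data0 = pd.DataFrame(columns=["1", "2"],
                     data=[["A", np.nan], ["B", "A"], [np.nan, "A"], ["A", "B"]])
data_nan_OHE = one_hot_keep_nan(data0)
print(data_nan_OHE)
```

The resulting array keeps NaN in the rows and feature blocks that were missing, so it could then be passed to an imputer such as a KNN-based one. Note that imputed values in the dummy columns will generally be fractional and may need to be mapped back to a single category afterwards.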
