One-hot encoding that preserves the NAs for imputation

Views: 82 Posted: 2021/6/2 22:22:57 Tags: python scikit-learn nan missing-data one-hot-encoding


Problem description

I am trying to use KNN to impute categorical variables in Python.

To do so, a typical approach is to one-hot encode the variables first. However, sklearn's OneHotEncoder() doesn't handle NAs, so you need to rename them to something that creates a separate variable.

Small reproducible example:

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

# Create a small pandas DataFrame with categorical columns to impute
data0 = pd.DataFrame(columns=["1","2"],data = [["A",np.nan],["B","A"],[np.nan,"A"],["A","B"]])

Original data frame:

data0
     1    2
0    A  NaN
1    B    A
2  NaN    A
3    A    B

Proceed with one-hot encoding:

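# Replace NaN with the string "missing" so that OneHotEncoder accepts the data
enc_missing = SimpleImputer(strategy="constant",fill_value="missing")
data1 = enc_missing.fit_transform(data0)
# Perform OHE (from scikit-learn 1.2 on, this argument is named sparse_output instead of sparse):
OHE = OneHotEncoder(sparse=False)
data_OHE = OHE.fit_transform(data1)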

data_OHE is now one-hot encoded:

data_OHE
array([[1., 0., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0., 0.],
       [0., 0., 1., 1., 0., 0.],
       [1., 0., 0., 0., 1., 0.]])

But because of the separate "missing" category, I no longer have any NaNs to impute.

My desired output from one-hot encoding:

array([[1,        0,      np.nan, np.nan],
       [0,        1,        1,       0   ],
       [np.nan, np.nan,     1,       0   ], 
       [1,        0,        0,       1   ]
       ])

That way I keep the NaNs for later imputation.

Do you know of any way to do this?
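For reference, one possible workaround can be sketched with pandas alone. This is only a sketch under assumptions, not an established scikit-learn feature: the helper name one_hot_keep_nan is hypothetical, and it relies on pd.get_dummies ignoring NaN by default, so that rows with a missing value come out as all zeros and can then be overwritten with NaN.

```python
import numpy as np
import pandas as pd

# Same data frame as in the reproducible example
data0 = pd.DataFrame(
    columns=["1", "2"],
    data=[["A", np.nan], ["B", "A"], [np.nan, "A"], ["A", "B"]],
)


def one_hot_keep_nan(df):
    """Hypothetical helper: one-hot encode each column, but put NaN in all
    indicator columns of a feature wherever the original value was missing."""
    blocks = []
    for col in df.columns:
        # get_dummies skips NaN by default, so missing rows are all zeros here
        dummies = pd.get_dummies(df[col], prefix=col).astype(float)
        # Overwrite those all-zero rows with NaN so they can be imputed later
        dummies.loc[df[col].isna(), :] = np.nan
        blocks.append(dummies)
    return pd.concat(blocks, axis=1)


encoded = one_hot_keep_nan(data0)
print(encoded)
```

On the example data this yields the columns 1_A, 1_B, 2_A and 2_B, with NaN in both indicator columns of a feature for the rows where that feature was missing, which is the layout of the desired output.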

From my understanding, this has been discussed in the scikit-learn GitHub repo here and here, i.e. making OneHotEncoder handle this automatically with a handle_missing argument, but I am unsure of the status of that work.
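To illustrate what the preserved NaNs are for, here is a rough sketch of the KNN step on a matrix in the desired format. It uses sklearn.impute.KNNImputer, which measures distance only over the coordinates both rows have. The column order [1_A, 1_B, 2_A, 2_B], the groups and labels dictionaries, and the choice of n_neighbors=1 (picked to avoid averaging ties on such a tiny data set) are all assumptions made for this example.

```python
import numpy as np
from sklearn.impute import KNNImputer

# NaN-preserving one-hot matrix in the desired format;
# columns are assumed to be [1_A, 1_B, 2_A, 2_B]
X = np.array([
    [1, 0, np.nan, np.nan],
    [0, 1, 1, 0],
    [np.nan, np.nan, 1, 0],
    [1, 0, 0, 1],
])

# Fill each missing entry from the nearest row that has a value there
imputer = KNNImputer(n_neighbors=1)
X_imputed = imputer.fit_transform(X)

# Map each original feature to its indicator columns and category labels
groups = {"1": [0, 1], "2": [2, 3]}
labels = {"1": ["A", "B"], "2": ["A", "B"]}

# Decode back to categories: pick the indicator with the largest value
decoded = {
    feature: [labels[feature][i] for i in X_imputed[:, cols].argmax(axis=1)]
    for feature, cols in groups.items()
}
print(X_imputed)
print(decoded)
```

With more neighbors the imputed indicators become averages between 0 and 1, so the argmax step amounts to a majority vote among the neighbors; ties would then need an explicit rule.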
