Cyclical Loop Between OneHotEncoder and KNNImputer in Scikit-learn

Question

I'm working with a really simple dataset. It has some missing values, both in categorical and numeric features. Because of this, I'm trying to use sklearn.impute.KNNImputer to get the most accurate imputation I can. However, when I run the following code:

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=120)
imputer.fit_transform(x_train)

I get the error: ValueError: could not convert string to float: 'Private'

That makes sense; KNNImputer obviously can't handle categorical data. But when I try to run OneHotEncoder with:

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(drop="first")
encoder.fit_transform(x_train[categorical_features])

It throws the error: ValueError: Input contains NaN

I'd prefer to use KNNImputer even with the categorical data, as I feel like I'd be losing some accuracy if I just used a ColumnTransformer and imputed the numeric and categorical data separately. Is there any way to get OneHotEncoder to ignore these missing values? If not, is using a ColumnTransformer or a simpler imputer a better way of tackling this problem?

Thanks in advance.

Recommended Answer

There are open issues/PRs to handle missing values on OneHotEncoder, but it's not clear yet what the options would be. In the interim, here's a manual approach.

  • Fill the missing categorical values with the string "missing", using pandas or SimpleImputer.
  • Then apply OneHotEncoder.
  • Use the one-hot encoder's get_feature_names (get_feature_names_out in scikit-learn 1.0 and later) to identify the columns corresponding to each original feature, and in particular the "missing" indicator column.
  • For each row and each original categorical feature, when the 1 is in the "missing" column, replace the 0's in that feature's other columns with np.nan; then delete the "missing" indicator column.
  • Now everything should be set up to run KNNImputer.
  • Finally, if desired, postprocess the imputed categorical-encoding columns. (Simply rounding might get you an all-zeros row for a categorical feature, but I don't think with KNNImputer you could get more than one 1 in a row. You could take the argmax instead to get back exactly one 1.)
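
The steps above can be sketched roughly as follows. This is only an illustration under several assumptions: the toy DataFrame and its column names (age, workclass) are made up, n_neighbors=2 is chosen just to suit the tiny sample, and the numeric column is left unscaled, so it will dominate the neighbor distances. The sketch also reads encoder.categories_ instead of the feature-name method to locate each feature's columns, since that attribute is the same across scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; NaN marks the missing entries.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0, 46.0, 29.0],
    "workclass": ["Private", "State-gov", "Private", np.nan,
                  "Private", "State-gov"],
})
numeric_features = ["age"]
categorical_features = ["workclass"]

# Step 1: fill categorical gaps with the placeholder string "missing".
cat_filled = df[categorical_features].astype(object).fillna("missing")

# Step 2: one-hot encode without dropping any column, so that "missing"
# keeps its own indicator column.
encoder = OneHotEncoder()
onehot = encoder.fit_transform(cat_filled).toarray()

# Steps 3-4: for each original feature, set the rows flagged as "missing"
# to NaN and drop the indicator column.
blocks, kept_categories = [], []
start = 0
for cats in encoder.categories_:
    block = onehot[:, start:start + len(cats)]
    start += len(cats)
    is_missing = np.array([c == "missing" for c in cats])
    flagged_rows = block[:, is_missing].sum(axis=1) == 1
    block = block[:, ~is_missing]      # boolean indexing returns a copy
    block[flagged_rows] = np.nan
    blocks.append(block)
    kept_categories.append(cats[~is_missing])

# Step 5: impute numeric and encoded columns together.
X = np.hstack([df[numeric_features].to_numpy(dtype=float)] + blocks)
imputed = KNNImputer(n_neighbors=2).fit_transform(X)

# Step 6: argmax over each feature's columns yields exactly one category.
result = pd.DataFrame(imputed[:, :len(numeric_features)],
                      columns=numeric_features)
start = len(numeric_features)
for name, cats in zip(categorical_features, kept_categories):
    block = imputed[:, start:start + len(cats)]
    start += len(cats)
    result[name] = cats[block.argmax(axis=1)]

print(result)
```

Rows that were observed keep their original category, because their one-hot block is never touched by the imputer. On real data it would likely make sense to scale the numeric columns first so they are comparable with the 0/1 columns when distances are computed.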
