Cyclical Loop Between OneHotEncoder and KNNImputer in Scikit-learn
Question
I'm working with a really simple dataset. It has some missing values, both in categorical and numeric features. Because of this, I'm trying to use sklearn.impute.KNNImputer to get the most accurate imputation I can. However, when I run the following code:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=120)
imputer.fit_transform(x_train)
I get the error: ValueError: could not convert string to float: 'Private'
That makes sense; KNNImputer obviously can't handle categorical data. But when I try to run OneHotEncoder with:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(drop="first")
encoder.fit_transform(x_train[categorical_features])
It throws the error: ValueError: Input contains NaN
I'd prefer to use KNNImputer even with the categorical data, as I feel I'd lose some accuracy if I just used a ColumnTransformer and imputed the numeric and categorical data separately. Is there any way to get OneHotEncoder to ignore these missing values? If not, is using a ColumnTransformer or a simpler imputer a better way of tackling this problem?
Thanks in advance.
Recommended answer
There are open issues/PRs for handling missing values in OneHotEncoder, but it's not yet clear what the options would be. In the interim, here's a manual approach.
- Fill the missing categorical values with the string "missing", using pandas or SimpleImputer.
- Then apply OneHotEncoder.
- Use the one-hot encoder's get_feature_names (get_feature_names_out in scikit-learn 1.0 and later) to identify the columns corresponding to each original feature, and in particular the "missing" indicator column.
- For each row and each original categorical feature, when the 1 is in the "missing" column, replace the 0's in that feature's other columns with np.nan; then delete the "missing" indicator column.
- Now everything should be set up to run KNNImputer.
- Finally, if desired, postprocess the imputed categorical-encoding columns. (Simply rounding might give you an all-zeros row for a categorical feature, but I don't think KNNImputer could give you more than one 1 in a row. You could use argmax instead to get back exactly one 1.)
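The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the toy DataFrame and its column names (age, hours, workclass) are made up, n_neighbors=2 is chosen only because the toy data is tiny, and the column names are built from the encoder's categories_ attribute rather than get_feature_names so the sketch does not depend on the scikit-learn version. The numeric features are left unscaled here, so they dominate the neighbor distances; scaling them first may be worth considering.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two numeric features and one categorical feature.
df = pd.DataFrame({
    "age": [25.0, 38.0, np.nan, 52.0, 31.0, 45.0],
    "hours": [40.0, 50.0, 40.0, np.nan, 35.0, 60.0],
    "workclass": ["Private", "State-gov", "Private", np.nan, "Self-emp", "Private"],
})
numeric_features = ["age", "hours"]
categorical_features = ["workclass"]

# Step 1: fill categorical missings with the string "missing".
cat = df[categorical_features].fillna("missing")

# Step 2: one-hot encode (default sparse output, converted to a dense array).
encoder = OneHotEncoder()
encoded_arr = encoder.fit_transform(cat).toarray()

# Step 3: name the encoded columns "<feature>_<level>" from categories_.
columns = [
    f"{feat}_{level}"
    for feat, levels in zip(categorical_features, encoder.categories_)
    for level in levels
]
encoded = pd.DataFrame(encoded_arr, columns=columns, index=df.index)

# Step 4: where the "missing" indicator is 1, blank out the whole feature
# with NaN, then drop the indicator column.
groups = {}  # original feature -> its remaining one-hot columns
for feat, levels in zip(categorical_features, encoder.categories_):
    feat_cols = [f"{feat}_{level}" for level in levels]
    missing_col = f"{feat}_missing"
    if missing_col in feat_cols:
        was_missing = encoded[missing_col] == 1
        encoded.loc[was_missing, feat_cols] = np.nan
        encoded = encoded.drop(columns=missing_col)
        feat_cols.remove(missing_col)
    groups[feat] = feat_cols

# Step 5: impute numeric and encoded categorical columns together.
full = pd.concat([df[numeric_features], encoded], axis=1)
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(full), columns=full.columns, index=full.index
)

# Step 6: argmax within each feature's columns to get exactly one 1 per row.
for feat, feat_cols in groups.items():
    winner = imputed[feat_cols].to_numpy().argmax(axis=1)
    onehot = np.zeros((len(imputed), len(feat_cols)))
    onehot[np.arange(len(imputed)), winner] = 1.0
    imputed[feat_cols] = onehot

print(imputed)
```

Note that argmax breaks ties by taking the first column, so when the neighbors split evenly between categories the result depends on column order.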