Missing values in scikits machine learning
Question
Is it possible to have missing values in scikit-learn? How should they be represented? I couldn't find any documentation about that.
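Recommended answer
scikit-learn does not support missing values at all. This has been discussed on the mailing list before, but nobody has attempted to actually write code to handle them.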
Whatever you do, don't use NaN to encode missing values, since many of the algorithms refuse to handle samples containing NaNs.
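To see what "refuse to handle" looks like in practice, here is a minimal sketch. It assumes a reasonably recent scikit-learn and picks `LinearRegression` purely as an illustrative estimator; the toy arrays are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): one feature value is missing and encoded as NaN.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 5.0]])
y = np.array([0.0, 1.0, 2.0])

try:
    LinearRegression().fit(X, y)
    rejected = False
except ValueError as exc:
    # Input validation rejects the array before any fitting happens.
    rejected = True
    print("fit refused the data:", exc)
```

Not every estimator behaves this way, so treat this as an illustration of the common case rather than a rule for the whole library.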
The above answer is outdated; the latest release of scikit-learn has a class Imputer that does simple, per-feature missing value imputation. You can feed it arrays containing NaNs to have those replaced by the mean, median or mode of the corresponding feature.
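A minimal sketch of per-feature imputation follows. One caveat: the `Imputer` class mentioned in the answer lived in `sklearn.preprocessing` and has since been removed; in current releases the equivalent class is `sklearn.impute.SimpleImputer`, which is what this example assumes. The toy array is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): NaN marks a missing entry.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# strategy can be "mean", "median" or "most_frequent" (the mode).
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Each NaN is replaced by the mean of the observed values in its own column.
X_filled = imputer.fit_transform(X)
print(X_filled)
```

In a real workflow you would probably call `fit` on the training data only and reuse the fitted imputer with `transform` on the test data, so that test values are filled with statistics learned from the training set.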