Why does sklearn Imputer need to fit?


Question

I'm really new in this whole machine learning thing and I'm taking an online course on this subject. In this course, the instructors showed the following piece of code:

imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

I don't really get why this imputer object needs to fit. I mean, I'm just trying to get rid of missing values in my columns by replacing them with the column mean. From the little I know about programming, this is a pretty simple, iterative procedure, and wouldn't require a model that has to train on data to be accomplished.

Can someone please explain how this imputer thing works and why it requires training to replace some missing values by the column mean? I have read scikit-learn's documentation, but it just shows how to use the methods, and not why they're required.

Thanks.

Recommended answer

The Imputer fills missing values with some statistic (e.g. mean, median, ...) of the data. To avoid data leakage during cross-validation, it computes the statistic on the training data during fit, stores it, and applies it to the test data during transform.

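import numpy as np
# Imputer was removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer instead.
from sklearn.preprocessing import Imputer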
obj = Imputer(strategy='mean')

obj.fit([[1, 2, 3], [2, 3, 4]])
print(obj.statistics_)
# array([ 1.5,  2.5,  3.5])

X = obj.transform([[4, np.nan, 6], [5, 6, np.nan]])
print(X)
# array([[ 4. ,  2.5,  6. ],
#        [ 5. ,  6. ,  3.5]])
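
To make the fit/transform split concrete, here is a minimal pure-Python sketch of what a mean imputer might do internally. The class name MeanImputerSketch is hypothetical and this is not scikit-learn's actual implementation; it only illustrates that fit stores per-column means and transform reuses them without recomputing anything. It assumes every column has at least one observed value.

```python
import math


class MeanImputerSketch:
    """Hypothetical, simplified stand-in for a mean imputer (not scikit-learn's code)."""

    def fit(self, rows):
        # Learn one mean per column, ignoring NaN entries.
        # Assumes each column has at least one non-NaN value.
        n_cols = len(rows[0])
        self.statistics_ = []
        for j in range(n_cols):
            observed = [row[j] for row in rows if not math.isnan(row[j])]
            self.statistics_.append(sum(observed) / len(observed))
        return self

    def transform(self, rows):
        # Fill NaN entries with the means learned in fit; nothing is recomputed here.
        return [
            [self.statistics_[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in rows
        ]


sketch = MeanImputerSketch().fit([[1, 2, 3], [2, 3, 4]])
filled = sketch.transform([[4, math.nan, 6], [5, 6, math.nan]])
print(sketch.statistics_)  # [1.5, 2.5, 3.5]
print(filled)              # [[4, 2.5, 6], [5, 6, 3.5]]
```

The data passed to transform never influences the stored means, which is the property that matters once training and test data differ.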

If your train and test data are identical, you can do both steps at once with fit_transform.

X = obj.fit_transform([[1, 2, np.nan], [2, 3, 4]])
print(X)
# array([[ 1. ,  2. ,  4. ],
#        [ 2. ,  3. ,  4. ]])

This data leakage issue is important, since the data distribution may change from the training data to the testing data, and you don't want the information of the testing data to be already present during the fit.
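
A small illustration of that leakage, as a sketch with made-up numbers. It uses SimpleImputer from sklearn.impute, the replacement for Imputer in newer scikit-learn releases, so it assumes a version that ships that class. The arrays X_train and X_test are invented for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: the test column has a very different scale from the training column.
X_train = np.array([[1.0], [2.0], [np.nan], [3.0]])
X_test = np.array([[100.0], [np.nan]])

# Correct: learn the mean from the training rows only.
imputer = SimpleImputer(strategy="mean").fit(X_train)
train_mean = imputer.statistics_[0]        # mean of 1, 2, 3
X_test_filled = imputer.transform(X_test)  # NaN becomes the training mean

# Leaky: fitting on train and test together lets the test value 100 shift the mean.
leaky = SimpleImputer(strategy="mean").fit(np.vstack([X_train, X_test]))
leaky_mean = leaky.statistics_[0]          # mean of 1, 2, 3, 100

print(train_mean, leaky_mean)
print(X_test_filled.ravel())
```

In the leaky variant the fill value depends on a test observation, so anything evaluated on that test set has already seen part of it.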

See the doc for more information about cross-validation.
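
One common way to get this behaviour automatically during cross-validation is to put the imputer in a pipeline, so that it is refit on the training part of every fold. The following is a sketch under assumptions: the data is randomly generated, LogisticRegression is an arbitrary choice of estimator, and SimpleImputer again stands in for the older Imputer class.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data, invented for this example.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
X[rng.rand(100, 3) < 0.1] = np.nan  # knock out roughly 10% of the entries

# cross_val_score clones and fits the whole pipeline on each training fold,
# so the imputer's means never include rows from the held-out fold.
model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)
print(scores.shape)
```

Calling fit_transform on the full X before cross-validating would instead compute the means once from all rows, including those later used for scoring.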
