fit_transform() on training data and transform() on test data

Question

I am having trouble understanding exactly how transform() and fit_transform() work together.

I call fit_transform() on my training data set and then transform() on my test set.

However, if I call fit_transform() on the test set instead, I get bad results.

Can anybody explain how and why this happens?

Recommended answer

Let's take one transformer as an example: sklearn.preprocessing.StandardScaler.

From the docs, this will:

Standardize features by removing the mean and scaling to unit variance

Suppose you're working with code like the following.

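import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X is the feature matrix, y holds the labels.
# Placeholder data so the snippet runs on its own; substitute your own X and y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Hold out 33% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)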

When you call fit(X_train) on a StandardScaler instance, it computes the mean and variance of each feature from the values in X_train and stores them on the scaler (as mean_ and var_, with the standard deviation in scale_). Calling transform() then standardizes each feature by subtracting that mean and dividing by that standard deviation (the square root of the variance, not the variance itself). For convenience, the two calls can be done in one step with fit_transform().
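As a rough illustration of the paragraph above, the sketch below checks that fit() followed by transform() gives the same result as fit_transform(), and that both match the formula written out by hand. The data is synthetic and the variable names are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: 200 samples, 3 features on different scales.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 10.0, 0.5], size=(200, 3))

# Two-step version: learn the statistics, then apply them.
two_step = StandardScaler()
two_step.fit(X_train)
scaled_two_step = two_step.transform(X_train)

# One-step version on a fresh scaler.
scaled_one_step = StandardScaler().fit_transform(X_train)

# Manual version: subtract the per-feature mean and divide by the per-feature
# standard deviation (population form, ddof=0, which StandardScaler uses).
scaled_manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(scaled_two_step, scaled_one_step))
print(np.allclose(scaled_two_step, scaled_manual))
print("learned means:", two_step.mean_)
print("learned standard deviations:", two_step.scale_)
```

The learned statistics live on the fitted scaler object, which is why the same object can later be reused on other data.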

The reason you want to fit the scaler using only the training data is that you don't want information from the test data to leak into, and bias, your model.

If you call fit() on your test data, you compute a new mean and variance for each feature. In theory these values would be very similar to the training statistics if your test and training sets have the same distribution, but in practice they usually differ. The test features then end up on a different scale from the one the model was trained on, which is why fit_transform() on the test set gives bad results.
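To make the mismatch concrete, here is a small sketch on synthetic data. Both sets are drawn from the same distribution, but the test set is small, so its sample statistics drift away from the training ones. The data and variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical data: same underlying distribution, very different sample sizes.
X_train = rng.normal(loc=10.0, scale=2.0, size=(500, 1))
X_test = rng.normal(loc=10.0, scale=2.0, size=(20, 1))

train_scaler = StandardScaler().fit(X_train)
test_scaler = StandardScaler().fit(X_test)  # the mistake being illustrated

print("train mean/std:", train_scaler.mean_, train_scaler.scale_)
print("test  mean/std:", test_scaler.mean_, test_scaler.scale_)

# The same raw value is mapped to different scaled values by the two scalers,
# so a model trained on train-scaled features would misread it.
raw = np.array([[12.0]])
z_train = train_scaler.transform(raw)
z_test = test_scaler.transform(raw)
print("scaled with train statistics:", z_train)
print("scaled with test statistics: ", z_test)
```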

Instead, you want to only transform() the test data, using the parameters that were computed on the training data.
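Put together, the workflow might look like the sketch below. The X and y arrays are synthetic placeholders, and the check against the hand-written formula is there only to show which statistics end up being applied to the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for real features and labels.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(300, 2))
y = rng.integers(0, 2, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn statistics from train only
X_test_scaled = scaler.transform(X_test)        # reuse them, no refitting

# The test set is scaled with the training mean and standard deviation.
expected = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_test_scaled, expected))

# Training features are centered at 0; test features are only close to 0.
print("train means after scaling:", X_train_scaled.mean(axis=0))
print("test means after scaling: ", X_test_scaled.mean(axis=0))
```

The scaled test features will generally not have a mean of exactly 0 or a variance of exactly 1. That is expected: they are measured on the training set's scale, which is the scale the model has seen.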
