Comparing Results from StandardScaler vs Normalizer in Linear Regression

Problem Description

I'm working through some examples of Linear Regression under different scenarios, comparing the results from using Normalizer and StandardScaler, and the results are puzzling.

I'm using the Boston housing dataset, and prepping it this way:

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# load the data
boston = load_boston()
df = pd.DataFrame(boston.data)
df.columns = boston.feature_names
df['PRICE'] = boston.target

I'm currently trying to reason about the results I get from the following scenarios:

  • Initializing Linear Regression with the parameter normalize=True vs using Normalizer
  • Initializing Linear Regression with the parameter fit_intercept = False with and without standardization.

Collectively, I find the results confusing.

Here's how I'm setting everything up:

# Prep the data
X = df.iloc[:, :-1]
y = df.iloc[:, -1:]
normal_X = Normalizer().fit_transform(X)
scaled_X = StandardScaler().fit_transform(X)

#now prepare some of the models
reg1 = LinearRegression().fit(X, y)
reg2 = LinearRegression(normalize=True).fit(X, y)
reg3 = LinearRegression().fit(normal_X, y)
reg4 = LinearRegression().fit(scaled_X, y)
reg5 = LinearRegression(fit_intercept=False).fit(scaled_X, y)

Then, I created 3 separate dataframes to compare the R^2 scores, coefficient values, and predictions from each model.

To create the dataframe to compare coefficient values from each model, I did the following:

#Create a dataframe of the coefficients
coef = pd.DataFrame({
    'coeff':                       reg1.coef_[0],
    'coeff_normalize_true':        reg2.coef_[0],
    'coeff_normalizer':            reg3.coef_[0],
    'coeff_scaler':                reg4.coef_[0],
    'coeff_scaler_no_int':         reg5.coef_[0]
})

Here's how I created the dataframe to compare the R^2 values from each model:

scores = pd.DataFrame({
    'score':                        reg1.score(X, y),
    'score_normalize_true':         reg2.score(X, y),
    'score_normalizer':             reg3.score(normal_X, y),
    'score_scaler':                 reg4.score(scaled_X, y),
    'score_scaler_no_int':          reg5.score(scaled_X, y)
    }, index=range(1)
)

Lastly, here's the dataframe that compares the predictions from each:

predictions = pd.DataFrame({
    'pred':                        reg1.predict(X).ravel(),
    'pred_normalize_true':         reg2.predict(X).ravel(),
    'pred_normalizer':             reg3.predict(normal_X).ravel(),
    'pred_scaler':                 reg4.predict(scaled_X).ravel(),
    'pred_scaler_no_int':          reg5.predict(scaled_X).ravel()
}, index=range(len(y)))

Here are the resulting dataframes (coefficients, scores, and predictions; shown as images in the original post).

I have three questions that I can't reconcile:

  1. Why is there absolutely no difference between the first two models? It appears that setting normalize=True does nothing. I can understand having predictions and R^2 values that are the same, but my features have different numerical scales, so I'm not sure why normalizing would have no effect at all. This is doubly confusing when you consider that using StandardScaler changes the coefficients considerably.
  2. I don't understand why the model using Normalizer causes such radically different coefficient values from the others, especially when the model with LinearRegression(normalize=True) makes no change at all.

If you were to look at the documentation for each, it appears they're very similar, if not identical.

From the docs for sklearn.linear_model.LinearRegression():

normalize : boolean, optional, default False

This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.

Meanwhile, the docs on sklearn.preprocessing.Normalizer state that it normalizes to the l2 norm by default.

I don't see a difference between what these two options do, and I don't see why one would produce such radically different coefficient values from the other.

  3. The results from the model using StandardScaler are coherent to me, but I don't understand why the model using StandardScaler with fit_intercept=False performs so poorly.

From the docs on the Linear Regression module:

fit_intercept : boolean, optional, default True

whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).

The StandardScaler centers your data, so I don't understand why using it with fit_intercept=False produces incoherent results.

Recommended Answer

  1. The reason there is no difference in coefficients between the first two models is that sklearn de-normalizes the coefficients behind the scenes after calculating them from the normalized input data. Reference

This de-normalization is done so that, for test data, we can directly apply the coefficients and get predictions without normalizing the test data.

Hence, setting normalize=True does have an impact on the coefficients, but they don't affect the best-fit line anyway.
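
As a quick check, here is a minimal sketch on synthetic data (all names are illustrative, and it assumes a scikit-learn version old enough to still accept the since-removed normalize parameter):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3)) * [1.0, 10.0, 100.0]  # features on very different scales
y_demo = X_demo @ np.array([0.5, 0.05, 0.005]) + rng.normal(size=100)

plain = LinearRegression().fit(X_demo, y_demo)
normed = LinearRegression(normalize=True).fit(X_demo, y_demo)

# sklearn fits on internally normalized data, then rescales the coefficients
# back to the original feature scale, so the two models end up identical.
print(np.allclose(plain.coef_, normed.coef_))                       # True
print(np.allclose(plain.predict(X_demo), normed.predict(X_demo)))  # True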

  2. Normalizer does the normalization with respect to each sample (i.e. row-wise). You can see the reference code here.

From the docs:

Normalize samples individually to unit norm.

whereas normalize=True does the normalization with respect to each column/feature. Reference
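
A tiny example of the difference (nothing assumed beyond sklearn.preprocessing itself):

import numpy as np
from sklearn.preprocessing import Normalizer, normalize

A = np.array([[3.0, 4.0],
              [6.0, 8.0]])

# Row-wise: each *sample* is scaled to unit l2 norm (what Normalizer does);
# here both rows become [0.6, 0.8].
print(Normalizer().fit_transform(A))

# Column-wise: each *feature* is scaled to unit l2 norm; normalize=True works
# in this direction (and additionally centers each column first).
print(normalize(A, axis=0))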

Here is an example to understand the impact of normalization on different dimensions of the data. Let us take two dimensions, x1 and x2, with y as the target variable. The target variable values are color-coded in the figure.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import Normalizer, StandardScaler
from sklearn.preprocessing import normalize

n=50
x1 = np.random.normal(0, 2, size=n)
x2 = np.random.normal(0, 2, size=n)
noise = np.random.normal(0, 1, size=n)
y = 5 + 0.5*x1 + 2.5*x2 + noise

fig,ax=plt.subplots(1,4,figsize=(20,6))

ax[0].scatter(x1,x2,c=y)
ax[0].set_title('raw_data',size=15)

X = np.column_stack((x1,x2))

column_normalized=normalize(X, axis=0)
ax[1].scatter(column_normalized[:,0],column_normalized[:,1],c=y)
ax[1].set_title('column_normalized data',size=15)

row_normalized=Normalizer().fit_transform(X)
ax[2].scatter(row_normalized[:,0],row_normalized[:,1],c=y)
ax[2].set_title('row_normalized data',size=15)

standardized_data=StandardScaler().fit_transform(X)
ax[3].scatter(standardized_data[:,0],standardized_data[:,1],c=y)
ax[3].set_title('standardized data',size=15)

plt.subplots_adjust(left=0.3, bottom=None, right=0.9, top=None, wspace=0.3, hspace=None)
plt.show()

You can see that the best-fit line for the data in figs 1, 2 and 4 would be the same, which signifies that the R^2 score will not change due to column/feature normalization or standardization; it just ends up with different coefficient values.

Note: the best-fit line for fig 3 would be different.

  3. When you set fit_intercept=False, the bias term is subtracted from the prediction. This means the intercept is set to zero; otherwise it would have been the mean of the target variable.

A prediction with the intercept forced to zero is expected to perform poorly on problems where the target variable is not scaled to mean 0. You can see a difference of 22.532 (the mean of the target variable) in every row, which shows the impact on the output.
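
To see why, here is a minimal sketch on synthetic data (names are illustrative): with standardized features, the fitted intercept equals the target mean, so forcing it to zero shifts every prediction by exactly that constant.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 22.5 + X_demo @ np.array([1.0, -2.0]) + rng.normal(size=200)

Xs = StandardScaler().fit_transform(X_demo)
with_int = LinearRegression().fit(Xs, y_demo)
no_int = LinearRegression(fit_intercept=False).fit(Xs, y_demo)

# Because each column of Xs sums to zero, both fits find the same slopes,
# and the intercept model's intercept equals mean(y).
print(np.allclose(with_int.coef_, no_int.coef_))   # True
print(with_int.intercept_, y_demo.mean())          # equal
print(np.allclose(with_int.predict(Xs) - no_int.predict(Xs),
                  y_demo.mean()))                  # constant offset of mean(y)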
