Comparing Results from StandardScaler vs Normalizer in Linear Regression
Question
I'm working through some examples of linear regression under different scenarios, comparing the results from using Normalizer and StandardScaler, and the results are puzzling.
I'm using the Boston housing dataset, and prepping it this way:
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
# Load the data (load_boston is only available in scikit-learn releases before 1.2)
boston = load_boston()
df = pd.DataFrame(boston.data)
df.columns = boston.feature_names
df['PRICE'] = boston.target
I'm currently trying to reason about the results I get from the following scenarios:
- Initializing LinearRegression with the parameter normalize=True vs. using Normalizer
- Initializing LinearRegression with the parameter fit_intercept=False, with and without standardization.
Collectively, I find the results confusing.
Here's how I'm setting everything up:
# Prep the data
X = df.iloc[:, :-1]
y = df.iloc[:, -1:]
normal_X = Normalizer().fit_transform(X)
scaled_X = StandardScaler().fit_transform(X)
#now prepare some of the models
reg1 = LinearRegression().fit(X, y)
reg2 = LinearRegression(normalize=True).fit(X, y)
reg3 = LinearRegression().fit(normal_X, y)
reg4 = LinearRegression().fit(scaled_X, y)
reg5 = LinearRegression(fit_intercept=False).fit(scaled_X, y)
Then, I created 3 separate dataframes to compare the R^2 score, coefficient values, and predictions from each model.
To create the dataframe to compare coefficient values from each model, I did the following:
#Create a dataframe of the coefficients
coef = pd.DataFrame({
    'coeff': reg1.coef_[0],
    'coeff_normalize_true': reg2.coef_[0],
    'coeff_normalizer': reg3.coef_[0],
    'coeff_scaler': reg4.coef_[0],
    'coeff_scaler_no_int': reg5.coef_[0]
})
Here's how I created the dataframe to compare the R^2 values from each model:
scores = pd.DataFrame({
    'score': reg1.score(X, y),
    'score_normalize_true': reg2.score(X, y),
    'score_normalizer': reg3.score(normal_X, y),
    'score_scaler': reg4.score(scaled_X, y),
    'score_scaler_no_int': reg5.score(scaled_X, y)
}, index=range(1))
Lastly, here's the dataframe that compares the predictions from each:
predictions = pd.DataFrame({
    'pred': reg1.predict(X).ravel(),
    'pred_normalize_true': reg2.predict(X).ravel(),
    'pred_normalizer': reg3.predict(normal_X).ravel(),
    'pred_scaler': reg4.predict(scaled_X).ravel(),
    'pred_scaler_no_int': reg5.predict(scaled_X).ravel()
}, index=range(len(y)))
Here are the resulting dataframes (coefficients, scores, and predictions):
I have three questions that I can't reconcile:
- Why is there absolutely no difference between the first two models? It appears that setting normalize=True does nothing. I can understand having predictions and R^2 values that are the same, but my features have different numerical scales, so I'm not sure why normalizing would have no effect at all. This is doubly confusing when you consider that using StandardScaler changes the coefficients considerably.
- I don't understand why the model using Normalizer causes such radically different coefficient values from the others, especially when the model with LinearRegression(normalize=True) makes no change at all.
If you were to look at the documentation for each, it appears they're very similar if not identical.
From the docs on sklearn.linear_model.LinearRegression():
normalize : boolean, optional, default False
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
Meanwhile, the docs on sklearn.preprocessing.Normalizer state that it normalizes to the l2 norm by default.
I don't see a difference between what these two options do, and I don't see why one would have such radical differences in coefficient values from the other.
- The results from the model using StandardScaler are coherent to me, but I don't understand why the model using StandardScaler and setting fit_intercept=False performs so poorly.
From the docs on the LinearRegression module:
fit_intercept : boolean, optional, default True
whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).
StandardScaler centers your data, so I don't understand why using it with fit_intercept=False produces incoherent results.
Recommended answer
- The reason there is no difference in coefficients between the first two models is that scikit-learn de-normalizes the coefficients behind the scenes after computing them from the normalized input data. (Reference)
This de-normalization is done so that the coefficients can be applied directly to test data, giving predictions without having to normalize the test data first.
Hence, setting normalize=True does change the coefficients computed internally, but once they are de-normalized the reported coefficients, and therefore the best-fit line, are the same as without it.
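The de-normalization step can be reproduced by hand. The following is a minimal sketch on synthetic data (not the Boston set), assuming the documented behaviour of normalize=True: subtract each column's mean, then divide by the column's l2-norm. The names X_norm, coef_back and intercept_back are illustrative and are not scikit-learn internals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic features on very different scales (hypothetical data)
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 0.01]) + np.array([5.0, -20.0, 0.3])
y = 4.0 + X @ np.array([2.0, -0.1, 30.0]) + rng.normal(scale=0.5, size=200)

# Plain fit on the raw features
raw = LinearRegression().fit(X, y)

# Mimic the documented normalize=True preprocessing:
# subtract the column mean, then divide by the column l2-norm
X_mean = X.mean(axis=0)
X_centered = X - X_mean
col_norm = np.linalg.norm(X_centered, axis=0)
X_norm = X_centered / col_norm

# Fit on the normalized features; these coefficients differ from raw.coef_
inner = LinearRegression().fit(X_norm, y)

# De-normalize: map the coefficients back to the original feature scale
coef_back = inner.coef_ / col_norm
intercept_back = y.mean() - X_mean @ coef_back

print(np.allclose(coef_back, raw.coef_))
print(np.isclose(intercept_back, raw.intercept_))
```

Both checks should print True: the coefficients fitted on the normalized columns are different, but mapping them back to the original scale recovers the plain fit.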
- Normalizer does the normalization with respect to each sample (meaning row-wise). You can see the reference code here. From the documentation:
Normalize samples individually to unit norm.
whereas normalize=True does the normalization with respect to each column/feature. (Reference)
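A small numeric check makes the row-wise vs. column-wise distinction concrete. This sketch uses a hypothetical 3x2 array and the normalize helper with axis=0 as a stand-in for column-wise scaling. Unlike normalize=True, that helper does not subtract the mean first.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [1.0, 0.0]])

row_wise = Normalizer().fit_transform(X)  # each sample (row) scaled to unit l2-norm
col_wise = normalize(X, axis=0)           # each feature (column) scaled to unit l2-norm

print(np.linalg.norm(row_wise, axis=1))   # norms of the rows after Normalizer
print(np.linalg.norm(col_wise, axis=0))   # norms of the columns after normalize(axis=0)
print(row_wise[0], row_wise[1])           # rows that differed only in magnitude
```

The first two rows differ only by a factor of two, so Normalizer maps them to the same point. Row-wise scaling discards each sample's magnitude, which is why a regression fitted on its output can look so different from the others.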
Here is an example to show the impact of normalizing along different dimensions of the data. Let us take two features, x1 and x2, with y as the target variable. The target variable's value is color-coded in the figure.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import Normalizer, StandardScaler
# normalize is exposed from sklearn.preprocessing directly
from sklearn.preprocessing import normalize
n=50
x1 = np.random.normal(0, 2, size=n)
x2 = np.random.normal(0, 2, size=n)
noise = np.random.normal(0, 1, size=n)
y = 5 + 0.5*x1 + 2.5*x2 + noise
fig,ax=plt.subplots(1,4,figsize=(20,6))
ax[0].scatter(x1,x2,c=y)
ax[0].set_title('raw_data',size=15)
X = np.column_stack((x1,x2))
column_normalized=normalize(X, axis=0)
ax[1].scatter(column_normalized[:,0],column_normalized[:,1],c=y)
ax[1].set_title('column_normalized data',size=15)
row_normalized=Normalizer().fit_transform(X)
ax[2].scatter(row_normalized[:,0],row_normalized[:,1],c=y)
ax[2].set_title('row_normalized data',size=15)
standardized_data=StandardScaler().fit_transform(X)
ax[3].scatter(standardized_data[:,0],standardized_data[:,1],c=y)
ax[3].set_title('standardized data',size=15)
plt.subplots_adjust(left=0.3, bottom=None, right=0.9, top=None, wspace=0.3, hspace=None)
plt.show()
You can see that the best-fit line for the data in figures 1, 2 and 4 would be the same, which means the R^2 score does not change under column/feature normalization or standardization. They simply end up with different coefficient values.
Note: the best-fit line for figure 3 (row-normalized data) would be different.
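The claim about the four panels can also be checked numerically. The sketch below regenerates similar synthetic data (with a fixed seed, which is an assumption added here for repeatability) and compares the R^2 of a LinearRegression fitted on each variant.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import Normalizer, StandardScaler, normalize

rng = np.random.default_rng(42)
n = 50
x1 = rng.normal(0, 2, size=n)
x2 = rng.normal(0, 2, size=n)
y = 5 + 0.5 * x1 + 2.5 * x2 + rng.normal(0, 1, size=n)
X = np.column_stack((x1, x2))

# The same four views of the data as in the figure
variants = {
    "raw": X,
    "column_normalized": normalize(X, axis=0),
    "row_normalized": Normalizer().fit_transform(X),
    "standardized": StandardScaler().fit_transform(X),
}

# Fit one model per variant and record its R^2 on the training data
r2 = {name: LinearRegression().fit(data, y).score(data, y)
      for name, data in variants.items()}

for name, value in r2.items():
    print(f"{name:>18}: R^2 = {value:.4f}")
```

The raw, column-normalized and standardized variants should report the same R^2, because each is an invertible per-column rescaling (plus a shift) that a model with an intercept can absorb into its coefficients. Only the row-normalized variant changes the fit.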
- When you set fit_intercept=False, the bias term is dropped from the prediction. That is, the intercept is forced to zero, whereas otherwise it would have been the mean of the target variable (the standardized features are centered, so the fitted intercept equals that mean).
A prediction with a zero intercept is expected to perform badly when the target variable is not itself centered (mean = 0). You can see a difference of 22.532 in every row, which is the effect of the missing intercept on the output.
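The constant offset can be reproduced on synthetic data. This is a minimal sketch with hypothetical features and a target whose mean is far from zero. It assumes nothing about the Boston data beyond what is stated above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
y = 22.0 + X @ np.array([1.5, -0.7]) + rng.normal(size=100)  # target mean far from 0

Xs = StandardScaler().fit_transform(X)  # features centered and scaled; y untouched

with_int = LinearRegression().fit(Xs, y)
no_int = LinearRegression(fit_intercept=False).fit(Xs, y)

# With centered features the slopes agree; only the intercept is missing
gap = with_int.predict(Xs) - no_int.predict(Xs)
print(with_int.intercept_, y.mean())
print(gap.min(), gap.max())

# Centering the target as well makes the no-intercept model equivalent
no_int_centered = LinearRegression(fit_intercept=False).fit(Xs, y - y.mean())
print(with_int.score(Xs, y),
      no_int.score(Xs, y),
      no_int_centered.score(Xs, y - y.mean()))
```

Every prediction from the no-intercept model is off by the same amount, the mean of y, which is why its R^2 collapses. Subtracting the mean from the target before fitting restores the original score.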