sklearn standardscaler result different to manual result
Question
I used the sklearn StandardScaler (mean removal and variance scaling) to scale a dataframe and compared it to a dataframe where I "manually" subtracted the mean and divided by the standard deviation. The comparison shows consistent small differences. Can anybody explain why? (The dataset I used is this: http://archive.ics.uci.edu/ml/datasets/Wine)
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.read_csv("~/DataSets/WineDataSetItaly/wine.data.txt", names=["Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline"])
cols = list(df.columns)[1:] # I didn't want to scale the "Class" column
std_scal = StandardScaler()
standardized = std_scal.fit_transform(df[cols])
df_standardized_fit = pd.DataFrame(standardized, index=df.index, columns=df.columns[1:])
df_standardized_manual = (df - df.mean()) / df.std()
df_standardized_manual.drop("Class", axis=1, inplace=True)
df_differences = df_standardized_fit - df_standardized_manual
df_differences.iloc[:,:5]
Alcohol Malic acid Ash Alcalinity Magnesium
0 0.004272 -0.001582 0.000653 -0.003290 0.005384
1 0.000693 -0.001405 -0.002329 -0.007007 0.000051
2 0.000554 0.000060 0.003120 -0.000756 0.000249
3 0.004758 -0.000976 0.001373 -0.002276 0.002619
4 0.000832 0.000640 0.005177 0.001271 0.003606
5 0.004168 -0.001455 0.000858 -0.003628 0.002421
Answer
scikit-learn uses np.std, which by default computes the population standard deviation (the sum of squared deviations divided by the number of observations, N), whereas pandas computes the sample standard deviation (denominator N - 1); see Wikipedia's standard deviation article. The N - 1 denominator is a correction, controlled by the degrees of freedom (ddof), that makes the variance estimate unbiased. So by default numpy's and scikit-learn's calculations use ddof=0 whereas pandas uses ddof=1 (see the pandas docs):
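The ddof difference can be verified directly with numpy alone. This is a minimal sketch on a hypothetical toy array; the two results differ exactly by the factor sqrt(N / (N - 1)):

```python
import numpy as np

# Hypothetical toy sample, just to illustrate the ddof difference
x = np.array([1.0, 2.0, 3.0, 4.0])  # N = 4, mean = 2.5

pop_std = np.std(x, ddof=0)   # population std: divide squared deviations by N
samp_std = np.std(x, ddof=1)  # sample std: divide by N - 1

# samp_std / pop_std == sqrt(N / (N - 1)), i.e. sqrt(4/3) here
ratio = samp_std / pop_std
```

For a large N the factor sqrt(N / (N - 1)) approaches 1, which is why the differences in the question are small but consistent.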
DataFrame.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
If you change your manual pandas calculation to:
df_standardized_manual = (df - df.mean()) / df.std(ddof=0)
the differences are practically zero:
Alcohol Malic acid Ash Alcalinity of ash Magnesium
0 -8.215650e-15 -5.551115e-16 3.191891e-15 0.000000e+00 2.220446e-16
1 -8.715251e-15 -4.996004e-16 3.441691e-15 0.000000e+00 0.000000e+00
2 -8.715251e-15 -3.955170e-16 2.886580e-15 -5.551115e-17 1.387779e-17
3 -8.437695e-15 -4.440892e-16 3.164136e-15 -1.110223e-16 1.110223e-16
4 -8.659740e-15 -3.330669e-16 2.886580e-15 5.551115e-17 2.220446e-16
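The fix can be checked end to end without the wine data. This is a minimal sketch on a hypothetical two-column frame: with ddof=0 the manual standardization matches StandardScaler to floating-point precision:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical small frame standing in for the wine features
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 6.0, 8.0]})

scaled = StandardScaler().fit_transform(df)          # uses population std (ddof=0)
manual = (df - df.mean()) / df.std(ddof=0)           # match it by passing ddof=0

# agrees up to floating-point rounding
assert np.allclose(scaled, manual.to_numpy())
```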