sklearn StandardScaler result differs from manual result

Question

I used sklearn's StandardScaler (mean removal and variance scaling) to scale a dataframe and compared the result with a dataframe in which I "manually" subtracted the mean and divided by the standard deviation. The comparison shows consistent small differences. Can anybody explain why? (The dataset I used is the UCI Wine dataset: http://archive.ics.uci.edu/ml/datasets/Wine)

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("~/DataSets/WineDataSetItaly/wine.data.txt", names=["Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline"])

cols = list(df.columns)[1:]    # I didn't want to scale the "Class" column
std_scal = StandardScaler()
standardized = std_scal.fit_transform(df[cols])
df_standardized_fit = pd.DataFrame(standardized, index=df.index, columns=df.columns[1:])

df_standardized_manual = (df - df.mean()) / df.std()
df_standardized_manual.drop("Class", axis=1, inplace=True)

df_differences = df_standardized_fit - df_standardized_manual
df_differences.iloc[:,:5]


    Alcohol    Malic acid   Ash         Alcalinity  Magnesium
0   0.004272    -0.001582   0.000653    -0.003290   0.005384
1   0.000693    -0.001405   -0.002329   -0.007007   0.000051
2   0.000554    0.000060    0.003120    -0.000756   0.000249
3   0.004758    -0.000976   0.001373    -0.002276   0.002619
4   0.000832    0.000640    0.005177    0.001271    0.003606
5   0.004168    -0.001455   0.000858    -0.003628   0.002421

Recommended answer

scikit-learn uses np.std, which by default computes the population standard deviation (the sum of squared deviations is divided by the number of observations), whereas pandas computes the sample standard deviation (the denominator is the number of observations minus 1); see Wikipedia's standard deviation article. The N - 1 denominator is a correction (Bessel's correction) that makes the variance estimate unbiased, and it is controlled by the ddof ("delta degrees of freedom") argument. So by default, numpy's and scikit-learn's calculations use ddof=0 whereas pandas uses ddof=1 (docs):

DataFrame.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.
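To make the two defaults concrete, here is a minimal sketch comparing np.std and pandas' Series.std on a small made-up sample. The values and variable names are illustrative assumptions and are not taken from the Wine data.

```python
import numpy as np
import pandas as pd

# Made-up sample values (an assumption for illustration only).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
series = pd.Series(values)

# numpy default: ddof=0, i.e. divide the squared deviations by N.
population_std = np.std(values)

# pandas default: ddof=1, i.e. divide the squared deviations by N - 1.
sample_std = series.std()

print("np.std default (ddof=0):    ", population_std)
print("Series.std default (ddof=1):", sample_std)

# Passing ddof explicitly makes the two libraries agree in either direction.
pandas_matches_numpy = np.isclose(series.std(ddof=0), population_std)
numpy_matches_pandas = np.isclose(np.std(values, ddof=1), sample_std)
print(pandas_matches_numpy, numpy_matches_pandas)
```

The sample standard deviation is always slightly larger than the population one, so dividing by it yields slightly smaller standardized values.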

If you change your pandas calculation to:

df_standardized_manual = (df - df.mean()) / df.std(ddof=0)

the differences are effectively zero (only floating-point rounding error remains):

        Alcohol    Malic acid           Ash  Alcalinity of ash     Magnesium
0 -8.215650e-15 -5.551115e-16  3.191891e-15       0.000000e+00  2.220446e-16
1 -8.715251e-15 -4.996004e-16  3.441691e-15       0.000000e+00  0.000000e+00
2 -8.715251e-15 -3.955170e-16  2.886580e-15      -5.551115e-17  1.387779e-17
3 -8.437695e-15 -4.440892e-16  3.164136e-15      -1.110223e-16  1.110223e-16
4 -8.659740e-15 -3.330669e-16  2.886580e-15       5.551115e-17  2.220446e-16
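The same effect can be reproduced without the Wine file. The sketch below uses synthetic random data (an assumption; the column names "a", "b", "c" and the row count of 178, chosen to match the size of the UCI Wine data, are illustrative). It also checks that the gap between the two scalings is a constant factor sqrt(N / (N - 1)), which is why the differences in the question are small but consistent.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: 178 rows, 3 numeric columns).
rng = np.random.default_rng(0)
n_rows = 178
df_demo = pd.DataFrame(
    rng.normal(loc=10.0, scale=3.0, size=(n_rows, 3)),
    columns=["a", "b", "c"],
)

# StandardScaler divides by the population standard deviation (ddof=0).
scaled_sklearn = pd.DataFrame(
    StandardScaler().fit_transform(df_demo),
    index=df_demo.index,
    columns=df_demo.columns,
)

# Manual scaling with the pandas default (ddof=1) and with ddof=0.
scaled_sample = (df_demo - df_demo.mean()) / df_demo.std()
scaled_population = (df_demo - df_demo.mean()) / df_demo.std(ddof=0)

# With ddof=0 the manual result matches scikit-learn up to rounding error.
max_gap_population = (scaled_sklearn - scaled_population).abs().max().max()

# With ddof=1 the results differ by the constant factor sqrt(N / (N - 1)).
correction = np.sqrt(n_rows / (n_rows - 1))
max_gap_rescaled = (scaled_sklearn - scaled_sample * correction).abs().max().max()

print("correction factor:", correction)
print("max gap with ddof=0:", max_gap_population)
print("max gap after rescaling the ddof=1 result:", max_gap_rescaled)
```

For N = 178 the factor is roughly 1.0028, so each standardized value shifts by about 0.28% of its magnitude, which agrees with the size of the differences shown in the question.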
