StandardScaler in Spark not working as expected

Question

Any idea why spark would be doing this for StandardScaler? As per the definition of StandardScaler:

The StandardScaler standardizes a set of features to have zero mean and a standard deviation of 1. The flag withStd will scale the data to unit standard deviation while the flag withMean (false by default) will center the data prior to scaling it.

>>> tmpdf.show(4)
+----+----+----+------------+
|int1|int2|int3|temp_feature|
+----+----+----+------------+
|   1|   2|   3|       [2.0]|
|   7|   8|   9|       [8.0]|
|   4|   5|   6|       [5.0]|
+----+----+----+------------+

>>> sScaler = StandardScaler(withMean=True, withStd=True).setInputCol("temp_feature")
>>> sScaler.fit(tmpdf).transform(tmpdf).show()
+----+----+----+------------+-------------------------------------------+
|int1|int2|int3|temp_feature|StandardScaler_4fe08ca180ab163e4120__output|
+----+----+----+------------+-------------------------------------------+
|   1|   2|   3|       [2.0]|                                     [-1.0]|
|   7|   8|   9|       [8.0]|                                      [1.0]|
|   4|   5|   6|       [5.0]|                                      [0.0]|
+----+----+----+------------+-------------------------------------------+

In the numpy world

>>> x
array([2., 8., 5.])
>>> (x - x.mean())/x.std()
array([-1.22474487,  1.22474487,  0.        ])

In the sklearn world

>>> scaler = StandardScaler(with_mean=True, with_std=True)
>>> data
[[2.0], [8.0], [5.0]]
>>> print(scaler.fit(data).transform(data))
[[-1.22474487]
 [ 1.22474487]
 [ 0.        ]]

Answer

The reason that your results are not as expected is because pyspark.ml.feature.StandardScaler uses the unbiased sample standard deviation instead of the population standard deviation.

From the docs:

The "unit std" is computed using the corrected sample standard deviation, which is computed as the square root of the unbiased sample variance.

If you were to try your numpy code with the sample standard deviation, you'd see the same results:

import numpy as np

x = np.array([2., 8., 5.])
print((x - x.mean())/x.std(ddof=1))
#array([-1.,  1.,  0.])

From a modeling perspective, this almost surely isn't a problem (unless your data is the entire population, which is pretty much never the case). Also keep in mind that for large sample sizes, the sample standard deviation approaches the population standard deviation. So if you have many rows in your DataFrame, the difference here will be negligible.
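
As a quick sanity check (this sketch is not from the original answer), you can see in numpy how the gap between the two estimates shrinks as the sample size grows:

import numpy as np

rng = np.random.default_rng(0)
for n in (3, 30, 3000):
    x = rng.normal(size=n)
    pop = x.std()            # population std (ddof=0)
    samp = x.std(ddof=1)     # unbiased sample std (what Spark uses)
    print(n, samp - pop)
# the printed difference shrinks toward 0 as n grows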

However, if you insisted on having your scaler use the population standard deviation, one "hacky" way is to add a row to your DataFrame that is the mean of the columns.

Recall that the population standard deviation is defined as the square root of the average of the squared differences from the mean. Or as a function:

# using the same x as above
def popstd(x):
    # population std: average the squared deviations over len(x), not len(x) - 1
    return np.sqrt(sum((xi - x.mean())**2/len(x) for xi in x))

print(popstd(x))
#2.4494897427831779

print(x.std())
#2.4494897427831779

The difference when using the unbiased standard deviation is simply that you divide by len(x)-1 instead of len(x). So if you added a sample that was equal to the mean value, you'd increase the denominator without impacting the overall mean.
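
A minimal numpy check of that claim (using the same x as above):

import numpy as np

x = np.array([2., 8., 5.])
x_padded = np.append(x, x.mean())   # append one sample equal to the mean

print(x.std())                      # population std of the original
print(x_padded.std(ddof=1))         # sample std after padding -- same value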

Suppose you had the following DataFrame:

df = spark.createDataFrame(
    np.array(range(1,10,1)).reshape(3,3).tolist(),
    ["int1", "int2", "int3"]
)
df.show()
#+----+----+----+
#|int1|int2|int3|
#+----+----+----+
#|   1|   2|   3|
#|   4|   5|   6|
#|   7|   8|   9|
#+----+----+----+

Union this DataFrame with the mean value for each column:

import pyspark.sql.functions as f
# This is equivalent to UNION ALL in SQL
df2 = df.union(df.select(*[f.avg(c).alias(c) for c in df.columns]))

Now scale your values:

from pyspark.ml.feature import VectorAssembler, StandardScaler
va = VectorAssembler(inputCols=["int2"], outputCol="temp_feature")

tmpdf = va.transform(df2)
sScaler = StandardScaler(
    withMean=True, withStd=True, inputCol="temp_feature", outputCol="scaled"
)
sScaler.fit(tmpdf).transform(tmpdf).show()
#+----+----+----+------------+---------------------+
#|int1|int2|int3|temp_feature|scaled               |
#+----+----+----+------------+---------------------+
#|1.0 |2.0 |3.0 |[2.0]       |[-1.2247448713915892]|
#|4.0 |5.0 |6.0 |[5.0]       |[0.0]                |
#|7.0 |8.0 |9.0 |[8.0]       |[1.2247448713915892] |
#|4.0 |5.0 |6.0 |[5.0]       |[0.0]                |
#+----+----+----+------------+---------------------+
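
If you don't want that extra mean row to show up in the scaled output, one option (a sketch, not part of the original answer; the is_dummy column name is just an illustrative choice) is to tag the dummy row before the union and filter it out after the transform:

import pyspark.sql.functions as f

# Sketch only: flag the appended "mean" row so it can be dropped after scaling.
df_tagged = df.withColumn("is_dummy", f.lit(False))
mean_row = (
    df.select(*[f.avg(c).alias(c) for c in df.columns])
      .withColumn("is_dummy", f.lit(True))
)
df2 = df_tagged.union(mean_row)

# assemble, fit and transform exactly as above, then drop the dummy row
tmpdf = va.transform(df2)
scaled = sScaler.fit(tmpdf).transform(tmpdf).filter(~f.col("is_dummy"))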
