How to standardize ONE column in Spark using StandardScaler?

Question

I am trying to standardize (mean = 0, std = 1) one column ('age') in my data frame. Below is my code in Spark (Python):

from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline

# Assemble the 'age' column into a one-element vector column (StandardScaler expects a vector input):
age_assembler = VectorAssembler(inputCols= ['age'], outputCol = "age_feature")

# Create a scaler that takes 'age_feature' as an input column:
scaler = StandardScaler(inputCol="age_feature", outputCol="age_scaled",
                        withStd=True, withMean=True)

# Creating a mini-pipeline for those 2 steps:
age_pipeline = Pipeline(stages=[age_assembler, scaler])
scaled = age_pipeline.fit(sample17)
sample17_scaled = scaled.transform(sample17)
type(sample17_scaled)

It seems to run just fine. And the very last line produces: "sample17_scaled:pyspark.sql.dataframe.DataFrame"

But when I run the line below it shows that the new column age_scaled is of type 'vector': |-- age_scaled: vector (nullable = true)

sample17_scaled.printSchema()

How can I calculate anything using this new column? For example, I can't calculate a mean. When I try, it says the column should be of type 'long' and not udt.

Thank you very much!

Recommended answer

Just use plain aggregation:

from pyspark.sql.functions import stddev, mean, col

sample17 = spark.createDataFrame([(1, ), (2, ), (3, )]).toDF("age")

(sample17
  .select(mean("age").alias("mean_age"), stddev("age").alias("stddev_age"))
  .crossJoin(sample17)
  .withColumn("age_scaled" , (col("age") - col("mean_age")) / col("stddev_age")))

# +--------+----------+---+----------+
# |mean_age|stddev_age|age|age_scaled|
# +--------+----------+---+----------+
# |     2.0|       1.0|  1|      -1.0|
# |     2.0|       1.0|  2|       0.0|
# |     2.0|       1.0|  3|       1.0|
# +--------+----------+---+----------+

# Alternatively, collect the two statistics to the driver and use them as plain Python values:
mean_age, stddev_age = sample17.select(mean("age"), stddev("age")).first()
sample17.withColumn("age_scaled", (col("age") - mean_age) / stddev_age)

# +---+----------+
# |age|age_scaled|
# +---+----------+
# |  1|      -1.0|
# |  2|       0.0|
# |  3|       1.0|
# +---+----------+
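The scaled values in the tables above can be checked by hand. Below is a minimal sketch in plain Python (no Spark needed) using the same toy data. It relies on the fact that `pyspark.sql.functions.stddev` computes the sample standard deviation (dividing by n - 1), which is also what `StandardScaler` uses, so the two approaches should agree. The population variant is included only for contrast and is not part of the answer.

```python
import statistics

ages = [1, 2, 3]  # same toy data as the example DataFrame

mu = statistics.mean(ages)       # arithmetic mean
sigma = statistics.stdev(ages)   # sample standard deviation (divides by n - 1)
scaled = [(a - mu) / sigma for a in ages]

# Population standard deviation (divides by n), shown only for contrast:
# it gives different scaled values than the Spark output above.
sigma_pop = statistics.pstdev(ages)
scaled_pop = [(a - mu) / sigma_pop for a in ages]

print(scaled)
print(scaled_pop)
```

If a hand calculation does not match Spark's output, checking which of the two standard deviations was used is a reasonable first step.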

If you want to keep using a Transformer, you can split the vector into columns.
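One possible way to split the vector is a small UDF that pulls the single entry out of the `age_scaled` vector into an ordinary double column, which the usual aggregations then accept. This is a sketch, not the answerer's code: the helper names `first_element` and `add_scalar_column` and the output column name are hypothetical, and the Spark imports sit inside the function only so that the pure helper can be used without a Spark installation.

```python
def first_element(vector):
    """Return the first entry of a vector-like value as a float (None stays None)."""
    if vector is None:
        return None
    return float(vector[0])


def add_scalar_column(df, vector_col, scalar_col):
    """Add `scalar_col` holding the first entry of the vector column `vector_col`.

    Hypothetical helper: `df` is assumed to be a pyspark.sql.DataFrame whose
    `vector_col` has the ML vector type produced by StandardScaler.
    """
    # Imported lazily so first_element stays usable without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    to_scalar = udf(first_element, DoubleType())
    return df.withColumn(scalar_col, to_scalar(df[vector_col]))
```

With the question's DataFrame this would be called as `add_scalar_column(sample17_scaled, "age_scaled", "age_scaled_value")`, after which `mean("age_scaled_value")` works like on any numeric column. On Spark 3.0 and later, `pyspark.ml.functions.vector_to_array` is a built-in alternative that avoids the Python UDF.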
