Why does StandardScaler not attach metadata to the output column?

Question

I noticed that the spark.ml StandardScaler does not attach metadata to its output column:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._

val df = spark.read.option("header", true)
  .option("inferSchema", true)
  .csv("/path/to/cars.data")

val strId1 = new StringIndexer()
  .setInputCol("v7")
  .setOutputCol("v7_IDX")
val strId2 = new StringIndexer()
  .setInputCol("v8")
  .setOutputCol("v8_IDX")

val assmbleFeatures: VectorAssembler = new VectorAssembler()
  .setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7_IDX"))
  .setOutputCol("featuresRaw")

val scalerModel = new StandardScaler()
  .setInputCol("featuresRaw")
  .setOutputCol("scaledFeatures")


val plm = new Pipeline()
  .setStages(Array(strId1, strId2, assmbleFeatures, scalerModel))
  .fit(df)

val dft = plm.transform(df)

dft.schema("scaledFeatures").metadata

This gives:

res1: org.apache.spark.sql.types.Metadata = {}

This example works on this dataset (just adapt the path in the code above).

Is there a specific reason for this? Is this feature likely to be added to Spark in the future? Any suggestions for a workaround that does not involve duplicating StandardScaler?

Recommended answer

While discarding the metadata is probably not the most fortunate choice, scaling indexed categorical features doesn't make any sense. The values returned by StringIndexer are just labels.
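To make the "just labels" point concrete, here is a minimal plain-Scala sketch (no Spark needed). It only mimics two documented defaults: StringIndexer assigning index 0 to the most frequent label, and StandardScaler dividing by the sample standard deviation (withStd = true, withMean = false). The toy `makes` column and all variable names are assumptions made up for illustration, not part of the original dataset.

```scala
// Hypothetical toy column of car makes (not taken from the original dataset).
val makes = Seq("ford", "ford", "ford", "toyota", "toyota", "bmw")

// Mimic StringIndexer's default ordering: the most frequent label gets index 0.
// Counts are distinct here (3, 2, 1), so the ordering is unambiguous.
val index: Map[String, Double] = makes
  .groupBy(identity)
  .toSeq
  .sortBy { case (_, occurrences) => -occurrences.size }
  .map { case (label, _) => label }
  .zipWithIndex
  .map { case (label, i) => label -> i.toDouble }
  .toMap

val indexed = makes.map(index)

// Mimic StandardScaler's defaults: divide by the sample standard deviation,
// without centering.
val mean = indexed.sum / indexed.size
val std = math.sqrt(indexed.map(x => math.pow(x - mean, 2)).sum / (indexed.size - 1))
val scaled = indexed.map(_ / std)

// "bmw" ends up at exactly twice the value of "toyota", only because it is rarer.
println(makes.zip(scaled).distinct)
```

The scaled values still encode nothing but frequency rank, so the ratio between two makes has no meaning, which is why the categorical index is better kept out of the scaler's input.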

If you want to scale the numerical features, do it in a separate stage:

val numericAssembler: VectorAssembler = new VectorAssembler()
  .setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6"))
  .setOutputCol("numericFeatures")

val scaler = new StandardScaler()
  .setInputCol("numericFeatures")
  .setOutputCol("scaledNumericFeatures")

val finalAssembler: VectorAssembler = new VectorAssembler() 
  .setInputCols(Array("scaledNumericFeatures", "v7_IDX"))
  .setOutputCol("features")

new Pipeline()
  .setStages(Array(strId1, strId2, numericAssembler, scaler, finalAssembler))
  .fit(df)
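To check what a stage actually attached to a vector column, the ML attributes can be read back through `AttributeGroup.fromStructField`. The sketch below is one possible way to do that; `describeSlots` is a hypothetical helper name introduced here, not something Spark provides, and it assumes the field is a vector column.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.types.StructField

// Hypothetical helper (not part of Spark): for a vector column's schema field,
// return (attribute name, attribute type) per slot, or an empty Seq when the
// column carries no per-slot ML attributes.
def describeSlots(field: StructField): Seq[(String, String)] =
  AttributeGroup.fromStructField(field).attributes match {
    case Some(attrs) =>
      attrs.toSeq.map(a => (a.name.getOrElse("<unnamed>"), a.attrType.name))
    case None =>
      Seq.empty
  }
```

Assuming the fitted pipeline above were bound to a value, say `model`, then `describeSlots(model.transform(df).schema("features"))` should show whether the `v7_IDX` slot is still marked as nominal after the final assembler.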

Keeping in mind the concerns raised at the beginning of this answer, you can also try copying the metadata:

val result = plm.transform(df).transform(df => 
  df.withColumn(
   "scaledFeatures", 
   $"scaledFeatures".as(
     "scaledFeatures", 
     df.schema("featuresRaw").metadata)))

result.schema("scaledFeatures").metadata

{"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"v0"},{"idx":1,"name":"v1"},{"idx":2,"name":"v2"},{"idx":3,"name":"v3"},{"idx":4,"name":"v4"},{"idx":5,"name":"v5"},{"idx":6,"name":"v6"}],"nominal":[{"vals":["ford","chevrolet","plymouth","dodge","amc","toyota","datsun","vw","buick","pontiac","honda","mazda","mercury","oldsmobile","peugeot","fiat","audi","chrysler","volvo","opel","subaru","saab","mercedes","renault","cadillac","bmw","triumph","hi","capri","nissan"],"idx":7,"name":"v7_IDX"}]},"num_attrs":8}}
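If the same copy is needed for more than one column, the step above can be wrapped in a small reusable function. This is only a sketch: `copyMetadata` is a hypothetical name, not a Spark API, and it relies on nothing beyond `Column.as(alias, metadata)` and `Dataset.transform`, which the answer already uses.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper (not part of Spark): re-alias column `to` so that it
// carries the schema metadata of column `from`. Values are left unchanged.
def copyMetadata(from: String, to: String)(df: DataFrame): DataFrame =
  df.withColumn(to, col(to).as(to, df.schema(from).metadata))
```

With the names from this thread it would be called as `plm.transform(df).transform(copyMetadata("featuresRaw", "scaledFeatures"))`. Note that the copied attributes describe the unscaled vector, so any value-dependent information they might contain (for example min or max of a numeric attribute) would no longer match the scaled data.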
