Tagging columns as Categorical in Spark

Question

I am currently using StringIndexer to convert many columns into unique integers for classification in a RandomForestModel. I am also using a pipeline for the ML process.

A few questions:

  1. How does the RandomForestModel know which columns are categorical? StringIndexer converts non-numerical values to numerical ones, but does it add metadata of some sort to indicate that the result is a categorical column? In mllib.tree.RF there was a parameter called categoricalFeaturesInfo that indicated which columns are categorical. How does ml.tree.RF know, since that parameter is not present?

  2. StringIndexer maps categories to integers based on frequency of occurrence. Now, when new data comes in, how do I make sure that this data is encoded consistently with the training data? Is it possible to do that without StringIndexing the whole data again, including the new data?

I am quite confused about how to implement this.

Answer

Is it possible to do that without StringIndexing the whole data again, including the new data?

Yes, it is possible. You just have to use an indexer fitted on the training data. If you use ML pipelines it will be handled for you (a pipeline sketch follows the example below); otherwise just use the StringIndexerModel directly:

import org.apache.spark.ml.feature.StringIndexer

// Toy training and test data. This assumes the spark-shell, where the
// SQLContext implicits needed for .toDF are already in scope.
val train = sc.parallelize(Seq((1, "a"), (2, "a"), (3, "b"))).toDF("x", "y")
val test  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "b"))).toDF("x", "y")

// Fit on the training data only; the returned StringIndexerModel stores the
// label-to-index mapping and can be reused on any new data.
val indexer = new StringIndexer()
  .setInputCol("y")
  .setOutputCol("y_index")
  .fit(train)

indexer.transform(train).show

// +---+---+-------+
// |  x|  y|y_index|
// +---+---+-------+
// |  1|  a|    0.0|
// |  2|  a|    0.0|
// |  3|  b|    1.0|
// +---+---+-------+

indexer.transform(test).show

// +---+---+-------+
// |  x|  y|y_index|
// +---+---+-------+
// |  1|  a|    0.0|
// |  2|  b|    1.0|
// |  3|  b|    1.0|
// +---+---+-------+
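
For the pipeline case mentioned above, the same idea applies: fit the Pipeline once and reuse the resulting PipelineModel on new data. A minimal sketch:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

// Fitting learns the mapping once; pipelineModel.transform(test) then
// reuses it, so old and new data are encoded consistently.
val pipeline = new Pipeline().setStages(Array(
  new StringIndexer().setInputCol("y").setOutputCol("y_index")))
val pipelineModel = pipeline.fit(train)
pipelineModel.transform(test).show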

One possible caveat is that it doesn't handle unseen labels gracefully, so you have to drop them before transforming.
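
A minimal sketch of one way to drop them, assuming Spark 1.5+ where StringIndexerModel exposes the labels it learned and Column.isin is available:

import org.apache.spark.sql.functions.col

// Keep only rows whose label was seen when the indexer was fitted;
// indexer.labels holds the labels learned from the training data.
val seen = indexer.labels.toSet
val safeTest = test.filter(col("y").isin(seen.toSeq: _*))
indexer.transform(safeTest).show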

How does the RandomForestModel know which columns are categorical?

Different ML transformers add special metadata to the transformed columns, indicating the type of the column, the number of classes, and so on.

import org.apache.spark.ml.attribute._
import org.apache.spark.ml.feature.VectorAssembler

// Assemble the raw numeric column and the indexed column into a single
// features vector; the attribute metadata is carried over per slot.
val assembler = new VectorAssembler()
  .setInputCols(Array("x", "y_index"))
  .setOutputCol("features")

val transformed = assembler.transform(indexer.transform(train))

// Inspect the per-feature attributes attached to the features column.
val meta = AttributeGroup.fromStructField(transformed.schema("features"))
meta.attributes.get

// Array[org.apache.spark.ml.attribute.Attribute] = Array(
//   {"type":"numeric","idx":0,"name":"x"},
//   {"vals":["a","b"],"type":"nominal","idx":1,"name":"y_index"})

transformed.select($"features").schema.fields.last.metadata
// "ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"x"}], 
//  "nominal":[{"vals":["a","b"],"idx":1,"name":"y_index"}]},"num_attrs":2}}
