Tagging columns as Categorical in Spark


Question


I am currently using StringIndexer to convert a lot of columns into unique integers for classification in RandomForestModel. I am also using a pipeline for the ML process.

Some questions are:

  1. How does the RandomForestModel know which columns are categorical? StringIndexer converts non-numerical values to numerical ones, but does it add metadata of some sort to indicate that it is a categorical column? In mllib.tree.RF there was a parameter called categoricalInfo which indicated which columns are categorical. How does ml.tree.RF know this, since that parameter is not present?

  2. Also, StringIndexer maps categories to integers based on frequency of occurrences. Now, when new data comes in, how do I make sure that this data is encoded consistently with the training data? Is it possible to do that without StringIndexing the whole data again, including the new data?

I am quite confused about how to implement this.

Solution

Is it possible to do that without StringIndexing the whole data again, including the new data?

Yes, it is possible. You just have to use an indexer fitted on the training data. If you use ML Pipelines this is handled for you; otherwise just use the StringIndexerModel directly:

import org.apache.spark.ml.feature.StringIndexer

val train = sc.parallelize(Seq((1, "a"), (2, "a"), (3, "b"))).toDF("x", "y")
val test  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "b"))).toDF("x", "y")

val indexer = new StringIndexer()
  .setInputCol("y")
  .setOutputCol("y_index")
  .fit(train)

indexer.transform(train).show

// +---+---+-------+
// |  x|  y|y_index|
// +---+---+-------+
// |  1|  a|    0.0|
// |  2|  a|    0.0|
// |  3|  b|    1.0|
// +---+---+-------+
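The indices above are assigned by frequency: "a" occurs twice in the training data, so it gets index 0.0. As a minimal plain-Scala sketch of that assignment (no Spark required; the alphabetical tie-breaking shown here is an assumption for illustration, not something StringIndexer guarantees):

```scala
// Assigns indices to labels by descending frequency, mirroring the
// default ordering StringIndexer uses when fitting.
def frequencyIndex(labels: Seq[String]): Map[String, Double] =
  labels.groupBy(identity)
    .toSeq
    .sortBy { case (label, occ) => (-occ.size, label) } // ties broken alphabetically (assumption)
    .map(_._1)
    .zipWithIndex
    .map { case (label, i) => label -> i.toDouble }
    .toMap

println(frequencyIndex(Seq("a", "a", "b"))) // Map(a -> 0.0, b -> 1.0)
```

Because the mapping is frozen at fit time, applying the fitted model to new data reuses exactly these indices instead of recomputing them.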

indexer.transform(test).show

// +---+---+-------+
// |  x|  y|y_index|
// +---+---+-------+
// |  1|  a|    0.0|
// |  2|  b|    1.0|
// |  3|  b|    1.0|
// +---+---+-------+

One possible caveat is that it doesn't handle unseen labels gracefully, so you have to drop these before transforming.
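A minimal sketch of that filtering step (plain Scala, no Spark; `fittedLabels` here is a stand-in for the `labels` array exposed by the fitted StringIndexerModel):

```scala
// Drop rows whose label was never seen when the indexer was fitted,
// so a later transform cannot fail on an unknown label.
val fittedLabels = Set("a", "b")             // e.g. indexer.labels.toSet
val rows = Seq((1, "a"), (2, "c"), (3, "b")) // "c" is unseen
val safe = rows.filter { case (_, y) => fittedLabels.contains(y) }
println(safe) // List((1,a), (3,b))
```

On an actual DataFrame the same idea could be expressed as something like test.filter($"y".isin(indexer.labels: _*)) before calling transform.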

How does the RandomForestModel know which columns are categorical?

Different ML transformers add special metadata to the transformed columns which indicates the type of the column, the number of classes, etc.

import org.apache.spark.ml.attribute._
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("x", "y_index"))
  .setOutputCol("features")

val transformed = assembler.transform(indexer.transform(train))
val meta = AttributeGroup.fromStructField(transformed.schema("features"))
meta.attributes.get

// Array[org.apache.spark.ml.attribute.Attribute] = Array(
//   {"type":"numeric","idx":0,"name":"x"},
//   {"vals":["a","b"],"type":"nominal","idx":1,"name":"y_index"})

or

transformed.select($"features").schema.fields.last.metadata
// {"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"x"}],
//   "nominal":[{"vals":["a","b"],"idx":1,"name":"y_index"}]},"num_attrs":2}}
