Spark, ML, StringIndexer: handling unseen labels

Question

My goal is to build a multiclass classifier.

I have built a pipeline for feature extraction. Its first step is a StringIndexer transformer that maps each class name to a label; this label is then used in the classifier training step.

The pipeline is fitted on the training set.

The test set has to be processed by the fitted pipeline in order to extract the same feature vectors.

My test set files have the same structure as the training set. The possible scenario here is that the test set contains a class name that was not seen during training; in that case the StringIndexer will fail to find the label, and an exception will be raised.

Is there a solution for this case? Or how can we avoid it happening?

Recommended answer

No nice way to do it, I'm afraid. Either


  • filter out the test examples with unknown labels before applying StringIndexer
  • or fit StringIndexer to the union of the train and test dataframes, so you are assured all labels are there
  • or map any unknown label in the test examples to a known label

Here is some sample code to perform the above operations:

import org.apache.spark.sql.functions.udf

// get training labels from the original train dataframe
val trainlabels = traindf.select(colname).distinct.collect.map(_.getString(0))  // Array[String]
// or, instead of the line above, get the labels from a trained StringIndexer model:
// val trainlabels = simodel.labels

// define a UDF that tests whether a label was seen during training
val filterudf = udf { label: String => trainlabels.contains(label) }

// filter out the test examples with unseen labels
val filteredTestdf = testdf.filter(filterudf(testdf(colname)))

// map each unknown value to some known value, say "a"
val mapudf = udf { label: String => if (trainlabels.contains(label)) label else "a" }

// add a new column to testdf holding the mapped labels
val transformedTestdf = testdf.withColumn("newcol", mapudf(testdf(colname)))
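
The sample above covers filtering and remapping. For the remaining option, fitting on the union, a minimal sketch might look like the one below. It assumes Spark 2.x or later (for SparkSession and Dataset.union); the object name, the "category" column, the "label" output column and the toy data are all hypothetical. As a possible alternative to filtering by hand, it also uses the handleInvalid parameter that StringIndexer has had since Spark 1.6, where "skip" drops rows whose label was not seen during fitting.

```scala
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object; names are illustrative only.
object UnseenLabelsSketch {

  // Fit the indexer on the label column of train and test combined,
  // so every label present in either set receives an index.
  def fitOnUnion(train: DataFrame, test: DataFrame, colname: String): StringIndexerModel =
    new StringIndexer()
      .setInputCol(colname)
      .setOutputCol("label")
      .fit(train.select(colname).union(test.select(colname)))

  // Fit on the training set only; at transform time, rows whose label
  // was not seen during fitting are dropped instead of raising an error.
  def fitSkippingUnseen(train: DataFrame, colname: String): StringIndexerModel =
    new StringIndexer()
      .setInputCol(colname)
      .setOutputCol("label")
      .setHandleInvalid("skip")
      .fit(train)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("unseen-labels-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data: "d" occurs only in the test set.
    val traindf = Seq("a", "b", "c", "a").toDF("category")
    val testdf = Seq("a", "d").toDF("category")

    fitOnUnion(traindf, testdf, "category").transform(testdf).show()
    fitSkippingUnseen(traindf, "category").transform(testdf).show()

    spark.stop()
  }
}
```

Note that fitting on the union only prevents the exception: a label that appears only in the test set gets an index, but the classifier has never seen it during training.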
