Spark ML StringIndexer Different Labels Training/Testing

Question

I'm using Scala, with StringIndexer assigning an index to each category in my training set. It assigns indices based on the frequency of each category.

The problem is that in my testing data the category frequencies are different, so StringIndexer assigns different indices to the categories, which prevents me from evaluating the model (Random Forest) correctly.

I process the training and testing data in exactly the same way, and I don't save the model.

I have tried creating the labels manually (by getting the index of the category), but I get this error:

java.lang.IllegalArgumentException: RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.

It seems that I must use StringIndexer, so how do I ensure that future datasets I use for testing index the categories the same way as the training set?

Edit: adding the code for the workaround I tried.

This is what the DataFrame looks like; call it mydata:

+--------+-----+---------+---------+
|category|label|        x|        y|
+--------+-----+---------+---------+
|       a|  0.0|-0.166992|-0.256348|
|       b|  1.0|-0.179199| -0.22998|
|       c|  2.0|-0.172119|-0.105713|
|       d|  3.0|-0.064209| 0.050293|
+--------+-----+---------+---------+

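I use a VectorAssembler to prepare the features:

import org.apache.spark.ml.feature.VectorAssembler

// Each input column is its own array element: Array("x", "y"), not Array("x, y").
val assembler = new VectorAssembler().setInputCols(Array("x", "y")).setOutputCol("features")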

I transform mydata with the assembler above, which adds the features column:

val predValues = assembler.transform(mydata)

The model expects two columns, features and label, and I want to use my own label. So I select features from predValues:

 val features = sqlContext.sql("SELECT features from predValues")

And select label from my DataFrame:

 val labelDF = sqlContext.sql("SELECT label FROM filterFeaturesOnly")

Then I join the two, so that I have features and label to pass to the model:

val featuresAndLabels = features.join(labelDF)

This is what I pass to the model, and I get the error mentioned above.

val label = predValues.join(labelDF)

Recommended answer

If you want to label things consistently, you need to keep (or save) the fitted StringIndexer and reuse it, rather than fitting a new one on each dataset.
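
To see why refitting on the test set changes the labels, here is a minimal plain-Scala sketch that imitates frequency-based indexing. It does not use Spark: the object name FrequencyIndexDemo, the function frequencyIndex, the toy data, and the alphabetical tie-breaking are all assumptions made for illustration, not StringIndexer's actual implementation.

```scala
// Plain-Scala imitation of frequency-based indexing; this is NOT Spark's API.
object FrequencyIndexDemo {

  // The most frequent value gets index 0.0, the next one 1.0, and so on.
  // Ties are broken alphabetically (an assumption, chosen for determinism).
  def frequencyIndex(values: Seq[String]): Map[String, Double] =
    values
      .groupBy(identity)
      .map { case (value, occurrences) => (value, occurrences.size) }
      .toSeq
      .sortBy { case (value, count) => (-count, value) }
      .zipWithIndex
      .map { case ((value, _), index) => value -> index.toDouble }
      .toMap

  def main(args: Array[String]): Unit = {
    // Hypothetical data: same categories, opposite frequencies.
    val train = Seq("a", "a", "a", "b", "b", "c")
    val test  = Seq("c", "c", "c", "b", "b", "a")

    // "Fitting" on each dataset separately yields two different mappings:
    // on train, "a" is the most frequent value; on test, "c" is.
    val fittedOnTrain = frequencyIndex(train)
    val fittedOnTest  = frequencyIndex(test)
    println(s"fitted on train: $fittedOnTrain")
    println(s"fitted on test:  $fittedOnTest")

    // Reusing the mapping learned from train keeps the labels consistent.
    println(s"test labelled with train mapping: ${test.map(fittedOnTrain)}")
  }
}
```

In this sketch "a" maps to 0.0 when the mapping is learned from train but to 2.0 when it is learned from test, which is the same kind of mismatch described in the question.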

Consider this sample code from the docs:

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

val indexed = indexer.fit(df).transform(df)

The indexer.fit(df) call returns a StringIndexerModel, which can then run transform. So instead, keep the fitted model in its own value:

val indexerModel = indexer.fit(trainDF)
val indexed = indexerModel.transform(trainDF)

This later lets you call indexerModel.transform(testDF) to get the same labels for the same inputs.
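
Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes Spark 2.0 or later (it uses SparkSession rather than the sqlContext from the question) running in local mode. The toy data, the object name ReuseFittedIndexer, and the path /tmp/category-indexer are hypothetical. Saving and reloading is optional and only matters if the test data is processed in a separate job.

```scala
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.SparkSession

object ReuseFittedIndexer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("reuse-fitted-indexer")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: same categories, different frequencies.
    val trainDF = Seq("a", "a", "a", "b", "b", "c").toDF("category")
    val testDF  = Seq("c", "c", "c", "b", "b", "a").toDF("category")

    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("label")

    // Fit once, on the training data only.
    val indexerModel: StringIndexerModel = indexer.fit(trainDF)

    // Apply the same fitted model to both datasets.
    indexerModel.transform(trainDF).show()
    indexerModel.transform(testDF).show()

    // Optional: persist the fitted model and reload it in a later job.
    val path = "/tmp/category-indexer" // hypothetical location
    indexerModel.write.overwrite().save(path)
    val reloaded = StringIndexerModel.load(path)
    reloaded.transform(testDF).show()

    spark.stop()
  }
}
```

Because the label column here is produced by the fitted StringIndexerModel rather than built by hand, it is indexed the same way for training and testing data.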
