Spark ML StringIndexer Different Labels Training/Testing

Problem description

I'm using Scala and am using StringIndexer to assign indices to each category in my training set. It assigns indices based on the frequency of each category.

The problem is that in my testing data the category frequencies are different, so StringIndexer assigns different indices to the categories, which prevents me from evaluating the model (Random Forest) correctly.
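
To make the mismatch concrete, here is a minimal plain-Scala sketch (no Spark needed) of frequency-ordered indexing. The helper `fitIndex` and the sample values are hypothetical and only imitate StringIndexer's default behaviour of giving the most frequent label index 0. Ties are broken alphabetically here purely to keep the sketch deterministic.

```scala
// Hypothetical helper: most frequent label gets index 0, the next gets 1, and so on.
// Ties are broken alphabetically so the result is deterministic in this sketch.
def fitIndex(values: Seq[String]): Map[String, Int] =
  values
    .groupBy(identity)
    .map { case (label, occurrences) => (label, occurrences.size) }
    .toSeq
    .sortBy { case (label, count) => (-count, label) }
    .map(_._1)
    .zipWithIndex
    .toMap

// Made-up data: "a" dominates training, "c" dominates testing.
val train = Seq("a", "a", "a", "b", "b", "c")
val test  = Seq("c", "c", "c", "b", "b", "a")

val trainMapping = fitIndex(train) // a -> 0, b -> 1, c -> 2
val testMapping  = fitIndex(test)  // c -> 0, b -> 1, a -> 2

// Applying the mapping learned from the training data keeps test labels consistent.
val testIndexed = test.map(trainMapping)
```

Fitting separately on each set gives "a" index 0 in training but index 2 in testing, while reusing `trainMapping` keeps "a" at 0 everywhere.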

I am processing the training/testing data in the exact same way, and don't save the model.

I have tried manually creating labels (by getting the index of the category), but I get this error:

java.lang.IllegalArgumentException: RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.

It seems that I must use StringIndexer, so how do I ensure that future datasets that I use for testing index the categories the same way as the training set?

Edit: adding the code for the workaround I tried

This is what the dataframe looks like, call it mydata

+--------+-----+---------+---------+
|category|label|        x|        y|
+--------+-----+---------+---------+
|       a|  0.0|-0.166992|-0.256348|
|       b|  1.0|-0.179199| -0.22998|
|       c|  2.0|-0.172119|-0.105713|
|       d|  3.0|-0.064209| 0.050293|

I use a VectorAssembler to prepare the features

val assembler = new VectorAssembler().setInputCols(Array("x", "y")).setOutputCol("features")

Transform mydata using the assembler above, which produces the features column

val predValues = assembler.transform(mydata)

The model expects two columns, features and label, and I want to use my own label for this. I select features from predValues

 val features = sqlContext.sql("SELECT features from predValues")

And select label from my df

 val labelDF = sqlContext.sql("SELECT label FROM filterFeaturesOnly")

And then join the two together so I'll have features and label to pass to model

val featuresAndLabels = features.join(labelDF)

This is what I am passing to the model, and I get the error mentioned above.

val label = predValues.join(labelDF)

Recommended answer

If you want to label things consistently, then you need to keep the fitted StringIndexer and reuse it.

Consider this sample code from the docs:

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

val indexed = indexer.fit(df).transform(df)

The indexer.fit(df) piece returns a StringIndexerModel, which then can run the transform function. So instead:

val indexerModel = indexer.fit(trainDF)
val indexed = indexerModel.transform(trainDF)

This later lets you call indexerModel.transform(testDF) to get the same labels for the same inputs.
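
As a fuller illustration, the following is a sketch rather than a definitive implementation. It assumes a local SparkSession and made-up `trainDF`/`testDF` contents, and the helper `saveAndReload` and its path argument are hypothetical. It fits the indexer once on the training data, transforms both sets with the same fitted model, and shows how the model could be persisted for later datasets.

```scala
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.SparkSession

// Local session for illustration only; a real job would already have one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("string-indexer-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up data: "a" dominates training, "c" dominates testing.
val trainDF = Seq("a", "a", "a", "b", "b", "c").toDF("category")
val testDF  = Seq("c", "c", "c", "b", "b", "a").toDF("category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")

// Fit once, on the training data only.
val indexerModel: StringIndexerModel = indexer.fit(trainDF)

// Both sets pass through the same fitted model, so each category
// receives the index it was given during training.
val trainIndexed = indexerModel.transform(trainDF)
val testIndexed  = indexerModel.transform(testDF)

// Hypothetical helper: persist the fitted model and load it back,
// so that datasets arriving in a later session are indexed the same way.
def saveAndReload(model: StringIndexerModel, path: String): StringIndexerModel = {
  model.write.overwrite().save(path)
  StringIndexerModel.load(path)
}
```

Categories that occur only in the test data are not known to the fitted model. StringIndexer's handleInvalid parameter controls how such rows are treated.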
