RandomForestClassifier was given input with invalid label column error in Apache Spark

Problem Description

I am trying to compute accuracy using 5-fold cross-validation with a Random Forest classifier model in Scala, but I am getting the following error while running:

java.lang.IllegalArgumentException: RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.

The above error occurs at this line: val cvModel = cv.fit(trainingData)

The code I used for cross-validation of the data set with random forest is as follows:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.textFile("exprogram/dataset.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(41).toDouble,
    Vectors.dense(parts(0).split(',').map(_.toDouble)))
}


val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)

val trainingData = training.toDF()

val testData = test.toDF()

val nFolds: Int = 5
val NumTrees: Int = 5

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(NumTrees)

val pipeline = new Pipeline()
  .setStages(Array(rf))

val paramGrid = new ParamGridBuilder()
  .build()

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("precision")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

val cvModel = cv.fit(trainingData)

val results = cvModel.transform(testData)
  .select("label", "prediction")
  .collect

val numCorrectPredictions = results.map(row =>
  if (row.getDouble(0) == row.getDouble(1)) 1 else 0).foldLeft(0)(_ + _)
val accuracy = 1.0D * numCorrectPredictions / results.size

println("Test set accuracy: %.3f".format(accuracy))

Can anyone please explain what the mistake is in the above code?

Recommended Answer

RandomForestClassifier, like many other ML algorithms, requires specific metadata to be set on the label column, and the label values must be integral values from [0, 1, 2, ..., #classes) represented as doubles. Typically this is handled by an upstream Transformer such as StringIndexer. Since you convert the labels manually, the metadata fields are not set and the classifier cannot confirm that these requirements are satisfied.

val df = Seq(
  (0.0, Vectors.dense(1, 0, 0, 0)),
  (1.0, Vectors.dense(0, 1, 0, 0)),
  (2.0, Vectors.dense(0, 0, 1, 0)),
  (2.0, Vectors.dense(0, 0, 0, 1))
).toDF("label", "features")

val rf = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setNumTrees(5)

rf.setLabelCol("label").fit(df)
// java.lang.IllegalArgumentException: RandomForestClassifier was given input ...

You can either re-encode the label column using StringIndexer:

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")
  .fit(df)

rf.setLabelCol("label_idx").fit(indexer.transform(df))

Alternatively, you can set the required metadata manually:

import org.apache.spark.ml.attribute.NominalAttribute

val meta = NominalAttribute
  .defaultAttr
  .withName("label")
  .withValues("0.0", "1.0", "2.0")
  .toMetadata

rf.setLabelCol("label_meta").fit(
  df.withColumn("label_meta", $"label".as("", meta))
)
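
As a quick sanity check (this verification step is not in the original answer), the attached metadata can be read back through the ml attribute API; the withMeta name below is just illustrative:

import org.apache.spark.ml.attribute.Attribute

// Attach the metadata and read it back from the column's schema; it should
// come back as a nominal attribute carrying the three declared values.
val withMeta = df.withColumn("label_meta", $"label".as("", meta))
Attribute.fromStructField(withMeta.schema("label_meta"))
// expected: a NominalAttribute with values "0.0", "1.0", "2.0"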

Note:

Labels created using StringIndexer depend on the frequency, not the value:

indexer.labels
// Array[String] = Array(2.0, 0.0, 1.0)
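
Because of that, a model trained on the indexed column returns predictions as indices rather than the original values. A small sketch (reusing the indexer, rf and df defined above; the predicted_label column name is illustrative) of mapping predictions back with IndexToString:

import org.apache.spark.ml.feature.IndexToString

// Train on the indexed labels, then translate the numeric predictions back
// into the original label strings via the indexer's label array.
val model = rf.setLabelCol("label_idx").fit(indexer.transform(df))

val converter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predicted_label")
  .setLabels(indexer.labels)

converter.transform(model.transform(indexer.transform(df)))
  .select("label", "predicted_label")
  .show()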

PySpark:

In Python, metadata fields can be set directly on the schema:

from pyspark.sql.types import StructField, DoubleType

StructField(
    "label", DoubleType(), False,
    {"ml_attr": {
        "name": "label",
        "type": "nominal", 
        "vals": ["0.0", "1.0", "2.0"]
    }}
)
