RandomForestClassifier was given input with invalid label column error in Apache Spark


Question

I am trying to find accuracy using 5-fold cross-validation with a Random Forest Classifier model in Scala, but I am getting the following error while running:

java.lang.IllegalArgumentException: RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.

I get the above error at the line: val cvModel = cv.fit(trainingData)

The code which I used for cross-validation of the data set using random forest is as follows:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.textFile("exprogram/dataset.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  // label in column 41; the preceding columns are the features
  LabeledPoint(parts(41).toDouble,
    Vectors.dense(parts.take(41).map(_.toDouble)))
}


val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)

val trainingData = training.toDF()

val testData = test.toDF()

val nFolds: Int = 5
val NumTrees: Int = 5

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(NumTrees)

val pipeline = new Pipeline()
      .setStages(Array(rf)) 

val paramGrid = new ParamGridBuilder()
          .build()

val evaluator = new MulticlassClassificationEvaluator()
    .setLabelCol("label")
    .setPredictionCol("prediction")
    .setMetricName("precision") 

val cv = new CrossValidator()
   .setEstimator(pipeline)
   .setEvaluator(evaluator) 
   .setEstimatorParamMaps(paramGrid)
   .setNumFolds(nFolds)

val cvModel = cv.fit(trainingData)

val results = cvModel.transform(testData)
  .select("label", "prediction")
  .collect()

val numCorrectPredictions = results.map(row =>
  if (row.getDouble(0) == row.getDouble(1)) 1 else 0
).foldLeft(0)(_ + _)
val accuracy = 1.0D * numCorrectPredictions / results.size

println("Test set accuracy: %.3f".format(accuracy))

Can anyone please explain what the mistake in the above code is?

Answer

RandomForestClassifier, like many other ML algorithms, requires specific metadata to be set on the label column, and requires the label values to be integral values from [0, 1, 2, ..., #classes) represented as doubles. Typically this is handled by upstream Transformers like StringIndexer. Since you convert the labels manually, the metadata fields are not set, so the classifier cannot confirm that these requirements are satisfied:

val df = Seq(
  (0.0, Vectors.dense(1, 0, 0, 0)),
  (1.0, Vectors.dense(0, 1, 0, 0)),
  (2.0, Vectors.dense(0, 0, 1, 0)),
  (2.0, Vectors.dense(0, 0, 0, 1))
).toDF("label", "features")

val rf = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setNumTrees(5)

rf.setLabelCol("label").fit(df)
// java.lang.IllegalArgumentException: RandomForestClassifier was given input ...

You can either re-encode the label column using StringIndexer:

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")
  .fit(df)

rf.setLabelCol("label_idx").fit(indexer.transform(df))

Alternatively, you can set the required metadata manually:

import org.apache.spark.ml.attribute.NominalAttribute

val meta = NominalAttribute
  .defaultAttr
  .withName("label")
  .withValues("0.0", "1.0", "2.0")
  .toMetadata

rf.setLabelCol("label_meta").fit(
  df.withColumn("label_meta", $"label".as("", meta))
)
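
To double-check that the attribute actually landed on the column (a quick sanity check, not part of the original answer), inspect the metadata on the resulting schema field:

// The attached ml_attr metadata is visible on the column's StructField.
val withMeta = df.withColumn("label_meta", $"label".as("", meta))
println(withMeta.schema("label_meta").metadata)
// prints the ml_attr JSON with the name, type ("nominal") and values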

Note:

Labels created using StringIndexer depend on the frequency, not the value:

indexer.labels
// Array[String] = Array(2.0, 0.0, 1.0)
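
Because of this frequency-based ordering, the model's predictions are indices rather than the original labels; they can be mapped back with IndexToString. A short sketch, reusing the indexer fitted above:

import org.apache.spark.ml.feature.IndexToString

// Reverse the index -> label mapping using the labels stored in the
// fitted StringIndexerModel.
val converter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predicted_label")
  .setLabels(indexer.labels)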

PySpark

In Python, metadata fields can be set directly on the schema:

from pyspark.sql.types import StructField, DoubleType

StructField(
    "label", DoubleType(), False,
    {"ml_attr": {
        "name": "label",
        "type": "nominal", 
        "vals": ["0.0", "1.0", "2.0"]
    }}
)

