Cannot run RandomForestClassifier from spark ML on a simple example


Problem description

I have tried to run the experimental RandomForestClassifier from the spark.ml package (version 1.5.2). The dataset I used is from the LogisticRegression example in the Spark ML guide.

Here is the code:

import org.apache.spark.ml.classification.{LogisticRegression, RandomForestClassifier}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row

// Prepare training data from a list of (label, features) tuples.
val training = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

val rf = new RandomForestClassifier()

val model = rf.fit(training)

Here is the error I get:

java.lang.IllegalArgumentException: RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.
    at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:87)
    at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:42)
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:48)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:61)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:63)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:65)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:67)
    at $iwC$$iwC$$iwC.<init>(<console>:69)
    at $iwC$$iwC.<init>(<console>:71)
    at $iwC.<init>(<console>:73)
    at <init>(<console>:75)
    at .<init>(<console>:79)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The problem appears when the function tries to compute the number of classes in the column "label".

As you can see at line 84 in the source code of RandomForestClassifier, the function calls the DataFrame.schema function with the parameter "label". This call is OK and returns an org.apache.spark.sql.types.StructField object. Then, the function org.apache.spark.ml.util.MetadataUtils.getNumClasses is called. As it does not return the expected output, an exception is raised at line 87.
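
For illustration, here is a small sketch (assuming the spark-shell session and the training DataFrame above) of the check that getNumClasses performs, namely reading the ML attribute stored in the label column's metadata:

import org.apache.spark.ml.attribute.Attribute

// getNumClasses inspects the ML attribute attached to the label column.
// A plain Double column built with toDF carries no such attribute, so the
// number of classes cannot be inferred and the exception above is thrown.
val labelField = training.schema("label")
println(labelField.metadata)          // {} : empty, no attribute information
Attribute.fromStructField(labelField) // resolves to UnresolvedAttribute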

After a quick glance at the getNumClasses source code, I suppose that the error is due to the fact that the data in the column "label" are neither BinaryAttribute nor NominalAttribute. However, I do not know how to fix this problem.

My question:

How can I fix this problem?

Thanks a lot for reading my question and for your help!

Answer

Let's first fix the imports to remove ambiguity:

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorIndexer}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.mllib.linalg.Vectors

I'll use the same data you used:

val training = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

Then let's create the pipeline stages:

val stages = new scala.collection.mutable.ArrayBuffer[PipelineStage]()

  1. For classification, re-index the classes:

val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(training)
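
As a quick sanity check (optional, not part of the pipeline itself), you can inspect the mapping the fitted indexer has learned:

// labels holds the fitted mapping: the string at position i receives index i
println(labelIndexer.labels.mkString(", "))
labelIndexer.transform(training).select("label", "indexedLabel").show()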

  2. Identify categorical features using VectorIndexer:

val featuresIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(10).fit(training)
stages += featuresIndexer

val tmp = featuresIndexer.transform(labelIndexer.transform(training))
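
If you want to see which features the VectorIndexer decided to treat as categorical, its public categoryMaps field exposes the learned mapping (a quick inspection sketch):

// categoryMaps: feature index -> (original value -> category index);
// features with more than maxCategories distinct values stay continuous
println(featuresIndexer.categoryMaps)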

  3. Train the random forest:

val rf = new RandomForestClassifier().setFeaturesCol(featuresIndexer.getOutputCol).setLabelCol(labelIndexer.getOutputCol)

stages += rf
val pipeline = new Pipeline().setStages(stages.toArray)

// Fit the Pipeline
val pipelineModel = pipeline.fit(tmp)

val results = pipelineModel.transform(training)

results.show

//+-----+--------------+---------------+-------------+-----------+----------+
//|label|      features|indexedFeatures|rawPrediction|probability|prediction|
//+-----+--------------+---------------+-------------+-----------+----------+
//|  1.0| [0.0,1.1,0.1]|  [0.0,1.0,2.0]|   [1.0,19.0]|[0.05,0.95]|       1.0|
//|  0.0|[2.0,1.0,-1.0]|  [1.0,0.0,0.0]|   [17.0,3.0]|[0.85,0.15]|       0.0|
//|  0.0| [2.0,1.3,1.0]|  [1.0,3.0,3.0]|   [14.0,6.0]|  [0.7,0.3]|       0.0|
//|  1.0|[0.0,1.2,-0.5]|  [0.0,2.0,1.0]|   [1.0,19.0]|[0.05,0.95]|       1.0|
//+-----+--------------+---------------+-------------+-----------+----------+
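
Note that the prediction column lives in the indexed label space produced by the StringIndexer. To map predictions back to the original labels explicitly, a sketch using IndexToString (also in org.apache.spark.ml.feature) would look like this:

import org.apache.spark.ml.feature.IndexToString

// Convert indexed predictions back to the original label strings,
// reusing the mapping learned by labelIndexer
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)

labelConverter.transform(results).select("label", "predictedLabel", "probability").show()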

References: Concerning steps 1 and 2, for those who want more details on feature transformers, I advise you to read the official documentation here.
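
As an aside, and not part of the original answer: since the root cause is the missing label metadata, one can also, to the best of my knowledge, attach a NominalAttribute to the label column directly and skip the StringIndexer. A minimal sketch, assuming exactly two classes:

import org.apache.spark.ml.attribute.NominalAttribute

// Declare the label column as nominal with two values, so that
// MetadataUtils.getNumClasses can find the number of classes
val labelMeta = NominalAttribute.defaultAttr.withName("label").withValues("0.0", "1.0").toMetadata()

val trainingWithMeta = training.select(training("label").as("label", labelMeta), training("features"))
val model = new RandomForestClassifier().fit(trainingWithMeta)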
