Cannot run RandomForestClassifier from spark ML on a simple example


Question

I have tried to run the experimental RandomForestClassifier from the spark.ml package (version 1.5.2). The dataset I used is from the LogisticRegression example in the Spark ML guide.

Here is the code:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.classification.RandomForestClassifier // required for rf below
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.sql.Row
    
    // Prepare training data from a list of (label, features) tuples.
    val training = sqlContext.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")
    
    val rf = new RandomForestClassifier()
    
    val model = rf.fit(training)
    

And here is the error I obtain:

    java.lang.IllegalArgumentException: RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.
        at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:87)
        at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:42)
        at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:48)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:61)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:63)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:65)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:67)
        at $iwC$$iwC$$iwC.<init>(<console>:69)
        at $iwC$$iwC.<init>(<console>:71)
        at $iwC.<init>(<console>:73)
        at <init>(<console>:75)
        at .<init>(<console>:79)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
        at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
        at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    

The problem appears when the function tries to compute the number of classes in the column "label".

As you can see at line 84 in the source code of RandomForestClassifier, the function calls the DataFrame.schema function with the parameter "label". This call is fine and returns an org.apache.spark.sql.types.StructField object. Then, the function org.apache.spark.ml.util.MetadataUtils.getNumClasses is called. As it does not return the expected output, an exception is raised at line 87.
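
You can check this from the shell: the "label" column of the training DataFrame defined above carries no ML attribute metadata (the exact rendering of the printed values may differ):

    // The "label" column is a plain DoubleType field with empty metadata,
    // so Spark has no attribute information to infer the number of classes from.
    val labelField = training.schema("label")
    println(labelField.dataType)  // DoubleType
    println(labelField.metadata)  // {}  (empty metadata)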

After a quick glance at the getNumClasses source code, I suppose that the error is due to the fact that the data in the column "label" are neither a BinaryAttribute nor a NominalAttribute. However, I do not know how to fix this problem.
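
For reference, here is roughly what the failing check does, paraphrased from the MetadataUtils source (the exact code may differ between versions); when it returns None, train throws the IllegalArgumentException shown above:

    import org.apache.spark.ml.attribute._
    import org.apache.spark.sql.types.StructField

    // getNumClasses only succeeds when the label column's metadata describes
    // a binary or nominal (categorical) attribute.
    def getNumClasses(labelSchema: StructField): Option[Int] = {
      Attribute.fromStructField(labelSchema) match {
        case binAttr: BinaryAttribute  => Some(2)               // binary => 2 classes
        case nomAttr: NominalAttribute => nomAttr.getNumValues  // nominal => declared count
        case _                         => None                  // plain numeric => unknown
      }
    }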

My question:

How can I fix this problem?

Thanks a lot for reading my question and for your help!

Solution

Let's first fix the imports to remove the ambiguity:

    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.{StringIndexer, VectorIndexer}
    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.mllib.linalg.Vectors
    

I'll use the same data you used:

    val training = sqlContext.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")
    

and then create the Pipeline stages:

    val stages = new scala.collection.mutable.ArrayBuffer[PipelineStage]()
    

1. For classification, re-index the classes:

    val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(training)
    

2. Identify categorical features using VectorIndexer:

    val featuresIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(10).fit(training)
    stages += featuresIndexer
    
    val tmp = featuresIndexer.transform(labelIndexer.transform(training))
    

3. Learn the Random Forest:

    val rf = new RandomForestClassifier().setFeaturesCol(featuresIndexer.getOutputCol).setLabelCol(labelIndexer.getOutputCol)
    
    stages += rf
    val pipeline = new Pipeline().setStages(stages.toArray)
    
    // Fit the Pipeline
    val pipelineModel = pipeline.fit(tmp)
    
    val results = pipelineModel.transform(training)
    
    results.show
    
    //+-----+--------------+---------------+-------------+-----------+----------+
    //|label|      features|indexedFeatures|rawPrediction|probability|prediction|
    //+-----+--------------+---------------+-------------+-----------+----------+
    //|  1.0| [0.0,1.1,0.1]|  [0.0,1.0,2.0]|   [1.0,19.0]|[0.05,0.95]|       1.0|
    //|  0.0|[2.0,1.0,-1.0]|  [1.0,0.0,0.0]|   [17.0,3.0]|[0.85,0.15]|       0.0|
    //|  0.0| [2.0,1.3,1.0]|  [1.0,3.0,3.0]|   [14.0,6.0]|  [0.7,0.3]|       0.0|
    //|  1.0|[0.0,1.2,-0.5]|  [0.0,2.0,1.0]|   [1.0,19.0]|[0.05,0.95]|       1.0|
    //+-----+--------------+---------------+-------------+-----------+----------+
    
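
Note that since the model was trained on indexedLabel, the prediction column above is on the indexed scale (in this example it happens to line up with the original labels). If you need the original label values back, a small sketch using IndexToString (my addition, not part of the original answer) could look like this:

    import org.apache.spark.ml.feature.IndexToString

    // Map the indexed predictions back to the original label values,
    // using the labels learned by the StringIndexer above.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    labelConverter.transform(results).select("label", "predictedLabel").show()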

Note: the Pipeline handling is a bit tricky and I suspect there may be bugs in it, so I'm not sure the code is optimal, but it will do the job.
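
If you only need to get past the error in the question, a more minimal sketch (my variant, not the original answer) is to index just the label, since the exception is only about the missing label metadata; StringIndexer attaches the nominal-attribute metadata that getNumClasses looks for:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.StringIndexer

    // Index only the label column; its output carries the class-count metadata.
    // The features are left as-is and treated as continuous.
    val labelIndexer2 = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")

    val rf2 = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("features")

    val pipeline2 = new Pipeline().setStages(Array(labelIndexer2, rf2))
    val model2 = pipeline2.fit(training)
    model2.transform(training).show()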

References: concerning steps 1 and 2, for those who want more details on the feature transformers, I advise you to read the official documentation here.
