Why can't I load a PySpark RandomForestClassifier model?


Problem Description

I can't load a RandomForestClassificationModel saved by Spark.

Environment: Apache Spark 2.0.1, standalone mode running on a small (4 machine) cluster. No HDFS - everything is saved to local disks.

Building and saving the model:

from pyspark.ml.classification import RandomForestClassifier

classifier = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)
model = classifier.fit(train)   # "train" is a prepared DataFrame with label/features columns
result = model.transform(test)
model.write().save("/tmp/models/20161030-RF-topics-cats.model")

Later, in a separate program:

from pyspark.ml.classification import RandomForestClassificationModel

model = RandomForestClassificationModel.load("/tmp/models/20161029-RF-topics-cats.model")

This gives:

Py4JJavaError: An error occurred while calling o81.load.
: org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /tmp/models/20161029-RF-topics-cats.model/treesMetadata. It must be specified manually;
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:411)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:411)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:410)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:439)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:423)
    at org.apache.spark.ml.tree.EnsembleModelReadWrite$.loadImpl(treeModels.scala:441)
    at org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:301

I'd note that the same code works when I use a Naive Bayes classifier.

Recommended Answer

Saving the model to HDFS, and later reading the model from HDFS might solve your problem.

You have 4 nodes, each node has its own local-disk. You are using model.write().save("/temp/xxx")

Later, in a separate program: You are using load("/temp/xxx")

Since there are 4 nodes, with 4 different local disks, it isn't clear what exactly is saved (and to which local disk) during the write().save() operation, or what exactly load() reads, and from which local disk.
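A minimal sketch of the suggested fix: use a fully qualified HDFS URI for both save and load, so the driver and every executor resolve the same shared path instead of their own local disks. The namenode address `hdfs://master:9000` below is hypothetical; substitute your cluster's.

```python
# Hypothetical HDFS namenode address -- replace with your cluster's.
HDFS_BASE = "hdfs://master:9000/models"

def model_uri(name):
    """Build a fully qualified HDFS URI so the driver and all executors
    resolve the same shared location, rather than node-local /tmp paths."""
    return "{}/{}".format(HDFS_BASE, name)

# Saving (in the training program):
#   model.write().save(model_uri("20161030-RF-topics-cats.model"))
#
# Loading (in a separate program):
#   from pyspark.ml.classification import RandomForestClassificationModel
#   model = RandomForestClassificationModel.load(model_uri("20161030-RF-topics-cats.model"))
```

The `save`/`load` calls themselves are unchanged; only the path scheme differs. A shared NFS mount visible at the same path on all nodes would serve the same purpose.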
