Pyspark ML - How to save pipeline and RandomForestClassificationModel
Question
I am unable to save a random forest model generated with the ml package of Spark's Python API.
>>> rf = RandomForestClassifier(labelCol="label", featuresCol="features")
>>> pipeline = Pipeline(stages=early_stages + [rf])
>>> model = pipeline.fit(trainingData)
>>> model.save("fittedpipeline")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'PipelineModel' object has no attribute 'save'
>>> rfModel = model.stages[8]
>>> print(rfModel)
RandomForestClassificationModel (uid=rfc_46c07f6d7ac8) with 20 trees
>>> rfModel.save("rfmodel")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'RandomForestClassificationModel' object has no attribute 'save'
I also tried passing 'sc' as the first parameter to the save method.
Answer
The main issue with your code is that you are using a version of Apache Spark prior to 2.0.0. Thus, save isn't available yet for the Pipeline API.
Here is a full example compiled from the official documentation. Let's create our pipeline first:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
label_indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
labels = label_indexer.fit(data).labels
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
feature_indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4)
early_stages = [label_indexer, feature_indexer]
# Split the data into training and test sets (30% held out for testing)
(train, test) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)
# Convert indexed labels back to original labels.
label_converter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labels)
# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=early_stages + [rf, label_converter])
# Train model. This also runs the indexers.
model = pipeline.fit(train)
You can now save the pipeline:
>>> model.save("/tmp/rf")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
You can also save the RF model:
>>> rf_model = model.stages[2]
>>> print(rf_model)
RandomForestClassificationModel (uid=rfc_b368678f4122) with 10 trees
>>> rf_model.save("/tmp/rf_2")