Pyspark ML - How to save pipeline and RandomForestClassificationModel


Problem description


I am unable to save a random forest model generated using the ml package of python/spark.

>>> rf = RandomForestClassifier(labelCol="label", featuresCol="features")
>>> pipeline = Pipeline(stages=early_stages + [rf])
>>> model = pipeline.fit(trainingData)
>>> model.save("fittedpipeline")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'PipelineModel' object has no attribute 'save'

>>> rfModel = model.stages[8]
>>> print(rfModel)

RandomForestClassificationModel (uid=rfc_46c07f6d7ac8) with 20 trees

>>> rfModel.save("rfmodel")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'RandomForestClassificationModel' object has no attribute 'save'

I also tried passing 'sc' as the first parameter to the save method.

Solution

The main issue with your code is that you are using a version of Apache Spark prior to 2.0.0. Thus, save isn't available yet for the Pipeline API.
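As a quick sanity check before calling save, you can compare the version string exposed by the running session against that 2.0.0 threshold. A minimal sketch (the helper name below is illustrative, not a pyspark API):

```python
# Minimal sketch: ML persistence (save/load on Pipeline, PipelineModel and
# the fitted models) was added in Apache Spark 2.0.0, so guard on the
# major version first. supports_ml_persistence is a hypothetical helper.
def supports_ml_persistence(version):
    """Return True for Spark version strings >= 2.0.0 (e.g. '2.1.0')."""
    major = int(version.split(".")[0])
    return major >= 2

# In a live session, pass spark.version (or sc.version):
print(supports_ml_persistence("1.6.3"))  # False
print(supports_ml_persistence("2.3.1"))  # True
```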

Here is a full example adapted from the official documentation. Let's create our pipeline first:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
label_indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
labels = label_indexer.fit(data).labels

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
feature_indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4)

early_stages = [label_indexer, feature_indexer]

# Split the data into training and test sets (30% held out for testing)
(train, test) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)

# Convert indexed labels back to original labels.
label_converter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labels)

# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=early_stages + [rf, label_converter])

# Train model. This also runs the indexers.
model = pipeline.fit(train)

You can now save your pipeline:

>>> model.save("/tmp/rf")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

You can also save the RF model:

>>> rf_model = model.stages[2]
>>> print(rf_model)
RandomForestClassificationModel (uid=rfc_b368678f4122) with 10 trees
>>> rf_model.save("/tmp/rf_2")
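For completeness, the saved artifacts can be reloaded in a later session with the matching load methods. A sketch assuming the same paths as above and a running Spark 2.0.0+ session (not executable without one):

```python
from pyspark.ml import PipelineModel
from pyspark.ml.classification import RandomForestClassificationModel

# Reload the full fitted pipeline (all stages, including the forest).
same_model = PipelineModel.load("/tmp/rf")

# Reload just the forest stage that was saved separately.
same_rf_model = RandomForestClassificationModel.load("/tmp/rf_2")

# The reloaded pipeline scores new data exactly like the original:
# predictions = same_model.transform(test)
```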
