Pyspark - Get all parameters of models created with ParamGridBuilder

Question

I'm using PySpark 2.0 for a Kaggle competition. I'd like to know how a model (RandomForest) behaves depending on different parameters. ParamGridBuilder() lets you specify different values for individual parameters and then performs (I guess) a Cartesian product of the entire set of parameters. Assuming my DataFrame is already defined:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

rdc = RandomForestClassifier()
pipeline = Pipeline(stages=STAGES + [rdc])
paramGrid = (ParamGridBuilder()
             .addGrid(rdc.maxDepth, [3, 10, 20])
             .addGrid(rdc.minInfoGain, [0.01, 0.001])
             .addGrid(rdc.numTrees, [5, 10, 20, 30])
             .build())
evaluator = MulticlassClassificationEvaluator()
valid = TrainValidationSplit(estimator=pipeline,
                             estimatorParamMaps=paramGrid,
                             evaluator=evaluator,
                             trainRatio=0.50)
model = valid.fit(df)
result = model.bestModel.transform(df)
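
As a quick sanity check (a sketch using the names above), the grid is the Cartesian product of the three value lists, so it should contain 3 × 2 × 4 = 24 param maps:

print(len(paramGrid))   # expected: 24 parameter combinations (3 * 2 * 4)
print(paramGrid[0])     # each entry is a dict mapping Param -> value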

OK, so now I'm able to retrieve simple information with a hand-written function:

def evaluate(result):
    predictionAndLabels = result.select("prediction", "label")
    metrics = ["f1","weightedPrecision","weightedRecall","accuracy"]
    for m in metrics:
        evaluator = MulticlassClassificationEvaluator(metricName=m)
        print(str(m) + ": " + str(evaluator.evaluate(predictionAndLabels)))

Now I'd like to know several things:

  • What are the parameters of the best model? This post partially answers the question: How to extract model hyper-parameters from spark.ml in PySpark?
  • What are the parameters of all models?
  • What are the results (i.e. recall, accuracy, etc.) of each model? I only found print(model.validationMetrics), which displays (it seems) a list containing the accuracy of each model, but I can't tell which model each entry refers to.

If I can retrieve all that information, I should be able to display graphs and bar charts, and work as I do with Pandas and sklearn.

Answer

Spark 2.4+

SPARK-21088 CrossValidator, TrainValidationSplit should collect all models when fitting - adds support for collecting submodels.

By default this behavior is disabled, but it can be controlled using the collectSubModels Param (setCollectSubModels).

valid = TrainValidationSplit(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    collectSubModels=True)   # keep every fitted sub-model, not just the best one

model = valid.fit(df)

model.subModels
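
Each collected sub-model lines up with the corresponding entries of getEstimatorParamMaps() and validationMetrics, so you can pair them up, for example like this (a sketch reusing the variable names above and assuming the RandomForest is the last pipeline stage):

for sub_model, param_map, metric in zip(model.subModels,
                                        model.getEstimatorParamMaps(),
                                        model.validationMetrics):
    rf_model = sub_model.stages[-1]  # the fitted RandomForestClassificationModel
    print({p.name: v for p, v in param_map.items()}, metric)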

Spark < 2.4

Long story short, you simply cannot get the parameters for all models because, similarly to CrossValidator, TrainValidationSplitModel retains only the best model. These classes are designed for semi-automated model selection, not exploration or experimentation.

What are the parameters of all models?

While you cannot retrieve the actual models, validationMetrics correspond to the input Params, so you should be able to simply zip the two:

from typing import Dict, Tuple, List, Any
from pyspark.ml.param import Param
from pyspark.ml.tuning import TrainValidationSplitModel

EvalParam = List[Tuple[float, Dict[Param, Any]]]

def get_metrics_and_params(model: TrainValidationSplitModel) -> EvalParam:
    return list(zip(model.validationMetrics, model.getEstimatorParamMaps()))

to get some information about the relationship between metrics and parameters.
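
For example (a usage sketch; the dict comprehension over Param.name is just to make the output readable), you can pick out the best metric/parameter pair or feed the list into pandas for plotting:

metrics_and_params = get_metrics_and_params(model)
best_metric, best_params = max(metrics_and_params, key=lambda pair: pair[0])
print(best_metric, {p.name: v for p, v in best_params.items()})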

If you need more information, you should use Pipeline Params. This will preserve all models, which can be used for further processing:

models = pipeline.fit(df, params=paramGrid)

It will generate a list of the PipelineModels corresponding to the params argument:

zip(models, paramGrid)
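
From there (a sketch assuming the evaluate() helper defined in the question and the paramGrid built above) you can score every fitted pipeline yourself:

for fitted, param_map in zip(models, paramGrid):
    print({p.name: v for p, v in param_map.items()})
    evaluate(fitted.transform(df))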
