PySpark - Get all parameters of models created with ParamGridBuilder
Question
I'm using PySpark 2.0 for a Kaggle competition. I'd like to know how a model (RandomForest) behaves depending on different parameters. ParamGridBuilder() lets you specify several values for each parameter and then builds (I guess) the Cartesian product of the entire set of parameters. Assuming my DataFrame is already defined:
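from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# STAGES is assumed to be a list of preprocessing stages defined elsewhere
rdc = RandomForestClassifier()
pipeline = Pipeline(stages=STAGES + [rdc])
# Parentheses allow the chained calls to span several lines
paramGrid = (ParamGridBuilder()
             .addGrid(rdc.maxDepth, [3, 10, 20])
             .addGrid(rdc.minInfoGain, [0.01, 0.001])
             .addGrid(rdc.numTrees, [5, 10, 20, 30])
             .build())
evaluator = MulticlassClassificationEvaluator()
valid = TrainValidationSplit(estimator=pipeline,
                             estimatorParamMaps=paramGrid,
                             evaluator=evaluator,
                             trainRatio=0.50)
model = valid.fit(df)
result = model.bestModel.transform(df)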
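To make the Cartesian product guess concrete, here is a minimal pure-Python sketch of what such a grid looks like. It does not call Spark: the `build_grid` helper is hypothetical and only mimics the combinatorics, using plain string keys instead of Param objects.

```python
from itertools import product

def build_grid(options):
    """Return one dict per combination of the given parameter values.

    Hypothetical helper, not part of PySpark: it only illustrates the
    Cartesian product that a parameter grid represents.
    """
    names = list(options)
    value_lists = [options[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

grid = build_grid({
    "maxDepth": [3, 10, 20],
    "minInfoGain": [0.01, 0.001],
    "numTrees": [5, 10, 20, 30],
})
print(len(grid))  # 3 * 2 * 4 = 24 combinations
print(grid[0])
```

With the values from the question this gives 24 combinations, so a single train/validation split would fit 24 candidate models.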
OK, so now I'm able to retrieve simple information with a handmade function:
def evaluate(result):
    predictionAndLabels = result.select("prediction", "label")
    metrics = ["f1", "weightedPrecision", "weightedRecall", "accuracy"]
    for m in metrics:
        evaluator = MulticlassClassificationEvaluator(metricName=m)
        print(str(m) + ": " + str(evaluator.evaluate(predictionAndLabels)))
Now I'd like to know a few things:
- What are the parameters of the best model? This post partially answers the question: How to extract model hyper-parameters from spark.ml in PySpark?
- What are the parameters of all models?
- What are the results (recall, accuracy, etc.) of each model? I only found
print(model.validationMetrics)
which displays (it seems) a list containing the accuracy of each model, but I can't tell which model each value refers to.
If I could retrieve all that information, I should be able to display graphs and bar charts, and work as I do with Pandas and sklearn.
Recommended answer
Spark 2.4+
SPARK-21088 (CrossValidator, TrainValidationSplit should collect all models when fitting) adds support for collecting submodels.
By default this behavior is disabled, but it can be controlled using the collectSubModels Param (setCollectSubModels).
valid = TrainValidationSplit(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    collectSubModels=True)
model = valid.fit(df)
model.subModels
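Assuming that subModels on a TrainValidationSplitModel is a flat list in the same order as the estimator param maps (for a CrossValidatorModel it is nested per fold, which this sketch does not handle), each submodel can be paired with its parameters and validation metric. The `pair_submodels` helper is hypothetical, and it is demonstrated on a stand-in object with placeholder values rather than on a real fitted model:

```python
from types import SimpleNamespace

def pair_submodels(tvs_model):
    """Pair each param map with its validation metric and submodel."""
    return list(zip(tvs_model.getEstimatorParamMaps(),
                    tvs_model.validationMetrics,
                    tvs_model.subModels))

# Stand-in exposing only the three members used above; values are placeholders
fake_model = SimpleNamespace(
    getEstimatorParamMaps=lambda: [{"numTrees": 5}, {"numTrees": 10}],
    validationMetrics=[0.71, 0.74],
    subModels=["submodel_0", "submodel_1"],
)
pairs = pair_submodels(fake_model)
for params, metric, submodel in pairs:
    print(params, metric, submodel)
```

With a real model fitted with collectSubModels=True, the same call would be `pair_submodels(model)`.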
Spark < 2.4
Long story short, you simply cannot get the parameters of all models because, like CrossValidatorModel, TrainValidationSplitModel retains only the best model. These classes are designed for semi-automated model selection, not for exploration or experiments.
What are the parameters of all models?
While you cannot retrieve the actual models, validationMetrics correspond to the input Params, so you should be able to simply zip both:
from typing import Dict, Tuple, List, Any
from pyspark.ml.param import Param
from pyspark.ml.tuning import TrainValidationSplitModel

EvalParam = List[Tuple[float, Dict[Param, Any]]]

def get_metrics_and_params(model: TrainValidationSplitModel) -> EvalParam:
    return list(zip(model.validationMetrics, model.getEstimatorParamMaps()))
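Because the param maps are keyed by Param objects, it can help to flatten them to plain names before plotting or loading the results into pandas. A minimal sketch, assuming each key exposes a `name` attribute as pyspark.ml.param.Param does; the `to_rows` helper and the `FakeParam` stand-in are hypothetical, and the metric values are placeholders:

```python
from collections import namedtuple

# Stand-in for pyspark.ml.param.Param; only the `name` attribute is used
FakeParam = namedtuple("FakeParam", ["name"])

def to_rows(metrics_and_params, metric_name="metric"):
    """Flatten (metric, {Param: value}) pairs into dicts, highest metric first."""
    rows = []
    for metric, param_map in metrics_and_params:
        row = {param.name: value for param, value in param_map.items()}
        row[metric_name] = metric
        rows.append(row)
    # Assumes a larger metric is better, as with f1 or accuracy
    return sorted(rows, key=lambda r: r[metric_name], reverse=True)

max_depth, num_trees = FakeParam("maxDepth"), FakeParam("numTrees")
pairs = [
    (0.71, {max_depth: 3, num_trees: 5}),
    (0.78, {max_depth: 10, num_trees: 20}),
]
rows = to_rows(pairs)
for row in rows:
    print(row)
```

With a fitted model the input would be the list of (metric, param map) pairs obtained by zipping validationMetrics with the estimator param maps, and the resulting list of dicts can be passed to `pandas.DataFrame(rows)` for charting.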
to get some insight into the relationship between metrics and parameters.
If you need more information, you should use Pipeline Params. This preserves all the models, which can then be used for further processing:
models = pipeline.fit(df, params=paramGrid)
It will generate a list of PipelineModels corresponding to the params argument:
zip(models, paramGrid)
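Each fitted model can then be scored with several metrics, which covers the per-model recall and accuracy part of the question. A sketch, assuming every model has a `transform` method and every evaluator an `evaluate` method, as PipelineModel and MulticlassClassificationEvaluator do; `score_models` and the stand-in classes are hypothetical and exist only so the example runs without a Spark session:

```python
def score_models(models, param_maps, dataset, evaluators):
    """Return one (param_map, {metric_name: score}) pair per fitted model."""
    results = []
    for model, param_map in zip(models, param_maps):
        predictions = model.transform(dataset)
        scores = {name: ev.evaluate(predictions) for name, ev in evaluators.items()}
        results.append((param_map, scores))
    return results

# Stand-ins that mimic the transform/evaluate interface on plain lists
class FakeModel:
    def __init__(self, offset):
        self.offset = offset

    def transform(self, dataset):
        return [x + self.offset for x in dataset]

class SumEvaluator:
    def evaluate(self, predictions):
        return sum(predictions)

results = score_models(
    models=[FakeModel(0), FakeModel(1)],
    param_maps=[{"numTrees": 5}, {"numTrees": 10}],
    dataset=[1, 2, 3],
    evaluators={"sum": SumEvaluator()},
)
print(results)
```

In Spark, `evaluators` could be a dict mapping each metric name from the question's `evaluate` function to a MulticlassClassificationEvaluator with that metricName, and `dataset` would ideally be held-out data rather than the training DataFrame.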