How to get Precision/Recall using CrossValidator for training a NaiveBayes model using Spark

Question

Suppose I have a Pipeline like this:

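import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
val idf = new IDF().setInputCol("features").setOutputCol("idffeatures")
// Read the IDF output; by default NaiveBayes would read the raw "features" column instead
val nb = new NaiveBayes().setFeaturesCol("idffeatures")
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb))
val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures, Array(10, 100, 1000)).addGrid(nb.smoothing, Array(0.01, 0.1, 1.0)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator()).setEstimatorParamMaps(paramGrid).setNumFolds(10)
val cvModel = cv.fit(df)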

As you can see, I defined a CrossValidator using a BinaryClassificationEvaluator. I have seen a lot of examples that compute metrics like Precision/Recall during the testing process, but those metrics are obtained by evaluating on a separate set of data held out for testing (see, for example, this documentation).

From my understanding, CrossValidator is going to create folds, and one fold will be used for testing purposes; CrossValidator will then choose the best model. My question is: is it possible to get Precision/Recall metrics during the training process?

Recommended answer

Well, the only metric that is actually stored is the one you define when you create the Evaluator instance. For BinaryClassificationEvaluator, it can take one of two values:

  • areaUnderROC
  • areaUnderPR

The former is the default; the metric can be changed with the setMetricName method.
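As a minimal sketch of switching the metric, assuming Spark ML is on the classpath (the name prEvaluator is only illustrative), the evaluator could be configured like this before it is passed to CrossValidator.setEvaluator:

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Use the area under the precision-recall curve instead of the
// default areaUnderROC. prEvaluator is an illustrative name.
val prEvaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderPR")
```

Note that areaUnderPR summarizes the whole precision-recall curve in one number; it is not a precision or recall value at a particular threshold.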

These values are collected during the training process and can be accessed using CrossValidatorModel.avgMetrics. Each value is the chosen metric averaged over the folds for one parameter combination, and the order of the values corresponds to the order of the EstimatorParamMaps (CrossValidatorModel.getEstimatorParamMaps).
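Because the two arrays share the same order, they can be zipped to see which parameter combination produced which score. The sketch below is plain Scala so that it runs without Spark: rankByMetric is a hypothetical helper (not part of Spark), the settings and scores are made-up stand-in values, and it assumes a larger metric is better, which holds for areaUnderROC and areaUnderPR.

```scala
// Hypothetical helper (not part of Spark): pair each parameter setting with
// its averaged metric and sort best-first, assuming larger is better.
def rankByMetric[A](settings: Seq[A], avgMetrics: Seq[Double]): Seq[(A, Double)] = {
  require(settings.length == avgMetrics.length, "expected one metric per parameter setting")
  settings.zip(avgMetrics).sortBy { case (_, metric) => -metric }
}

// Made-up stand-in values for illustration only; they are not real results.
val ranked = rankByMetric(
  Seq("smoothing=0.01", "smoothing=0.1", "smoothing=1.0"),
  Seq(0.81, 0.86, 0.84)
)

// Print one line per setting, best score first.
ranked.foreach { case (setting, metric) => println(f"$setting%-16s $metric%.2f") }
```

With the model from the question, the call would presumably look like rankByMetric(cvModel.getEstimatorParamMaps.toSeq, cvModel.avgMetrics.toSeq), where each ParamMap takes the place of the stand-in strings.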
