How to get Precision/Recall using CrossValidator for training a NaiveBayes model using Spark
Problem description
Suppose I have a Pipeline like this:
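import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
val idf = new IDF().setInputCol("features").setOutputCol("idffeatures")
// NaiveBayes reads "features" by default; point it at the IDF output so the IDF stage is actually used
val nb = new org.apache.spark.ml.classification.NaiveBayes().setFeaturesCol("idffeatures")
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb))
val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures, Array(10, 100, 1000)).addGrid(nb.smoothing, Array(0.01, 0.1, 1.0)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator()).setEstimatorParamMaps(paramGrid).setNumFolds(10)
// df is assumed to be a DataFrame with a "tweet" text column and a "label" column
val cvModel = cv.fit(df)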
As you can see, I defined a CrossValidator using a BinaryClassificationEvaluator. I have seen a lot of examples that compute metrics like Precision/Recall during testing, but those metrics are obtained on a separate dataset held out for testing (see, for example, this documentation).
From my understanding, CrossValidator is going to create folds, one fold will be used for testing purposes, and then CrossValidator will choose the best model. My question is: is it possible to get Precision/Recall metrics during the training process?
Recommended answer
Well, the only metric which is actually stored is the one you define when you create an instance of an Evaluator. For the BinaryClassificationEvaluator this can take one of two values:
areaUnderROC
areaUnderPR
with the former being the default; the metric can be set using the setMetricName method.
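As a minimal sketch of choosing the metric, assuming Spark ML is on the classpath (for example in spark-shell); the name prEvaluator is only illustrative and does not appear in the question:

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// The default metric is areaUnderROC; switch to the area under the
// precision-recall curve instead.
val prEvaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderPR")

// Confirm which metric the evaluator will report.
println(prEvaluator.getMetricName)
```

Passing prEvaluator to CrossValidator.setEvaluator in place of the default evaluator would make cross-validation score each parameter combination by areaUnderPR.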
These values are collected during the training process and can be accessed using CrossValidatorModel.avgMetrics. The order of the values corresponds to the order of EstimatorParamMaps (CrossValidatorModel.getEstimatorParamMaps).
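Because the two arrays share the same order, they can be paired by index. The helper below is a hypothetical sketch (rankByMetric is not a Spark function) written in plain Scala so it can be tried without a cluster; it assumes a larger metric is better, which holds for areaUnderROC and areaUnderPR:

```scala
// Hypothetical helper: pair each parameter combination with its averaged
// cross-validation metric and sort best-first (larger metric = better).
def rankByMetric[P](paramMaps: Array[P], avgMetrics: Array[Double]): Seq[(P, Double)] = {
  require(paramMaps.length == avgMetrics.length,
    "expected exactly one metric per parameter combination")
  paramMaps.toSeq
    .zip(avgMetrics.toSeq)
    .sortBy { case (_, metric) => -metric }
}

// Assumed usage with the cvModel fitted in the question (requires Spark):
// rankByMetric(cvModel.getEstimatorParamMaps, cvModel.avgMetrics)
//   .foreach { case (params, metric) => println(s"$metric <- $params") }
```

The first element of the result is the parameter combination with the highest averaged metric.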