How to get Precision/Recall using CrossValidator when training a NaiveBayes model with Spark

Question

I have a pipeline like this:

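import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
val idf = new IDF().setInputCol("features").setOutputCol("idffeatures")
val nb = new NaiveBayes()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb))
// Grid over the number of hashed features and the NaiveBayes smoothing parameter.
val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures, Array(10, 100, 1000)).addGrid(nb.smoothing, Array(0.01, 0.1, 1.0)).build()
// 10-fold cross-validation scored with BinaryClassificationEvaluator.
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator()).setEstimatorParamMaps(paramGrid).setNumFolds(10)
val cvModel = cv.fit(df)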

As you can see, I defined a CrossValidator using a BinaryClassificationEvaluator. I have seen many examples that obtain metrics like Precision/Recall during testing, but those metrics are computed on a separate data set held out for testing (see, for example, this documentation).

From my understanding, CrossValidator creates folds and uses one fold for testing, then chooses the best model. My question is: is it possible to get Precision/Recall metrics during the training process?

Answer

Well, the only metric that is actually stored is the one you define when you create an instance of an Evaluator. For the BinaryClassificationEvaluator, this can take one of two values:

  • areaUnderROC
  • areaUnderPR

with the former being the default. The metric can be set using the setMetricName method.
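As a minimal sketch of that setter, assuming the same Spark ML `BinaryClassificationEvaluator` used in the question, the evaluator could be switched from the default to the precision-recall based metric like this (the variable name `evaluator` is only illustrative):

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Select the area under the precision-recall curve instead of the
// default areaUnderROC.
val evaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderPR")
```

The resulting object would then be passed to the CrossValidator through `setEvaluator(evaluator)` in place of the inline `new BinaryClassificationEvaluator()`.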

These values are collected during the training process and can be accessed using CrossValidatorModel.avgMetrics. The order of the values corresponds to the order of EstimatorParamMaps (CrossValidatorModel.getEstimatorParamMaps).
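Because the two arrays share the same ordering, they can be zipped together to see which parameter combination produced which average metric. The helper below is a hypothetical sketch, not part of Spark: `rankByMetric` is an illustrative name, and it assumes a metric where larger is better, as is the case for areaUnderROC and areaUnderPR.

```scala
// Hypothetical helper (not part of Spark): pairs each parameter setting with
// its cross-validated average metric and sorts the pairs best-first.
// Assumes a metric where larger is better (areaUnderROC, areaUnderPR).
def rankByMetric[P](paramMaps: Seq[P], avgMetrics: Seq[Double]): Seq[(P, Double)] = {
  require(paramMaps.length == avgMetrics.length, "one metric per parameter setting")
  paramMaps.zip(avgMetrics).sortBy { case (_, metric) => -metric }
}
```

With the fitted model from the question, this might be called as `rankByMetric(cvModel.getEstimatorParamMaps.toSeq, cvModel.avgMetrics.toSeq)`, and each resulting pair printed to inspect the grid.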
