SparkR 2.0 Classification: how to get performance metrics?
Question
How to get performance metrics in SparkR classification, e.g., F1 score, precision, recall, confusion matrix?
# Load training data
df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")
training <- df
testing <- df
# Fit a random forest classification model with spark.randomForest
model <- spark.randomForest(training, label ~ features, "classification", numTrees = 10)
# Model summary
summary(model)
# Prediction
predictions <- predict(model, testing)
head(predictions)
# Performance evaluation
I tried caret::confusionMatrix(testing$label, testing$prediction), and it shows the error:

Error in unique.default(x, nmax = nmax) : unique() applies only to vectors
Answer
Caret's confusionMatrix will not work, since it needs R data frames while your data are in Spark DataFrames.

One not recommended way of getting your metrics is to collect your Spark DataFrame locally into R using as.data.frame, and then use caret etc.; but this requires that your data fit in the main memory of your driver machine, in which case you have no reason to use Spark in the first place...
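For completeness, that local route would look roughly like the sketch below (my own illustration, not from the original answer; it assumes the predictions from the question's code fit in driver memory, and that the prediction DataFrame has label and prediction columns, as predict() produces for the libsvm example):

# NOT recommended for large data: pull everything to the driver
local_preds <- as.data.frame(predictions)

library(caret)
# confusionMatrix(data, reference) expects factors with matching levels
caret::confusionMatrix(factor(local_preds$prediction),
                       factor(local_preds$label))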
So, here is a way to get the accuracy in a distributed manner (i.e. without collecting data locally), using the iris data as an example:
sparkR.version()
# "2.1.1"
df <- as.DataFrame(iris)
model <- spark.randomForest(df, Species ~ ., "classification", numTrees = 10)
predictions <- predict(model, df)
summary(predictions)
# SparkDataFrame[summary:string, Sepal_Length:string, Sepal_Width:string, Petal_Length:string, Petal_Width:string, Species:string, prediction:string]
createOrReplaceTempView(predictions, "predictions")
correct <- sql("SELECT prediction, Species FROM predictions WHERE prediction=Species")
count(correct)
# 149
acc = count(correct)/count(predictions)
acc
# 0.9933333
(Regarding the 149 correct predictions out of 150 samples: if you do a showDF(predictions, numRows=150), you will indeed see that there is a single virginica sample misclassified as versicolor.)
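The question also asked for precision, recall, F1, and the confusion matrix. These can be obtained in the same distributed fashion with SQL aggregates over the "predictions" temp view registered above. The queries below are a sketch of my own (not part of the original answer), shown for one class ("virginica"); only small aggregated results are collected to the driver:

# Per-class counts via SQL over the "predictions" temp view
tp <- collect(sql("SELECT COUNT(*) AS n FROM predictions
                   WHERE prediction = 'virginica' AND Species = 'virginica'"))$n
fp <- collect(sql("SELECT COUNT(*) AS n FROM predictions
                   WHERE prediction = 'virginica' AND Species != 'virginica'"))$n
fn <- collect(sql("SELECT COUNT(*) AS n FROM predictions
                   WHERE prediction != 'virginica' AND Species = 'virginica'"))$n

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# Full confusion matrix as an aggregated (hence small) result,
# safe to collect regardless of the size of the underlying data
confusion <- collect(sql("SELECT Species, prediction, COUNT(*) AS n
                          FROM predictions
                          GROUP BY Species, prediction"))

Repeating the three count queries per class (or computing them in one grouped query) gives macro-averaged precision/recall/F1 if needed.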