SparkR 2.0 Classification: how to get performance metrics?


Question

How can I get performance metrics in SparkR classification, e.g. F1 score, precision, recall, confusion matrix?

# Load training data
df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")
training <- df
testing <- df

# Fit a random forest classification model with spark.randomForest
model <- spark.randomForest(training, label ~ features, "classification", numTrees = 10)

# Model summary
summary(model)

# Prediction
predictions <- predict(model, testing)
head(predictions)

# Performance evaluation

I tried caret::confusionMatrix(testing$label, testing$prediction), but it shows an error:

   Error in unique.default(x, nmax = nmax) :   unique() applies only to vectors

Answer

Caret's confusionMatrix will not work here, since it needs R data frames, while your data are in Spark DataFrames.

One not recommended way of getting your metrics is to "collect" your Spark DataFrames locally into R using as.data.frame, and then use caret etc.; but this requires that your data fit in the main memory of your driver machine, in which case you of course have no reason to use Spark at all...
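If your data genuinely do fit in driver memory, that collect-then-caret route looks roughly like this (a sketch building on the question's snippet; it assumes the caret package is installed and that label and prediction are the column names, as in the code above):

```r
# Bring the Spark DataFrame down to a plain R data.frame
# (only safe when the data are small enough for the driver)
local_preds <- as.data.frame(predictions)

# caret::confusionMatrix expects factors with matching levels,
# not Spark DataFrame columns -- hence the original error
obs  <- factor(local_preds$label)
pred <- factor(local_preds$prediction, levels = levels(obs))

# Confusion matrix plus accuracy, per-class precision/recall, etc.
caret::confusionMatrix(data = pred, reference = obs)
```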

So, here is a way to get the accuracy in a distributed manner (i.e. without collecting the data locally), using the iris data as an example:

sparkR.version()
# "2.1.1"

df <- as.DataFrame(iris)
model <- spark.randomForest(df, Species ~ ., "classification", numTrees = 10)
predictions <- predict(model, df)
summary(predictions)
# SparkDataFrame[summary:string, Sepal_Length:string, Sepal_Width:string, Petal_Length:string, Petal_Width:string, Species:string, prediction:string]

createOrReplaceTempView(predictions, "predictions")
correct <- sql("SELECT prediction, Species FROM predictions WHERE prediction=Species")
count(correct)
# 149
acc <- count(correct)/count(predictions)
acc
# 0.9933333

(Regarding the 149 correct predictions out of 150 samples: if you do showDF(predictions, numRows=150), you will indeed see that a single virginica sample was misclassified as versicolor.)
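The same temp-view/SQL trick extends to the other metrics the question asks about, still without collecting anything to the driver. A sketch under the same assumptions (the predictions temp view registered above, with Species and prediction columns; note that in Spark SQL, dividing two counts already yields a double):

```r
# Confusion matrix as a grouped count: one row per (actual, predicted) pair
showDF(sql("SELECT Species AS actual, prediction AS predicted, COUNT(*) AS n
            FROM predictions
            GROUP BY Species, prediction"))

# Recall per class: correct predictions over the actual class count
recall <- sql("SELECT Species AS class,
                      SUM(CASE WHEN prediction = Species THEN 1 ELSE 0 END) / COUNT(*) AS rec
               FROM predictions GROUP BY Species")

# Precision per class: correct predictions over the predicted class count
precision <- sql("SELECT prediction AS class,
                         SUM(CASE WHEN prediction = Species THEN 1 ELSE 0 END) / COUNT(*) AS prec
                  FROM predictions GROUP BY prediction")

# Join the two and derive per-class F1 = 2*p*r / (p + r)
createOrReplaceTempView(recall, "recall_tbl")
createOrReplaceTempView(precision, "precision_tbl")
showDF(sql("SELECT r.class, p.prec, r.rec,
                   2 * p.prec * r.rec / (p.prec + r.rec) AS f1
            FROM recall_tbl r JOIN precision_tbl p ON r.class = p.class"))
```

All the aggregation runs inside Spark; only the tiny per-class summaries ever reach the driver when displayed.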

