Confusion Matrix to get precision, recall, f1score


Problem description

I have a dataframe df and have run a DecisionTree classification algorithm on it. The two columns used when fitting are label and features. The model is called dtc. How can I create a confusion matrix in pyspark?

dtc = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label')
dtcModel = dtc.fit(train)
predictions = dtcModel.transform(test)

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.evaluation import MulticlassMetrics

preds = df.select(['label', 'features']) \
                            .df.map(lambda line: (line[1], line[0]))
metrics = MulticlassMetrics(preds)

# Confusion Matrix
print(metrics.confusionMatrix().toArray())

Recommended answer

You need to convert the DataFrame to an RDD and map each row to a tuple before calling metrics.confusionMatrix().toArray().

From the official documentation:

class pyspark.mllib.evaluation.MulticlassMetrics(predictionAndLabels)

Evaluator for multiclass classification.

Parameters: predictionAndLabels – an RDD of (prediction, label) pairs.
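To make the expected input shape concrete, here is a minimal plain-Python sketch (no Spark involved) that builds a confusion matrix from (prediction, label) pairs. The function name confusion_matrix and the sample pairs are made up for illustration. The layout, with actual labels in rows and predicted labels in columns, both in ascending label order, is assumed to mirror the convention Spark documents for confusionMatrix(). Taking the union of predicted and actual labels is a simplification of this sketch, not a claim about Spark's internals.

```python
from collections import Counter


def confusion_matrix(pairs):
    """Build a confusion matrix from (prediction, label) pairs.

    Rows are actual labels, columns are predicted labels, both sorted
    in ascending order (an assumption made for this sketch).
    """
    # Normalise everything to float, the type the evaluator works with.
    pairs = [(float(pred), float(lab)) for pred, lab in pairs]
    # Simplification: use every value seen as either prediction or label.
    labels = sorted({lab for _, lab in pairs} | {pred for pred, _ in pairs})
    index = {lab: i for i, lab in enumerate(labels)}

    size = len(labels)
    matrix = [[0] * size for _ in range(size)]
    for (pred, lab), count in Counter(pairs).items():
        matrix[index[lab]][index[pred]] = count
    return labels, matrix


# Hypothetical (prediction, label) pairs, in the same order the RDD must use.
pairs = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (2.0, 2.0), (2.0, 2.0)]
labels, matrix = confusion_matrix(pairs)
for row in matrix:
    print(row)
```

Note the order inside each pair: the prediction comes first and the label second, which is the opposite of how the columns are usually selected from a DataFrame.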

Here is an example to guide you.

Machine learning part

import pyspark.sql.functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql.types import FloatType
#Note the differences between ml and mllib, they are two different libraries.

#create a sample data frame
data = [(1.54,3.45,2.56,0),(9.39,8.31,1.34,0),(1.25,3.31,9.87,1),(9.35,5.67,2.49,2),\
        (1.23,4.67,8.91,1),(3.56,9.08,7.45,2),(6.43,2.23,1.19,1),(7.89,5.32,9.08,2)]

cols = ('a','b','c','d')

df = spark.createDataFrame(data, cols)

assembler = VectorAssembler(inputCols=['a','b','c'], outputCol='features')

df_features = assembler.transform(df)

#df.show()

train_data, test_data = df_features.randomSplit([0.6,0.4])

dtc = DecisionTreeClassifier(featuresCol='features',labelCol='d')

dtcModel = dtc.fit(train_data)

predictions = dtcModel.transform(test_data)

Evaluation part

#important: need to cast to float type, and order by prediction, else it won't work
preds_and_labels = predictions.select(['prediction','d']).withColumn('label', F.col('d').cast(FloatType())).orderBy('prediction')

#select only prediction and label columns
preds_and_labels = preds_and_labels.select(['prediction','label'])

metrics = MulticlassMetrics(preds_and_labels.rdd.map(tuple))

print(metrics.confusionMatrix().toArray())
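Since the title asks for precision, recall and F1 score, here is a minimal sketch of how these can be derived from a confusion matrix such as the array printed above. It is plain Python and only assumes that rows are actual classes and columns are predicted classes. The helper name per_class_scores and the example matrix are hypothetical and are not output from the model above.

```python
def per_class_scores(cm):
    """Return {class_index: (precision, recall, f1)} for a square matrix.

    Assumes rows are actual classes and columns are predicted classes.
    """
    rows = [list(r) for r in cm]
    n = len(rows)
    scores = {}
    for k in range(n):
        tp = rows[k][k]                                   # correctly predicted as k
        predicted_k = sum(rows[i][k] for i in range(n))   # column sum
        actual_k = sum(rows[k])                           # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[k] = (precision, recall, f1)
    return scores


# Hypothetical 3-class confusion matrix, used for illustration only.
example_cm = [[1, 0, 0],
              [0, 1, 1],
              [0, 0, 2]]
scores = per_class_scores(example_cm)
for k, (p, r, f) in scores.items():
    print(k, round(p, 3), round(r, 3), round(f, 3))
```

The same function should accept metrics.confusionMatrix().toArray() directly, since it only iterates over rows. MulticlassMetrics also exposes precision(label), recall(label) and fMeasure(label) for per-class values, so the manual computation is mainly useful for checking the numbers or when only the matrix is available.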
