What is the difference between a Confusion Matrix and a Contingency Table?

Question

I'm writing a piece of code to evaluate my clustering algorithm, and I find that every evaluation method needs the basic data from an m*n matrix A = {aij}, where aij is the number of data points that are members of class ci and elements of cluster kj.

But Introduction to Data Mining (Pang-Ning Tan et al.) appears to describe two matrices of this type: one is the confusion matrix, the other is the contingency table. I do not fully understand the difference between the two. Which best describes the matrix I want to use?
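The m*n matrix A = {aij} described in the question can be sketched in a few lines. This is only an illustration: the function name `class_cluster_matrix` and the toy labels are hypothetical, not taken from the book or from the asker's code.

```python
from collections import Counter

def class_cluster_matrix(classes, clusters):
    """Count points per (class, cluster) pair.

    Returns (matrix, class_labels, cluster_labels), where matrix[i][j] is the
    number of points with class class_labels[i] placed in cluster cluster_labels[j].
    """
    class_labels = sorted(set(classes))
    cluster_labels = sorted(set(clusters))
    counts = Counter(zip(classes, clusters))  # missing pairs count as 0
    matrix = [[counts[(c, k)] for k in cluster_labels] for c in class_labels]
    return matrix, class_labels, cluster_labels

# Hypothetical toy data: true classes and cluster assignments for six points.
classes = ["c1", "c1", "c1", "c2", "c2", "c2"]
clusters = ["k1", "k1", "k2", "k2", "k2", "k1"]

A, rows, cols = class_cluster_matrix(classes, clusters)
print(rows, cols)
for row in A:
    print(row)
```

Rows correspond to classes and columns to clusters, so each entry is exactly the aij count the question describes.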

Recommended answer

Wikipedia's definition:

In the field of artificial intelligence, a confusion matrix is a visualization tool typically used in supervised learning (in unsupervised learning it is typically called a matching matrix). Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.

The confusion matrix should be clear: it shows how many actual results match the predicted results. For example, see this confusion matrix:

                     Predicted class
                       c1     c2
  Actual class   c1    15      3
                 c2     0      2

It shows that:

  1. Column 1, row 1 means that the classifier predicted 15 items as belonging to class c1, and those 15 items actually belong to class c1 (correct predictions).
  2. Column 2, row 1 means that the classifier predicted 3 items as belonging to class c2, but they actually belong to class c1 (wrong predictions).
  3. Column 1, row 2 means that no item that actually belongs to class c2 was predicted to belong to class c1 (any count here would be a wrong prediction).
  4. Column 2, row 2 means that 2 items that belong to class c2 were predicted to belong to class c2 (correct predictions).
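As a rough sketch, the matrix above can be built by counting (actual, predicted) pairs. The helper name `confusion_matrix` and the label lists are made up for illustration; they are arranged so that the counts reproduce the 15 / 3 / 0 / 2 example.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Hypothetical data: 18 items are really c1, 2 items are really c2.
actual = ["c1"] * 18 + ["c2"] * 2
# Of the 18 c1 items, 15 are predicted c1 and 3 are predicted c2;
# both c2 items are predicted c2.
predicted = ["c1"] * 15 + ["c2"] * 3 + ["c2"] * 2

cm = confusion_matrix(actual, predicted, ["c1", "c2"])
print(cm)  # [[15, 3], [0, 2]]
```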

Now look at the formulas for accuracy and error rate in your book (Chapter 4, Section 4.2), and you should be able to see clearly what a confusion matrix is. It is used to test the accuracy of a classifier on data with known results. K-fold cross-validation, also mentioned in your book, is one method for estimating the accuracy of a classifier.
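For the example matrix, accuracy and error rate can be worked out as below. This sketch assumes the usual definitions (accuracy = correct predictions / all predictions, error rate = 1 - accuracy); the function name is hypothetical.

```python
def accuracy_and_error_rate(matrix):
    """Accuracy is the diagonal sum over the total; error rate is its complement."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    accuracy = correct / total
    return accuracy, 1 - accuracy

# Confusion matrix from the example: rows = actual, columns = predicted.
cm = [[15, 3],
      [0, 2]]

acc, err = accuracy_and_error_rate(cm)
print(f"accuracy={acc:.2f}, error rate={err:.2f}")  # accuracy=0.85, error rate=0.15
```

Here 17 of the 20 items lie on the diagonal, so the accuracy is 17/20 and the error rate is 3/20.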

Now, for the contingency table, Wikipedia's definition:

In statistics, a contingency table (also referred to as cross tabulation or cross tab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. It is often used to record and analyze the relation between two or more categorical variables.

In data mining, contingency tables are used to show which items occur together in a record, such as a transaction or a shopping cart in a sales analysis. For example (this is the example from the book you mentioned):

         Coffee   !Coffee   Total
Tea        150        50     200
!Tea       650       150     800
Total      800       200    1000

It says that, out of 1000 survey responses (about whether people like coffee, tea, both, or neither):

  1. 150 people like both tea and coffee
  2. 50 people like tea but not coffee
  3. 650 people do not like tea but like coffee
  4. 150 people like neither tea nor coffee

Contingency tables are used to find the support and confidence of association rules, that is, to evaluate association rules (read Chapter 6, Section 6.7.1).
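As an illustration, here is how support and confidence could be read off the table for the rule Tea -> Coffee. The sketch assumes the standard definitions (support = fraction of all responses containing both items, confidence = fraction of tea responses that also contain coffee); the function and parameter names are hypothetical.

```python
def support_and_confidence(both, antecedent_total, grand_total):
    """Support and confidence of a rule X -> Y from contingency-table counts.

    both: count of records with X and Y
    antecedent_total: count of records with X (the row total)
    grand_total: count of all records
    """
    support = both / grand_total
    confidence = both / antecedent_total
    return support, confidence

# Rule Tea -> Coffee, using the counts from the table above.
sup, conf = support_and_confidence(both=150, antecedent_total=200, grand_total=1000)
print(f"support={sup:.2f}, confidence={conf:.2f}")  # support=0.15, confidence=0.75
```

For comparison, the column total shows that 800 of the 1000 respondents like coffee overall, which is worth keeping in mind when judging a confidence of 0.75.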

So the difference is that a confusion matrix is used to evaluate the performance of a classifier: it tells how accurate the classifier's predictions are. A contingency table is used to evaluate association rules.

After reading this answer, google a bit (always use Google while you are reading your book), read what the book says, look at a few examples, and don't forget to solve a few of the exercises in the book. You should then have a clear idea of both concepts, and of which one to use in a given situation and why.

Hope this helps.
