Confusion matrix for Clustering in scikit-learn


Problem description

I have a set of data with known labels. I want to try clustering and see if I can get the same clusters given by known labels. To measure the accuracy, I need to get something like a confusion matrix.

I know I can easily get a confusion matrix for the test set of a classification problem, and I have already tried that.

However, it can't be used for clustering, because it expects the rows and the columns to share the same set of labels. That makes sense for a classification problem, but for a clustering problem what I expect is something like this:

Rows - Actual labels

Columns - New cluster names (i.e. cluster-1, cluster-2 etc.)

Is there a way to do this?

Edit: Here are more details.

sklearn.metrics.confusion_matrix expects y_test and y_pred to contain the same set of values, and labels to be the labels of those values.

That's why it gives a matrix with the same labels for both rows and columns.

But in my case (KMeans clustering), the real values are strings and the estimated values are numbers (i.e. cluster numbers).

Therefore, if I call confusion_matrix(y_true, y_pred), it raises the error below.

ValueError: Mix of label input types (string and number)

This is the real problem. For a classification problem, this makes sense. But for a clustering problem, this restriction shouldn't be there, because real label names and new cluster names don't need to be the same.

With this, I understand that I'm trying to use a tool meant for classification problems on a clustering problem. So my question is: is there a way to get such a matrix for my clustered data?

Hope the question is now clearer. Please let me know if it isn't.

Solution

I wrote some code for this myself.

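# Compute a confusion (contingency) matrix: rows = actual labels, columns = clusters
def confusion_matrix(act_labels, pred_labels):
    unique_labels = sorted(set(act_labels))   # fixed row order
    clusters = sorted(set(pred_labels))       # fixed column order
    # Look up row/column positions so cluster ids need not be 0..k-1
    row_of = {label: i for i, label in enumerate(unique_labels)}
    col_of = {cluster: j for j, cluster in enumerate(clusters)}
    cm = [[0] * len(clusters) for _ in unique_labels]
    for act_label, pred_label in zip(act_labels, pred_labels):
        cm[row_of[act_label]][col_of[pred_label]] += 1
    return cm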

# Example
labels=['a','b','c',
        'a','b','c',
        'a','b','c',
        'a','b','c']
pred=[  1,1,2,
        0,1,2,
        1,1,1,
        0,1,2]
cnf_matrix = confusion_matrix(labels, pred)
print('\n'.join([''.join(['{:4}'.format(item) for item in row])
      for row in cnf_matrix]))
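
Since the question asks for a way to measure accuracy, one possible follow-up is to reduce such a matrix to a single number. The sketch below computes cluster purity, i.e. the share of samples that carry the majority label of their cluster. Purity is an assumption on my part, not something the original answer uses, and cluster_purity is a hypothetical helper name. The matrix is written out by hand (rows a, b, c; columns cluster 0, 1, 2) and matches the counts from the example data.

```python
def cluster_purity(cm):
    """Return the fraction of samples that match their cluster's majority label.

    cm is a list of rows (one per actual label); each row holds the counts
    for every cluster (one column per cluster).
    """
    total = sum(sum(row) for row in cm)
    if total == 0:
        raise ValueError("the matrix contains no samples")
    # zip(*cm) transposes the matrix, so each item is one cluster's column.
    majority = sum(max(column) for column in zip(*cm))
    return majority / total


# Counts for the example data: rows a, b, c; columns cluster 0, 1, 2.
example_cm = [[2, 2, 0],
              [0, 4, 0],
              [0, 1, 3]]

purity = cluster_purity(example_cm)
print(purity)  # 0.75
```

Purity does not depend on the order of the rows or columns, so it works regardless of how the labels and clusters happen to be ordered. Keep in mind that it rises trivially as the number of clusters grows, so it is only meaningful when comparing runs with the same number of clusters.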

Edit: (Dayyyuumm) I just found that I could do this easily with pandas crosstab :-/.

import pandas as pd

labels=['a','b','c',
        'a','b','c',
        'a','b','c',
        'a','b','c']
pred=[  1,1,2,
        0,1,2,
        1,1,1,
        0,1,2]

# Create a DataFrame with the labels and cluster numbers as columns: df
df = pd.DataFrame({'Labels': labels, 'Clusters': pred})

# Create crosstab: ct
ct = pd.crosstab(df['Labels'], df['Clusters'])

# Display ct
print(ct)
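
The crosstab can also be used to get around the original "mix of label input types" error. A minimal sketch, assuming that naming each cluster after its most frequent actual label is acceptable: the variable names cluster_to_label and relabelled are hypothetical, and the majority-label mapping is my own addition rather than part of the answer.

```python
import pandas as pd

labels = ['a', 'b', 'c'] * 4                      # same data as above
pred = [1, 1, 2, 0, 1, 2, 1, 1, 1, 0, 1, 2]

df = pd.DataFrame({'Labels': labels, 'Clusters': pred})
ct = pd.crosstab(df['Labels'], df['Clusters'])

# For each cluster (column), pick the actual label (row) with the highest count.
cluster_to_label = ct.idxmax(axis=0)

# Replace every cluster number with its cluster's majority label.
relabelled = df['Clusters'].map(cluster_to_label)

# Share of samples whose relabelled cluster equals the known label.
accuracy = (relabelled == df['Labels']).mean()

print(cluster_to_label.to_dict())
print(accuracy)
```

After this step both the actual and the predicted values are strings, so tools built for classification can compare them. Note that two clusters can end up with the same majority label, in which case some actual labels never appear among the relabelled predictions.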
