How to plot the confusion/similarity matrix of a K-means algorithm


Problem description


I applied the K-means algorithm to cluster some text documents using scikit-learn and displayed the clustering result. I would like to display the similarity of my clusters in a similarity matrix, but I didn't see any tool in the scikit-learn library that does this.

# headlines type: <class 'numpy.ndarray'> tf-idf vectors
pca = PCA(n_components=2).fit(headlines)
data2D = pca.transform(to_headlines)
pl.scatter(data2D[:, 0], data2D[:, 1])
km = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=3, random_state=0)
km.fit(headlines)

Is there any way/library that will allow me to easily draw this cosine similarity matrix?

Solution

If I get you right, you'd like to produce a confusion matrix similar to the one shown here. However, this requires a truth and a prediction that can be compared to each other. Assuming that you have some gold standard for the classification of your headlines into k groups (the truth), you could compare this to the KMeans clustering (the prediction).

The only problem with this is that KMeans clustering is agnostic to your truth, meaning the cluster labels that it produces will not be matched to the labels of the gold standard groups. There is, however, a work-around for this, which is to match the kmeans labels to the truth labels based on the best possible match.
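To see why the labels need matching at all, here is a tiny sketch with made-up labels (the arrays and the hand-written mapping are illustrative assumptions, not part of your data). The clustering recovers the groups perfectly but names them differently, so a naive comparison reports no agreement until the cluster labels are renamed.

```python
import numpy as np

# Made-up example: the grouping is identical, only the label names differ
truth = np.array([0, 0, 1, 1, 2, 2])
clusters = np.array([2, 2, 0, 0, 1, 1])

# Naive element-wise comparison: no label happens to coincide
raw_agreement = np.mean(truth == clusters)

# Rename cluster ids to truth ids (mapping written by hand for this toy case)
mapping = {2: 0, 0: 1, 1: 2}
relabeled = np.array([mapping[c] for c in clusters])
matched_agreement = np.mean(truth == relabeled)

print(raw_agreement, matched_agreement)
```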

Here is an example of how this might work.


First, let's generate some example data - in this case 100 samples with 50 features each, sampled from 4 different (and slightly overlapping) normal distributions. The details are irrelevant; all this is supposed to do is mimic the kind of dataset you might be working with. The truth in this case is the mean of the normal distribution that a sample was generated from.

import numpy as np
import matplotlib.pyplot as plt

# User input
n_samples  = 100
n_features =  50

# Prep
truth = np.empty(n_samples, dtype=int)  # integer class labels (0..3)
data  = np.empty((n_samples, n_features))
np.random.seed(42)

# Generate
for i,mu in enumerate(np.random.choice([0,1,2,3], n_samples, replace=True)):
    truth[i]  = mu
    data[i,:] = np.random.normal(loc=mu, scale=1.5, size=n_features)

# Show
plt.imshow(data, interpolation='none')
plt.show()


Next, we can apply the PCA and KMeans.

Note that I am not sure what exactly the point of the PCA is in your example, since you are not actually using the PCs for your KMeans. It is also unclear what the dataset to_headlines, which you transform, actually is.

Here, I am transforming the input data itself and then using the PCs for the KMeans clustering. I am also using the output to illustrate the visualization that Saikat Kumar Dey suggested in a comment to your question: a scatter plot with points colored by cluster label.

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# PCA
pca = PCA(n_components=2).fit(data)
data2D = pca.transform(data)

# Kmeans
km = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=3, random_state=0)
km.fit(data2D)

# Show
plt.scatter(data2D[:, 0], data2D[:, 1],
            c=km.labels_, edgecolor='none')  # 'none' disables marker edges; '' is rejected by recent matplotlib
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()


Next, we have to find the best-matching pairs between the truth labels we generated in the beginning (here the mu of the sampled normal distributions) and the kmeans labels generated by the clustering.

In this example, I simply match them such that the number of true-positive predictions is maximized. Note that this is a simplistic, quick-and-dirty solution!

If your predictions are pretty good in general and if each group is represented by a similar number of samples in your dataset, it will probably work as intended - otherwise, it may produce mis-matches/mergers and you may somewhat overestimate the quality of your clustering as a result.

Suggestions for better solutions are welcome.

# Prep
k_labels = km.labels_  # Get cluster labels
k_labels_matched = np.empty_like(k_labels)

# For each cluster label...
for k in np.unique(k_labels):

    # ...find and assign the best-matching truth label
    match_nums = [np.sum((k_labels==k)*(truth==t)) for t in np.unique(truth)]
    k_labels_matched[k_labels==k] = np.unique(truth)[np.argmax(match_nums)]
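One possible alternative to the greedy matching above, sketched here rather than tested on real data: build the cluster-by-truth overlap table and solve it as an assignment problem with scipy.optimize.linear_sum_assignment, so that each cluster gets a distinct truth label and mergers cannot happen. The function name match_labels_one_to_one is made up for this sketch, and it assumes the number of clusters equals the number of truth labels.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_labels_one_to_one(truth, k_labels):
    """Rename cluster labels to truth labels using a one-to-one assignment.

    Hypothetical helper; assumes as many clusters as truth labels.
    """
    truth = np.asarray(truth)
    k_labels = np.asarray(k_labels)
    t_vals = np.unique(truth)
    k_vals = np.unique(k_labels)
    if len(t_vals) != len(k_vals):
        raise ValueError("this sketch assumes equal numbers of clusters and truth labels")

    # overlap[i, j] = number of samples in cluster k_vals[i] with truth label t_vals[j]
    overlap = np.array([[np.sum((k_labels == k) & (truth == t)) for t in t_vals]
                        for k in k_vals])

    # Pick the pairing that maximizes the total overlap
    rows, cols = linear_sum_assignment(overlap, maximize=True)

    matched = np.empty(k_labels.shape, dtype=t_vals.dtype)
    for r, c in zip(rows, cols):
        matched[k_labels == k_vals[r]] = t_vals[c]
    return matched


# Toy case where the greedy approach would map both clusters to truth label 0
toy_truth = np.array([0, 0, 0, 0, 0, 0, 1, 1])
toy_clusters = np.array([0, 0, 0, 1, 1, 1, 1, 1])
toy_matched = match_labels_one_to_one(toy_truth, toy_clusters)
print(toy_matched)
```

With the variables from this answer, the call would be k_labels_matched = match_labels_one_to_one(truth, km.labels_). Whether this gives a fairer picture than the greedy version depends on your data; with unequal numbers of clusters and groups, a different strategy is needed.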


Now that we have matched truths and predictions, we can finally compute and plot the confusion matrix.

# Compute confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(truth, k_labels_matched)

# Plot confusion matrix
plt.imshow(cm,interpolation='none',cmap='Blues')
for (i, j), z in np.ndenumerate(cm):
    plt.text(j, i, z, ha='center', va='center')
plt.xlabel("kmeans label")
plt.ylabel("truth label")
plt.show()
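If what you are after is really the cosine similarity matrix mentioned in the question and no gold standard is available, one option is to compute pairwise cosine similarities and sort the samples by cluster label, so that well-separated clusters show up as blocks along the diagonal. This is only a sketch: the helper names cluster_sorted_cosine_similarity and plot_similarity_matrix and the toy data are assumptions for illustration, while cosine_similarity is the function from sklearn.metrics.pairwise.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity


def cluster_sorted_cosine_similarity(X, labels):
    """Return the sample-by-sample cosine similarity, reordered by cluster label."""
    labels = np.asarray(labels)
    order = np.argsort(labels, kind="stable")  # group samples of the same cluster
    sim = cosine_similarity(X)                 # shape (n_samples, n_samples)
    return sim[np.ix_(order, order)], labels[order]


def plot_similarity_matrix(sim, ax=None):
    """Draw the similarity matrix as a heatmap (hypothetical helper)."""
    if ax is None:
        _, ax = plt.subplots()
    im = ax.imshow(sim, interpolation="none", cmap="Blues")
    ax.set_xlabel("sample (sorted by cluster)")
    ax.set_ylabel("sample (sorted by cluster)")
    ax.figure.colorbar(im, ax=ax, label="cosine similarity")
    return ax


# Toy data: two directions in 2D, interleaved cluster labels
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]])
labels_demo = np.array([0, 1, 0, 1])
sim_demo, sorted_labels = cluster_sorted_cosine_similarity(X_demo, labels_demo)
```

For the data in this answer you would call cluster_sorted_cosine_similarity(data, km.labels_), pass the resulting matrix to plot_similarity_matrix, and finish with plt.show(). For the tf-idf vectors in the question, headlines would take the place of data.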


Hope this helps!
