How can I plot a KMeans text clustering result with matplotlib?


Problem description

I have the following code to cluster some example text with scikit-learn.

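import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

train = ["is this good?", "this is bad", "some other text here", "i am hero", "blue jeans", "red carpet", "red dog", "blue sweater", "red hat", "kitty blue"]

vect = TfidfVectorizer()
X = vect.fit_transform(train)
clf = KMeans(n_clusters=3)
clf.fit(X)
centroids = clf.cluster_centers_

# This only plots the first two tf-idf dimensions of each centroid
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=80, linewidths=5)
plt.show()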

What I can't figure out is how to plot the clustered results. X is a csr_matrix. What I want is an (x, y) coordinate for each result so that I can plot it.

Ty

Recommended answer

Your centroids end up as a 3 x 17 matrix (three clusters, one column per tf-idf term; the exact column count depends on how the vectorizer tokenizes the text), so you need some sort of projection or dimensionality reduction to get centroids in two dimensions. You have several options; here's an example with t-SNE:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

train = ["is this good?", "this is bad", "some other text here", "i am hero", "blue jeans", "red carpet", "red dog",
     "blue sweater", "red hat", "kitty blue"]

vect = TfidfVectorizer()
X = vect.fit_transform(train)  # sparse tf-idf matrix, one row per document
clf = KMeans(n_clusters=3)
clf.fit(X)
centroids = clf.cluster_centers_  # dense array, one row per cluster

tsne_init = 'pca'  # could also be 'random'
# The original answer used 20.0; recent scikit-learn releases require
# perplexity < n_samples, and only 3 centroids are embedded here.
tsne_perplexity = 2.0
tsne_early_exaggeration = 4.0
tsne_learning_rate = 1000
random_state = 1
model = TSNE(n_components=2, random_state=random_state, init=tsne_init, perplexity=tsne_perplexity,
         early_exaggeration=tsne_early_exaggeration, learning_rate=tsne_learning_rate)

# Embed the high-dimensional centroids into two dimensions
transformed_centroids = model.fit_transform(centroids)
print(transformed_centroids)
plt.scatter(transformed_centroids[:, 0], transformed_centroids[:, 1], marker='x')
plt.show()

In your example, if you initialize t-SNE with PCA you get widely spaced centroids; if you use random initialization, the centroids end up bunched tightly together and the picture is uninteresting.
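The question also asks for an (x, y) coordinate for each clustered text, not only for the centroids. One way to get that is sketched below, under some assumptions: it swaps t-SNE for TruncatedSVD, because TruncatedSVD accepts the sparse csr_matrix directly and has a transform method, so the centroids can be projected into the same 2-D space as the documents. The helper names project_clusters and plot_clusters are made up for this illustration, and n_init=10 and random_state=1 are arbitrary choices.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

train = ["is this good?", "this is bad", "some other text here", "i am hero",
         "blue jeans", "red carpet", "red dog", "blue sweater", "red hat",
         "kitty blue"]


def project_clusters(texts, n_clusters=3, random_state=1):
    """Hypothetical helper: cluster texts, then project documents and centroids to 2-D."""
    X = TfidfVectorizer().fit_transform(texts)  # sparse csr_matrix
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    svd = TruncatedSVD(n_components=2, random_state=random_state)
    doc_xy = svd.fit_transform(X)                     # one (x, y) row per document
    centroid_xy = svd.transform(km.cluster_centers_)  # same projection for centroids
    return doc_xy, centroid_xy, km.labels_


def plot_clusters(doc_xy, centroid_xy, labels):
    """Hypothetical helper: scatter documents colored by cluster, centroids as crosses."""
    fig, ax = plt.subplots()
    ax.scatter(doc_xy[:, 0], doc_xy[:, 1], c=labels, cmap="viridis", s=40)
    ax.scatter(centroid_xy[:, 0], centroid_xy[:, 1], marker="x", c="red",
               s=80, linewidths=2)
    ax.set_xlabel("SVD component 1")
    ax.set_ylabel("SVD component 2")
    return fig


doc_xy, centroid_xy, labels = project_clusters(train)
fig = plot_clusters(doc_xy, centroid_xy, labels)
```

Calling plt.show() afterwards displays the figure. With only ten short texts the two SVD components capture a small part of the tf-idf space, so clusters may overlap in the plot even when KMeans separates them in the full space.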
