How to plot text clusters?

Problem description

I have started learning clustering with Python and the sklearn library. I wrote some simple code to cluster text data. My goal is to find groups/clusters of similar sentences. I tried to plot them, but failed.

The problem is with text data; I always get this error:

ValueError: setting an array element with a sequence.

The same method works for numeric data but not for text data. Is there a way to plot groups/clusters of similar sentences? Also, is there a way to see what those groups are, what they represent, and how I can identify them? I printed labels = kmeans.predict(x), but that is just a list of numbers. What do they represent?

import pandas as pd
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt


x = ['this is very good show' , 'i had a great time on my school trip', 'such a boring movie', 'Springbreak was amazing', 'You are wrong', 'This food is so tasty', 'I had so much fun last night', 'This is crap', 'I had a bad time last month',
    'i love this product' , 'this is an amazing item', 'this food is delicious', 'I had a great time last night', 'thats right',
     'this is my favourite restaurant' , 'i love this food, its so good', 'skiing is the best sport', 'what is this', 'this product has a lot of bugs',
     'I love basketball, its very dynamic' , 'its a shame that you missed the trip', 'game last night was amazing', 'Party last night was so boring',
     'such a nice song' , 'this is the best movie ever', 'hawaii is the best place for trip','how that happened','This is my favourite band',
     'I cant believe that you did that', 'Why are you doing that, I do not gete it', 'this is tasty', 'this song is amazing']

cv = CountVectorizer(analyzer = 'word', max_features = 5000, lowercase=True, preprocessor=None, tokenizer=None, stop_words = 'english')  
x = cv.fit_transform(x)
#x_test = cv.transform(x_test)


my_list = []

for i in range(1,11):

    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 0)
    kmeans.fit(x)
    my_list.append(kmeans.inertia_)
    labels = kmeans.predict(x) #this prints the array of numbers
    print(labels)

plt.plot(range(1,11),my_list)
plt.show()



kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 0)
y_kmeans = kmeans.fit_predict(x)

plt.scatter(x[y_kmeans == 0,0], x[y_kmeans==0,1], s = 15, c= 'red', label = 'Cluster_1')
plt.scatter(x[y_kmeans == 1,0], x[y_kmeans==1,1], s = 15, c= 'blue', label = 'Cluster_2')
plt.scatter(x[y_kmeans == 2,0], x[y_kmeans==2,1], s = 15, c= 'green', label = 'Cluster_3')
plt.scatter(x[y_kmeans == 3,0], x[y_kmeans==3,1], s = 15, c= 'cyan', label = 'Cluster_4')
plt.scatter(x[y_kmeans == 4,0], x[y_kmeans==4,1], s = 15, c= 'magenta', label = 'Cluster_5')

plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], s = 100, c = 'black', label = 'Centroids')
plt.show()

Recommended answer

There are several moving pieces to this question:

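  1. how to vectorize text into data that kmeans clustering can understand
  2. how to plot the clusters in two-dimensional space
  3. how to label the plot with the source sentences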
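As a minimal sketch of the first piece, the snippet below shows what the vectorizer returns. The three sample sentences and the variable names are illustrative assumptions, not the full data from the question. The result is a SciPy sparse matrix, and the scatter calls in the question are handed slices of such a matrix, which is a likely source of the ValueError; converting to a dense array first avoids that.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A small hypothetical sample, not the full list from the question.
sentences = ['this food is so tasty', 'this food is delicious', 'such a boring movie']

cv = CountVectorizer(stop_words='english')
vectors = cv.fit_transform(sentences)  # SciPy sparse matrix, one row per sentence

print(vectors.shape)           # (number of sentences, vocabulary size)
print(sorted(cv.vocabulary_))  # the words that became columns

# Plotting functions expect dense arrays, so convert before slicing columns.
dense = vectors.toarray()
print(dense)
```

Each row of `dense` is one sentence and each column counts one remaining word, so kmeans sees every sentence as a point in a space with one dimension per word.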

My solution follows a very common approach, which is to use the kmeans labels as colors for the scatter plot. (The kmeans labels after fitting are just 0, 1, 2, 3, and 4, indicating which arbitrary group each sentence was assigned to. The output is in the same order as the original samples.) To get the points into two-dimensional space, I use Principal Component Analysis (PCA). Note that I perform kmeans clustering on the full data, not on the dimension-reduced output. I then use matplotlib's ax.annotate() to decorate my plot with the original sentences. (I also make the figure bigger so there is space between the points.) I can comment on this further upon request.

from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

x = ['this is very good show' , 'i had a great time on my school trip', 'such a boring movie', 'Springbreak was amazing', 'You are wrong', 'This food is so tasty', 'I had so much fun last night', 'This is crap', 'I had a bad time last month',
    'i love this product' , 'this is an amazing item', 'this food is delicious', 'I had a great time last night', 'thats right',
     'this is my favourite restaurant' , 'i love this food, its so good', 'skiing is the best sport', 'what is this', 'this product has a lot of bugs',
     'I love basketball, its very dynamic' , 'its a shame that you missed the trip', 'game last night was amazing', 'Party last night was so boring',
     'such a nice song' , 'this is the best movie ever', 'hawaii is the best place for trip','how that happened','This is my favourite band',
     'I cant believe that you did that', 'Why are you doing that, I do not gete it', 'this is tasty', 'this song is amazing']

# Turn each sentence into a bag-of-words count vector (sparse matrix).
cv = CountVectorizer(analyzer='word', max_features=5000, lowercase=True, preprocessor=None, tokenizer=None, stop_words='english')
vectors = cv.fit_transform(x)

# Cluster on the full-dimensional vectors, not on the PCA output.
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=0)
kmean_indices = kmeans.fit_predict(vectors)

# PCA needs a dense array; the 2-component projection is used for plotting only.
pca = PCA(n_components=2)
scatter_plot_points = pca.fit_transform(vectors.toarray())

# One color per cluster label (0 to 4).
colors = ["r", "b", "c", "y", "m"]

x_axis = scatter_plot_points[:, 0]
y_axis = scatter_plot_points[:, 1]
fig, ax = plt.subplots(figsize=(20, 10))

ax.scatter(x_axis, y_axis, c=[colors[d] for d in kmean_indices])

# Label every point with its source sentence.
for i, txt in enumerate(x):
    ax.annotate(txt, (x_axis[i], y_axis[i]))

plt.show()
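
Since the clustering runs on the full vectors and only the plot uses the 2D projection, clusters can look overlapped on screen even when they are separate in the original space. A rough way to gauge how much the projection keeps is PCA's explained variance ratio. This is a sketch on a hypothetical sample with assumed variable names, not part of the original answer.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample; substitute the full list of sentences.
sentences = ['this food is so tasty', 'this food is delicious', 'i love this food',
             'such a boring movie', 'this is the best movie ever', 'a boring movie night']

vectors = CountVectorizer(stop_words='english').fit_transform(sentences)

pca = PCA(n_components=2)
points = pca.fit_transform(vectors.toarray())

# Fraction of the total variance captured by each of the two plotted axes.
ratio = pca.explained_variance_ratio_
print(ratio)
print(ratio.sum())
```

A low sum suggests the picture hides much of the structure kmeans actually used, so the plot is better read as an illustration than as evidence of cluster quality.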
