Clustering text documents using scikit-learn kmeans in Python

Question

I need to use scikit-learn's KMeans to cluster text documents. The example code works fine as it is, but it takes some 20newsgroups data as input. I want to use the same code to cluster a list of documents, as shown below:

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

What changes do I need to make to the KMeans example code to use this list as input? (Simply setting 'dataset = documents' doesn't work.)

Recommended answer

This is a simpler example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

Vectorize the text, i.e. convert the strings to numeric features:

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
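
To see what the vectorizer produced, it can help to inspect the resulting matrix. The following is a minimal sketch that assumes the same documents list as above; the names n_docs, n_terms and row_norms are only illustrative and not part of the original answer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# X is a sparse matrix: one row per document, one column per distinct term
n_docs, n_terms = X.shape
print(n_docs, "documents,", n_terms, "distinct terms")

# With the default norm='l2', every non-empty row has unit Euclidean length
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
print(row_norms)
```

This is also why passing the raw list to KMeans does not work: KMeans expects a numeric matrix such as X, not a list of strings.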

Cluster the documents:

true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
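
After fitting, the cluster assigned to each input document is available in the labels_ attribute. The sketch below repeats the setup so it runs on its own; random_state=0 is an addition not present in the original answer, used here only so that repeated runs give the same assignment.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

X = TfidfVectorizer(stop_words='english').fit_transform(documents)

# random_state is an assumption added for reproducibility
model = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=1,
               random_state=0)
model.fit(X)

# labels_[i] is the index of the cluster that documents[i] was assigned to
for label, doc in zip(model.labels_, documents):
    print(label, doc)
```

With only nine short documents and n_init=1, the grouping can change from run to run when random_state is not fixed, and the cluster numbers themselves carry no meaning.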

Print the top terms per cluster:

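print("Top terms per cluster:")
# Sort each centroid's term weights in descending order
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i, end="")
    # The ten highest-weighted terms of this centroid
    for ind in order_centroids[i, :10]:
        print(" %s" % terms[ind], end="")
    print()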
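
Once the model is fitted, new documents can be assigned to one of the existing clusters. The following is a minimal sketch, not part of the answer above: the strings in new_docs are made-up examples, and random_state=0 is added only for reproducibility. The important detail is that new text goes through transform(), not fit_transform(), so it is mapped onto the vocabulary the model was trained on.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# random_state is an assumption added for reproducibility
model = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=1,
               random_state=0)
model.fit(X)

# Hypothetical new documents, invented for illustration
new_docs = ["graph of trees", "user interface response time"]

# Reuse the fitted vocabulary and IDF weights; do not refit the vectorizer
Y = vectorizer.transform(new_docs)
predictions = model.predict(Y)

for doc, cluster in zip(new_docs, predictions):
    print(cluster, doc)
```

Words in a new document that never appeared in the training documents are silently ignored by transform().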

If you want a more visual idea of what this looks like, see this answer.
