Python K-means clustering on document


Problem description

Python code:

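from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

subject1 = ['data mining', 'web mining', 'electronic engineering',
            'cloud computing', 'Smart Biomaterials', 'Mathematical modeling']
subject2 = ['Computer Science', 'Engineering', 'Biology']

# Turn each subject1 entry into a TF-IDF vector.
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True)
tfidf_matrix = tfidf_vectorizer.fit_transform(subject1)
print(tfidf_matrix)

# Cluster the vectors into three groups, one per subject2 entry.
km = KMeans(n_clusters=3)
km.fit(tfidf_matrix)
cen = km.cluster_centers_
label = km.labels_

# Print the cluster id assigned to each subject1 entry.
for subject, cluster_id in zip(subject1, label):
    print(cluster_id, subject)  # should be 'computer science: web mining, data mining, cloud computing'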

subject1 refers to specific areas and subject2 refers to general areas. I am trying to cluster subject1 with K-means into three clusters that match subject2. I don't know what I am missing.

Recommended answer

It is not really clear what you want to achieve here. To use the k-means algorithm, you need to be clear about two basic questions:

  1. What is your input data? The k-means algorithm usually works on a single set of data objects, where each object can be described by multiple attributes. So you need to decide whether you want to cluster only subject1, or whether you want to integrate information from subject2, e.g. by adding attributes to the items from subject1.
  2. What is your distance measure? The crucial part of k-means is finding the nearest centroid, which requires a meaningful distance measure for your data. This might be a simple character-based distance or a more specialized measure based on your data's features. The important thing is that your distance measure captures the aspects of your data that make two items similar.

If you want to assign certain labels to your clusters (subject2?), this would be done after running the regular k-means algorithm, e.g. by inspecting the clusters it found.

This is only a very general guideline for approaching the algorithm. If you provide more detailed information on what you have and what you want to achieve, we might be able to give better assistance.
