k-means for text clustering


Question


I'm trying to implement k-means for text clustering, specifically English sentences. So far I'm at the point where I have a term frequency matrix for each document (sentence). I'm a little confused on the actual implementation of k-means on text data. Here's my guess of how it should work.

  1. Calculate the number of unique words across all the sentences (it's a large number; call it n).


  2. Create k n-dimensional vectors (clusters) and fill in the values of the k vectors with some random numbers (how do I decide what the bounds for these numbers are?)


  3. Determine the Euclidean distance from each of the q sentences to the random k clusters, reposition the clusters, etc. (If n is very large, as it is for English, wouldn't calculating the Euclidean distance for these vectors be very costly?)
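The steps above can be sketched in plain Python. This is a toy version (the sentences and function names are made up for illustration), and instead of filling the centroids with random numbers in some range, it seeds them with randomly chosen data points, which is the usual way to sidestep the bounds question:

```python
import random
from collections import Counter

def tf_vectors(sentences):
    """Build term-frequency vectors over the shared vocabulary (the n unique words)."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    vecs = []
    for s in sentences:
        counts = Counter(s.lower().split())
        vecs.append([counts.get(w, 0) for w in vocab])
    return vecs, vocab

def kmeans(vecs, k, iters=20, seed=0):
    """Plain k-means with Euclidean distance; centroids start as random data points."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vecs, k)]
    assign = [0] * len(vecs)
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        for i, v in enumerate(vecs):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [vecs[i] for i in range(len(vecs)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

sentences = ["the cat sat", "the cat ran", "stocks fell today", "stocks rose today"]
vecs, vocab = tf_vectors(sentences)
labels = kmeans(vecs, k=2)
```

On the cost worry in step 3: the vectors are extremely sparse (most entries are 0), so real implementations store only the nonzero terms and the distance computation scales with sentence length, not vocabulary size.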

Thanks for your insight!

Answer

Too long for a comment.


If you have a document term matrix, then find the principal components (of the covariance matrix). Determine the coefficients of the original data in the principal component space. You can do k-means clustering in this space.
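The projection step the answer describes can be sketched with numpy (the document-term matrix here is a toy example): the right singular vectors of the centered matrix are the principal components of its covariance matrix, and the coefficients are the data projected onto them.

```python
import numpy as np

# Toy document-term matrix: 4 documents over 6 terms (made-up counts).
X = np.array([
    [2, 1, 0, 0, 0, 1],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 1, 1, 2, 0],
], dtype=float)

# Principal components of the covariance matrix via SVD of the centered data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

d = 2                    # keep the top d components
coeffs = Xc @ Vt[:d].T   # coefficients of each document in PC space
```

k-means (or any other clusterer) then runs on `coeffs`, which has d columns instead of one column per vocabulary word.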


With text data, you generally need a bunch of dimensions -- 20, 50, 100, or even more. Also, I would recommend Gaussian mixture models/expectation-maximization clustering instead of k-means, but that is another story.
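The GMM/EM alternative mentioned above can be illustrated with a tiny one-dimensional, two-component EM loop (pure Python, toy data; in practice you would run a library implementation on the reduced principal-component coefficients):

```python
import math

def em_gmm_1d(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture; a soft version of k-means."""
    # crude but deterministic init for this toy sketch: the two extremes
    mu = [min(xs), max(xs)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [
                pi[k]
                * math.exp(-((x - mu[k]) ** 2) / (2 * var[k]))
                / math.sqrt(2 * math.pi * var[k])
                for k in range(2)
            ]
            z = p[0] + p[1]
            resp.append([p[0] / z, p[1] / z])
        # M-step: re-estimate weights, means, variances from responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk + 1e-6
    return mu, resp

xs = [0.1, 0.3, 0.2, 5.0, 5.2, 4.9]
mu, resp = em_gmm_1d(xs)
```

Unlike hard k-means assignments, the responsibilities in `resp` are soft: each point gets a probability of belonging to each component.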
