k-means for text clustering
Question
I'm trying to implement k-means for text clustering, specifically for English sentences. So far I'm at the point where I have a term-frequency matrix for each document (sentence). I'm a little confused about the actual implementation of k-means on text data. Here's my guess at how it should work:
- Figure out the number of unique words across all sentences (there are a lot of them; call it n).
- Create k n-dimensional vectors (clusters) and fill in the values of the k vectors with some random numbers (how do I decide what the bounds for these numbers are?).
- Determine the Euclidean distance from each of the q sentences to the k random clusters, reposition the clusters, etc. (If n is very large, as it is for English, wouldn't calculating the Euclidean distance between these vectors be very costly?)
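The steps above can be sketched in NumPy on a hypothetical toy term-frequency matrix. One common answer to the initialization question is to seed the centroids from randomly chosen sentences rather than random numbers, which sidesteps choosing bounds entirely:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Basic k-means on a term-frequency matrix X (sentences x vocabulary)."""
    rng = np.random.default_rng(seed)
    # Seed centroids from random data points, so no bounds need to be chosen.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Euclidean distance from every sentence to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned sentences.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Toy term-frequency matrix: 4 "sentences" over a 3-word vocabulary.
X = np.array([[2, 0, 0], [3, 1, 0], [0, 0, 2], [0, 1, 3]])
labels, _ = kmeans(X, k=2)
```

This is a dense sketch; for a real English vocabulary the matrix would be sparse, and you would use sparse representations (or dimensionality reduction, as the answer below suggests) rather than computing full n-dimensional distances directly.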
Thanks for any insight!
Answer
This is too long for a comment.
If you have a document-term matrix, then find the principal components (of the covariance matrix) and determine the coefficients of the original data in the principal-component space. You can do k-means clustering in that space.
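A minimal sketch of that pipeline with scikit-learn, using hypothetical toy sentences (PCA on the dense matrix stands in for the principal components of the covariance matrix):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical toy corpus: two "topics" with repeated words for clear structure.
sentences = [
    "cat cat dog",
    "dog cat cat cat",
    "stock market stock",
    "market stock stock stock",
]

# Document-term matrix, then project onto the leading principal components.
X = CountVectorizer().fit_transform(sentences).toarray()
coords = PCA(n_components=2).fit_transform(X)

# k-means in the reduced principal-component space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)
```

Clustering in the reduced space avoids computing distances over the full vocabulary dimension, which addresses the cost concern in the question.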
With text data, you generally need a fair number of dimensions -- 20, 50, 100, or even more. Also, I would recommend Gaussian mixture models / expectation-maximization clustering instead of k-means, but that is another story.
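A minimal sketch of that alternative with scikit-learn's GaussianMixture, run on hypothetical 2-D coordinates such as the PCA-reduced document vectors described above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 2-D coordinates (e.g., PCA-reduced document vectors).
coords = np.array([[0.1, 0.2], [0.0, 0.3], [5.0, 5.1], [5.2, 4.9]])

gm = GaussianMixture(n_components=2, random_state=0).fit(coords)
labels = gm.predict(coords)        # hard cluster labels
probs = gm.predict_proba(coords)   # soft membership probabilities
```

Unlike k-means, the mixture model gives soft assignments, so a sentence that sits between topics is not forced entirely into one cluster.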