Clustering a list of words in Python


Question


I am a newbie in text mining; here is my situation. Suppose I have a list of words ['car', 'dog', 'puppy', 'vehicle'] and I would like to cluster the words into k groups, with the desired output being [['car', 'vehicle'], ['dog', 'puppy']]. I first calculate a similarity score for each pair of words to obtain a 4x4 matrix (in this case) M, where Mij is the similarity score of words i and j. After transforming the words into numeric data, I use a clustering library (such as sklearn), or implement the clustering myself, to get the word clusters.


I want to know whether this approach makes sense. Besides, how do I determine the value of k? More importantly, I know that different clustering techniques exist; should I use k-means or k-medoids for word clustering?

Answer


Following up on the answer by Brian O'Donnell: once you've computed the semantic similarity with word2vec (or FastText or GloVe, ...), you can then cluster the matrix using sklearn.cluster. I've found that for small matrices, spectral clustering gives the best results.
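For instance, a sketch with sklearn's SpectralClustering run directly on a precomputed similarity matrix (the toy vectors here are placeholders for real embeddings):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Placeholder vectors; real ones would come from word2vec/GloVe/FastText
words = ['car', 'dog', 'puppy', 'vehicle']
vectors = np.array([
    [0.9, 0.1],   # car
    [0.1, 0.9],   # dog
    [0.2, 0.8],   # puppy
    [0.8, 0.2],   # vehicle
])

# Cosine similarity matrix; all entries are non-negative here, so it
# can be fed to SpectralClustering directly as the affinity
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
M = unit @ unit.T

labels = SpectralClustering(
    n_clusters=2, affinity='precomputed', random_state=0
).fit_predict(M)
```

Note that `affinity='precomputed'` expects a similarity (higher = closer), not a distance, and the matrix must be non-negative; with embeddings whose cosine similarity can go negative, a shift or kernel transform is applied first.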


It's worth keeping in mind that the word vectors are often embedded on a high-dimensional sphere. K-means with a Euclidean distance matrix fails to capture this, and may lead to poor results for the similarity of words that aren't immediate neighbors.
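One common workaround, sketched below with made-up vectors: L2-normalize the embeddings before clustering, so that Euclidean k-means on the unit vectors behaves like clustering by cosine similarity (a cheap stand-in for spherical k-means):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up embeddings where raw Euclidean distance is misleading:
# 'car' and 'vehicle' point in the same direction but differ in length
words = ['car', 'vehicle', 'dog', 'puppy']
vectors = np.array([
    [10.0, 1.0],   # car     (long vector)
    [1.0, 0.1],    # vehicle (same direction, short vector)
    [0.1, 1.0],    # dog
    [0.2, 2.0],    # puppy
])

# On unit vectors, squared Euclidean distance = 2 * (1 - cosine
# similarity), so plain k-means now groups by direction, not magnitude
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(unit)
```

After normalization, 'car' and 'vehicle' collapse onto nearly the same point on the unit circle and end up in one cluster, whereas unnormalized k-means would likely isolate the long 'car' vector on its own.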

