Effective clustering of a similarity matrix

Question

My topic is the similarity and clustering of a collection of texts. In a nutshell: I want to cluster the collected texts so that they end up in meaningful clusters. My approach so far is described below; my problem is with the clustering step. The current software is written in PHP.

1) Similarity: I treat every document as a "bag of words" and convert the words into vectors. I use

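  • filtering (only "real" words)
  • tokenization (splitting sentences into words)
  • stemming (reducing words to their base form; Porter's stemmer)
  • pruning (cutting off words whose frequency is too high or too low)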

as methods for dimensionality reduction. After that, I use cosine similarity (as suggested and described on various sites on the web and here).
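For illustration, the cosine similarity step can be sketched in a few lines. This is a minimal sketch in Python rather than the PHP of the original software, and it assumes raw term counts as vector weights and tokens that have already been filtered, stemmed, and pruned; the question does not say which weighting is used, and the names `term_vector` and `cosine_similarity` are made up for this example.

```python
import math
from collections import Counter


def term_vector(tokens):
    """Bag-of-words vector: maps each token to its raw count (assumed weighting)."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors, in the range 0..1."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical, already preprocessed token lists.
doc1 = term_vector("cluster text similar text".split())
doc2 = term_vector("cluster text matrix".split())

# Multiply by 100 to get a percentage like the values in the matrix.
print(round(100 * cosine_similarity(doc1, doc2)))
```

Computing this for every pair of documents fills one triangle of the similarity matrix.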

The result is a similarity matrix like this:

        A   B   C   D   E 
    A   0  30  51  75  80
    B   X   0  21  55  70
    C   X   X   0  25  10
    D   X   X   X   0  15
    E   X   X   X   X   0

A…E are my texts and the numbers are the similarities in percent; the higher the number, the more similar the texts. Because sim(A,B) == sim(B,A), only half of the matrix is filled in. So the similarity of text A to text D is 75%.

I now want to generate an a priori unknown(!) number of clusters from this matrix. The clusters should group similar items together (up to a certain stopping criterion).

I tried a basic implementation myself, which was basically like this (with 60% as a fixed similarity threshold):

    foreach article
      get similar entries where sim > 60
      foreach similar entry
        check if one of the entries already has a cluster number
        if no: assign new cluster number to all similar entries
        if yes: use that number
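
One possible reading of the pseudocode above, written as runnable Python on the matrix from the question. This is a sketch under assumptions, not the original PHP: the pseudocode does not say what happens to an article with no similar entries (here it gets its own cluster) or which number wins when several already exist (here the first one found), and `threshold_clusters` is a name chosen for this example.

```python
# Similarities in percent from the question's matrix (upper triangle only).
SIM = {
    ("A", "B"): 30, ("A", "C"): 51, ("A", "D"): 75, ("A", "E"): 80,
    ("B", "C"): 21, ("B", "D"): 55, ("B", "E"): 70,
    ("C", "D"): 25, ("C", "E"): 10,
    ("D", "E"): 15,
}
LABELS = ["A", "B", "C", "D", "E"]


def sim(a, b):
    """Symmetric lookup: sim(a, b) == sim(b, a), only one triangle is stored."""
    return SIM[(a, b)] if (a, b) in SIM else SIM[(b, a)]


def threshold_clusters(labels, threshold=60):
    """Assign cluster numbers by grouping each article with its similar entries."""
    cluster_of = {}
    next_id = 1
    for article in labels:
        similar = [o for o in labels if o != article and sim(article, o) > threshold]
        group = [article] + similar
        existing = [cluster_of[x] for x in group if x in cluster_of]
        if existing:
            cid = existing[0]      # "if yes: use that number"
        else:
            cid = next_id          # "if no: assign new cluster number"
            next_id += 1
        for x in group:
            cluster_of.setdefault(x, cid)  # keep numbers that were already assigned
    return cluster_of


print(threshold_clusters(LABELS))
```

With this reading, A, B, D and E all receive the same cluster number, even though sim(D,E) is only 15: D and E are linked only through A. Each membership is decided by a single pair above the threshold, so weakly related texts can be chained into one large cluster.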

It worked (somehow), but not well at all, and the results were often monster clusters. So I want to redo this and have already looked into all kinds of clustering algorithms, but I'm still not sure which one will work best. I think it should be an agglomerative algorithm, because every pair of texts can be seen as a cluster in the beginning. But the open questions remain what the stopping criterion should be and whether the algorithm should split and/or merge existing clusters.
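To make the agglomerative idea concrete, here is a minimal Python sketch on the matrix from the question. It assumes average linkage and uses the 60% threshold as the stopping criterion; both are illustrative choices, not something the question settles. It starts with every text as its own cluster and only merges, never splits, and it favors readability over speed.

```python
from itertools import combinations

# Similarities in percent from the question's matrix (upper triangle only).
SIM = {
    ("A", "B"): 30, ("A", "C"): 51, ("A", "D"): 75, ("A", "E"): 80,
    ("B", "C"): 21, ("B", "D"): 55, ("B", "E"): 70,
    ("C", "D"): 25, ("C", "E"): 10,
    ("D", "E"): 15,
}


def sim(a, b):
    """Symmetric lookup: sim(a, b) == sim(b, a), only one triangle is stored."""
    return SIM[(a, b)] if (a, b) in SIM else SIM[(b, a)]


def average_link(c1, c2):
    """Average pairwise similarity between two clusters (assumed linkage)."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def agglomerate(labels, min_similarity=60):
    """Merge the two most similar clusters until none reach min_similarity."""
    clusters = [[x] for x in labels]  # start: every text is its own cluster
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_link(clusters[p[0]], clusters[p[1]]),
        )
        if average_link(clusters[i], clusters[j]) < min_similarity:
            break  # stopping criterion: best remaining merge is too weak
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


print(agglomerate(["A", "B", "C", "D", "E"]))
```

On this matrix only A and E are merged (80%). After that, the strongest remaining link is B with D at 55%, which is below the threshold, so B, C and D stay on their own. Because a merge requires the average similarity to all members of a cluster to be high, one strong pair is not enough to pull a text into a large cluster.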

Sorry if some of this seems basic, but I am relatively new to this field. Thanks for the help.

Recommended answer

Since you are new to the field, have an unknown number of clusters, and are already using cosine distance, I would recommend the FLAME clustering algorithm.

It's intuitive, easy to implement, and has implementations in a large number of languages (not PHP though, largely because very few people use PHP for data science).

Not to mention, it's good enough to be used in research by a large number of people. If nothing else, you can get an idea of exactly which shortcomings of this clustering algorithm you want to address when moving on to another one.
