Effective clustering of a similarity matrix

Views: 305 | Published: 2020/5/7 19:01:45 | Tags: matrix, machine-learning, cluster-analysis, distance, similarity

Problem description

My topic is the similarity and clustering of a collection of texts. In a nutshell: I want to cluster the collected texts so that they end up in meaningful clusters. My approach so far is described below; my problem lies in the clustering step. The current software is written in PHP.

1) Similarity: I treat every document as a "bag-of-words" and convert words into vectors. I use

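filtering (only "real" words)
tokenization (splitting sentences into words)
stemming (reducing words to their base form; Porter's stemmer)
pruning (cutting off words with too high and too low frequency)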

as methods for dimensionality reduction. After that, I use cosine similarity (as suggested and described on various sites on the web and here).
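
A minimal Python sketch of the pipeline described above (the software in question is PHP; Python is used here only for illustration). It covers filtering, tokenization, pruning by document frequency, and cosine similarity on bag-of-words counts. Stemming is left out to keep the block free of dependencies; a Porter stemmer would be applied to each token before counting. The function names, the sample texts, and the pruning parameters `min_df` and `max_df_ratio` are assumptions made for this example, not values from the post.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Filtering + tokenization: keep lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def build_vectors(docs, min_df=1, max_df_ratio=1.0):
    """Build one bag-of-words Counter per document.

    Pruning: a term is kept only if it occurs in at least `min_df` documents
    and in at most `max_df_ratio` of all documents (hypothetical knobs).
    """
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    keep = {
        term
        for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df_ratio
    }
    return [Counter(t for t in tokens if t in keep) for tokens in tokenized]


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Made-up sample texts, only to exercise the functions.
docs = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices fell sharply today",
]
vectors = build_vectors(docs)
similarity_percent = [
    [round(100 * cosine_similarity(u, v)) for v in vectors] for u in vectors
]
```

With the defaults nothing is pruned; tightening `min_df` or `max_df_ratio` shrinks the vocabulary and therefore the vector dimension.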

The result then is a similarity matrix like this:

        A   B   C   D   E 
    A   0  30  51  75  80
    B   X   0  21  55  70
    C   X   X   0  25  10
    D   X   X   X   0  15
    E   X   X   X   X   0

A…E are my texts and each number is a similarity in percent; the higher the number, the more similar the texts. Because sim(A,B) == sim(B,A), only half of the matrix is filled in. So the similarity of text A to text D is 75%.
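
Because only the upper half is stored, a lookup such as sim(D, A) first needs the matrix mirrored across the diagonal. A small Python sketch, assuming the table above is held as a list of rows with `None` in the cells shown as X (the names `upper`, `symmetrize`, and `lookup` are made up for this example):

```python
labels = ["A", "B", "C", "D", "E"]

# Upper triangle copied from the table above; None marks the cells shown as X.
upper = [
    [0, 30, 51, 75, 80],
    [None, 0, 21, 55, 70],
    [None, None, 0, 25, 10],
    [None, None, None, 0, 15],
    [None, None, None, None, 0],
]


def symmetrize(upper):
    """Mirror the upper triangle so that full[i][j] == full[j][i]."""
    n = len(upper)
    full = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            full[i][j] = full[j][i] = upper[i][j]
    return full


sim = symmetrize(upper)


def lookup(a, b):
    """Similarity in percent between two texts, in either argument order."""
    return sim[labels.index(a)][labels.index(b)]
```

After this step `lookup("D", "A")` and `lookup("A", "D")` both read the same table cell.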

I now want to generate an a priori unknown (!) number of clusters from this matrix. Each cluster should group similar items together (up to a certain stopping criterion).

I tried a basic implementation myself, which was basically like this (with 60% as a fixed similarity threshold):

    foreach article
        get similar entries where sim > 60
        foreach similar entry
            check if one of the entries already has a cluster number
            if no: assign new cluster number to all similar entries
            if yes: use that number
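
One possible reading of this pseudocode as runnable Python, applied to the example matrix. The pseudocode leaves some details open, so the following are assumptions of this sketch: the article itself joins the group of its similar entries, the first existing cluster number found in the group is reused, entries that already have a number keep it, and articles with no similar entry stay unassigned.

```python
LABELS = ["A", "B", "C", "D", "E"]

# Full symmetric version of the example matrix (similarity in percent).
SIM = [
    [0, 30, 51, 75, 80],
    [30, 0, 21, 55, 70],
    [51, 21, 0, 25, 10],
    [75, 55, 25, 0, 15],
    [80, 70, 10, 15, 0],
]


def threshold_clusters(sim, threshold=60):
    """Assign cluster numbers by a fixed similarity threshold."""
    n = len(sim)
    cluster_of = [None] * n
    next_id = 0
    for article in range(n):
        similar = [
            j for j in range(n) if j != article and sim[article][j] > threshold
        ]
        if not similar:
            continue  # no similar entry: the article stays unassigned (None)
        group = [article] + similar
        existing = [cluster_of[j] for j in group if cluster_of[j] is not None]
        if existing:
            cluster_id = existing[0]  # reuse a number already in the group
        else:
            cluster_id = next_id  # otherwise open a new cluster
            next_id += 1
        for j in group:
            if cluster_of[j] is None:
                cluster_of[j] = cluster_id
    return cluster_of


assignment = dict(zip(LABELS, threshold_clusters(SIM)))
# assignment == {"A": 0, "B": 0, "C": None, "D": 0, "E": 0}
```

Under this reading, A, B, D, and E all land in one cluster although sim(D, E) is only 15 and sim(A, B) only 30: B joins because of E, and E joined because of A. A single link above the threshold is enough to pull an entry in, which is one way such a scheme can grow very large clusters.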

It worked (somehow), but not well at all, and the results were often monster clusters. So I want to redo this. I have already looked into all kinds of clustering algorithms, but I am still not sure which one will work best. I think it should be an agglomerative algorithm, because every pair of texts can be seen as a cluster in the beginning. The open questions remain what the stopping criterion should be and whether the algorithm should divide and/or merge existing clusters.
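
To make the question concrete, here is a hedged Python sketch of one agglomerative variant, not a recommendation of a specific algorithm. Assumptions of this sketch: each text starts as its own cluster, the similarity between two clusters is the average over all cross-cluster pairs (average linkage; single or complete linkage would be alternatives), and merging stops as soon as the best available merge falls below a threshold, here the same 60% as before. The number of clusters is then a result of the threshold rather than an input.

```python
LABELS = ["A", "B", "C", "D", "E"]

# Full symmetric version of the example matrix (similarity in percent).
SIM = [
    [0, 30, 51, 75, 80],
    [30, 0, 21, 55, 70],
    [51, 21, 0, 25, 10],
    [75, 55, 25, 0, 15],
    [80, 70, 10, 15, 0],
]


def agglomerative(sim, labels, stop_below=60):
    """Average-linkage merging until no pair of clusters reaches stop_below."""
    clusters = [[i] for i in range(len(labels))]

    def linkage(c1, c2):
        # Average similarity over all pairs with one member in each cluster.
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best_score, (a, b) = max(
            (linkage(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best_score < stop_below:
            break  # stopping criterion: the best merge is not similar enough
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [[labels[i] for i in c] for c in clusters]


result = agglomerative(SIM, LABELS)
# result == [["A", "E"], ["B"], ["C"], ["D"]]
```

On the example matrix only A and E (80) are merged. The next best candidate is B with D at 55, and B against the cluster {A, E} averages (30 + 70) / 2 = 50, so merging stops. Compared with the fixed-threshold labelling, averaging keeps a single strong link from pulling in an otherwise dissimilar text.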

Sorry if some of this seems basic, but I am relatively new to this field. Thanks for the help.
