How to cluster similar sentences using BERT


Problem Description

For ElMo, FastText and Word2Vec, I'm averaging the word embeddings within a sentence and using HDBSCAN/KMeans clustering to group similar sentences.
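A minimal sketch of this averaging-plus-KMeans approach, using toy two-dimensional vectors in place of real Word2Vec/FastText/ELMo lookups (the vocabulary and all vector values are illustrative, not from any trained model):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy word vectors standing in for a trained Word2Vec/FastText model
# (hypothetical values chosen so the two topics are clearly separated)
word_vectors = {
    "dog": np.array([1.0, 0.0]),
    "cat": np.array([0.9, 0.1]),
    "barks": np.array([0.8, 0.2]),
    "stock": np.array([0.0, 1.0]),
    "market": np.array([0.1, 0.9]),
    "rises": np.array([0.2, 0.8]),
}

def sentence_embedding(sentence):
    # Average the vectors of all in-vocabulary tokens in the sentence
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0)

sentences = ["dog barks", "cat barks", "stock market rises", "market rises"]
X = np.stack([sentence_embedding(s) for s in sentences])

# Group the averaged sentence vectors with K-means
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

With a real model, the dictionary lookup would be replaced by the embedding model's own token-to-vector mapping; the averaging and clustering steps stay the same.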

A good example of the implementation can be seen in this short article: http://ai.intelligentonlinetools.com/ml/text-clustering-word-embedding-machine-learning/

I would like to do the same thing using BERT (using the BERT Python package from Hugging Face); however, I am rather unfamiliar with how to extract the raw word/sentence vectors in order to input them into a clustering algorithm. I know that BERT can output sentence representations - so how would I actually extract the raw vectors from a sentence?

Any information would be helpful.

Answer

You can use Sentence Transformers to generate the sentence embeddings. These embeddings are much more meaningful than the ones obtained from bert-as-service, as the models have been fine-tuned so that semantically similar sentences get a higher similarity score. If the number of sentences to be clustered is in the millions or more, you can use a FAISS-based clustering algorithm, since vanilla K-means-style clustering takes quadratic time.
