What are some good ways of estimating 'approximate' semantic similarity between sentences?
Question
I have been looking at the nlp tag on SO for the past couple of hours and am confident I did not miss anything, but if I did, please do point me to the question.
In the meantime, though, I will describe what I am trying to do. A common notion I have observed in many posts is that semantic similarity is difficult. For instance, from this post, the accepted solution suggests the following:
First of all, neither from the perspective of computational
linguistics nor of theoretical linguistics is it clear what
the term 'semantic similarity' means exactly. ....
Consider these examples:
Pete and Rob have found a dog near the station.
Pete and Rob have never found a dog near the station.
Pete and Rob both like programming a lot.
Patricia found a dog near the station.
It was a dog who found Pete and Rob under the snow.
Which of the sentences 2-4 are similar to 1? 2 is the exact
opposite of 1, still it is about Pete and Rob (not) finding a
dog.
My high-level requirement is to use k-means clustering and categorize the text based on semantic similarity, so all I need to know is whether the sentences are an approximate match. For instance, in the above example, I am OK with classifying 1, 2, 4, 5 into one category and 3 into another (of course, 3 would be backed up by some more similar sentences). Something like "find related articles", but they don't have to be 100% related.
I am thinking I ultimately need to construct a vector representation of each sentence, sort of like its fingerprint, but exactly what this vector should contain is still an open question for me. Should it be n-grams, something from WordNet, just the individual stemmed words, or something else altogether?
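As one concrete baseline for such a fingerprint (my suggestion, not something prescribed in the post), each sentence can be turned into a TF-IDF bag-of-words vector; scikit-learn is an assumed dependency here:

```python
# Minimal sketch of a sentence "fingerprint": TF-IDF weighted bag-of-words.
# scikit-learn is an assumption of this sketch, not part of the original post.
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Pete and Rob have found a dog near the station.",
    "Pete and Rob have never found a dog near the station.",
    "Pete and Rob both like programming a lot.",
    "Patricia found a dog near the station.",
    "It was a dog who found Pete and Rob under the snow.",
]

# One sparse row vector per sentence; stop words dropped so the vectors
# are dominated by content words (Pete, Rob, dog, station, ...).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)
print(X.shape)  # (5, vocabulary size)
```

Note that such a representation happily puts sentences 1 and 2 close together (they share almost all content words), which matches the "approximate match" requirement above but ignores the negation.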
This thread did a fantastic job of enumerating all the related techniques, but unfortunately stopped just when it got to what I wanted. Any suggestions on the latest state of the art in this area?
Answer
Latent Semantic Modeling could be useful. It's basically just yet another application of the singular value decomposition (SVD). SVDLIBC is a pretty nice C implementation of this approach, an oldie but a goodie, and there are even Python bindings in the form of sparsesvd.
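To make the idea concrete, here is a sketch of the full pipeline: TF-IDF vectors, a truncated SVD to get low-rank "semantic" coordinates, then k-means on those coordinates. I am substituting scikit-learn's TruncatedSVD for SVDLIBC/sparsesvd purely for brevity; the answer itself only names the latter two:

```python
# Sketch of latent semantic analysis + k-means, per the question's requirement.
# TruncatedSVD stands in for SVDLIBC/sparsesvd here; that swap is mine.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

sentences = [
    "Pete and Rob have found a dog near the station.",
    "Pete and Rob have never found a dog near the station.",
    "Pete and Rob both like programming a lot.",
    "Patricia found a dog near the station.",
    "It was a dog who found Pete and Rob under the snow.",
]

# Sparse term-document matrix, then a rank-2 SVD projection.
X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Cluster the sentences in the reduced "semantic" space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
print(labels)
```

On real data you would use many more components (hundreds is typical) and pick the number of clusters to suit the corpus; with this toy input the hope is simply that the dog/station sentences land in one cluster and the programming sentence in the other.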