What are some good ways of estimating 'approximate' semantic similarity between sentences?

Question

I have been looking at the nlp tag on SO for the past couple of hours and am confident I did not miss anything, but if I did, please do point me to the question.

In the meantime, though, I will describe what I am trying to do. A common notion I observed in many posts is that semantic similarity is difficult. For instance, in this post, the accepted solution suggests the following:

First of all, neither from the perspective of computational linguistics nor of theoretical linguistics is it clear what the term 'semantic similarity' means exactly. .... Consider these examples:

1. Pete and Rob have found a dog near the station.
2. Pete and Rob have never found a dog near the station.
3. Pete and Rob both like programming a lot.
4. Patricia found a dog near the station.
5. It was a dog who found Pete and Rob under the snow.

Which of the sentences 2-4 are similar to 1? 2 is the exact opposite of 1, still it is about Pete and Rob (not) finding a dog.

My high-level requirement is to use k-means clustering and categorize the text based on semantic similarity, so all I need to know is whether the sentences are an approximate match. For instance, in the above example, I am OK with classifying 1, 2, 4, 5 into one category and 3 into another (of course, 3 would be backed up with some more similar sentences). Something like "find related articles", but they don't have to be 100% related.
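
As an illustration of that pipeline, below is a minimal sketch of the clustering step in Python, assuming scikit-learn is available. The TF-IDF vectors and the cluster count are placeholder choices of mine, not something from the question; in particular, TF-IDF is only a stand-in for whatever representation ends up answering the open question below.

    # Minimal k-means sketch over the example sentences. TF-IDF is only a
    # placeholder representation; the real choice of sentence vectors is
    # exactly the open question discussed below.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    sentences = [
        "Pete and Rob have found a dog near the station.",
        "Pete and Rob have never found a dog near the station.",
        "Pete and Rob both like programming a lot.",
        "Patricia found a dog near the station.",
        "It was a dog who found Pete and Rob under the snow.",
    ]

    vectors = TfidfVectorizer().fit_transform(sentences)  # sparse doc-term matrix
    labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(vectors)
    print(labels)  # cluster assignments; ideally sentence 3 lands alone

With plain bag-of-words features the grouping reflects surface word overlap rather than meaning, which is exactly why the choice of representation matters.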

I am thinking I need to ultimately construct vector representations of each sentence, sort of like its fingerprint, but exactly what this vector should contain is still an open question for me. Is it n-grams, something from WordNet, just the individual stemmed words, or something else altogether?
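
As an example of the cheapest of those options, the "individual stemmed words" representation could be prototyped as below. The stem_tokens helper is a hypothetical name of mine, NLTK's Porter stemmer is one arbitrary choice of stemmer, and switching CountVectorizer's ngram_range would give the n-gram variant instead.

    # Sketch of the "individual stemmed words" representation, assuming
    # NLTK and scikit-learn are available. Stemming collapses inflected
    # forms ("dogs" -> "dog") into a single vector dimension.
    import re
    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import CountVectorizer

    stemmer = PorterStemmer()

    def stem_tokens(text):
        # Lowercase, split into alphabetic tokens, then stem each token.
        return [stemmer.stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]

    # Passing ngram_range=(1, 2) here would switch to the n-gram variant.
    vectorizer = CountVectorizer(tokenizer=stem_tokens, lowercase=False)
    X = vectorizer.fit_transform(["Pete found a dog", "Patricia finds dogs"])
    print(vectorizer.get_feature_names_out())  # stem vocabulary, e.g. 'dog', 'find'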

This thread did a fantastic job of enumerating all the related techniques, but unfortunately stopped just when it got to what I wanted. Any suggestions on the latest state of the art in this area?

Answer

Latent Semantic Modeling could be useful. It's basically just another application of the singular value decomposition. SVDLIBC is a pretty nice C implementation of this approach, an oldie but a goodie, and there are even Python bindings in the form of sparsesvd.
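
As a sketch of what this suggestion looks like end to end: factor a TF-IDF matrix with a truncated SVD and compare sentences in the reduced latent space. scikit-learn's TruncatedSVD is used below as a stand-in for SVDLIBC/sparsesvd (my substitution, for the sake of a self-contained example; the decomposition is the same idea), reusing the example sentences from the question.

    # LSA sketch: TF-IDF matrix factored with a truncated SVD, then compared
    # in the reduced latent space.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    sentences = [
        "Pete and Rob have found a dog near the station.",
        "Pete and Rob have never found a dog near the station.",
        "Pete and Rob both like programming a lot.",
        "Patricia found a dog near the station.",
        "It was a dog who found Pete and Rob under the snow.",
    ]

    tfidf = TfidfVectorizer().fit_transform(sentences)  # sparse doc-term matrix
    lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

    # Pairwise cosine similarity in the latent space; values near 1.0 mark
    # sentences the model treats as "approximately" similar.
    print(cosine_similarity(lsa).round(2))

The reduced lsa vectors can then be fed directly into the k-means step described in the question.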
