What are some good ways of estimating 'approximate' semantic similarity between sentences?

Question

I have been looking at the nlp tag on SO for the past couple of hours and am confident I did not miss anything, but if I did, please do point me to the question.

In the meantime, though, I will describe what I am trying to do. A common notion I observed in many posts is that semantic similarity is difficult. For instance, in this post, the accepted solution suggests the following:

First of all, neither from the perspective of computational linguistics nor of theoretical linguistics is it clear what the term 'semantic similarity' means exactly. .... Consider these examples:

1. Pete and Rob have found a dog near the station.
2. Pete and Rob have never found a dog near the station.
3. Pete and Rob both like programming a lot.
4. Patricia found a dog near the station.
5. It was a dog who found Pete and Rob under the snow.

Which of the sentences 2-4 are similar to 1? 2 is the exact opposite of 1, still it is about Pete and Rob (not) finding a dog.

My high-level requirement is to use k-means clustering and categorize the text based on semantic similarity, so all I need to know is whether the sentences are an approximate match. For instance, in the above example, I am OK with classifying 1, 2, 4, 5 into one category and 3 into another (of course, 3 would be backed up with some more similar sentences). Something like "find related articles", but they don't have to be 100% related.
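
As an illustration of that pipeline, below is a minimal sketch of the clustering step in Python, assuming scikit-learn is available. The TF-IDF vectors and the cluster count are placeholder choices of mine, not something from the question; in particular, TF-IDF is only a stand-in for whatever representation ends up answering the open question below.

    # Minimal k-means sketch over the example sentences. TF-IDF is only a
    # placeholder representation; the real choice of sentence vectors is
    # exactly the open question discussed below.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    sentences = [
        "Pete and Rob have found a dog near the station.",
        "Pete and Rob have never found a dog near the station.",
        "Pete and Rob both like programming a lot.",
        "Patricia found a dog near the station.",
        "It was a dog who found Pete and Rob under the snow.",
    ]

    vectors = TfidfVectorizer().fit_transform(sentences)  # sparse doc-term matrix
    labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(vectors)
    print(labels)  # cluster assignments; ideally sentence 3 lands alone

With plain bag-of-words features the grouping reflects surface word overlap rather than meaning, which is exactly why the choice of representation matters.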

I am thinking I need to ultimately construct vector representations of each sentence, sort of like its fingerprint, but exactly what this vector should contain is still an open question for me. Is it n-grams, something from WordNet, just the individual stemmed words, or something else altogether?
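
As an example of the cheapest of those options, the "individual stemmed words" representation could be prototyped as below. The stem_tokens helper is a hypothetical name of mine, NLTK's Porter stemmer is one arbitrary choice of stemmer, and switching CountVectorizer's ngram_range would give the n-gram variant instead.

    # Sketch of the "individual stemmed words" representation, assuming
    # NLTK and scikit-learn are available. Stemming collapses inflected
    # forms ("dogs" -> "dog") into a single vector dimension.
    import re
    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import CountVectorizer

    stemmer = PorterStemmer()

    def stem_tokens(text):
        # Lowercase, split into alphabetic tokens, then stem each token.
        return [stemmer.stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]

    # Passing ngram_range=(1, 2) here would switch to the n-gram variant.
    vectorizer = CountVectorizer(tokenizer=stem_tokens, lowercase=False)
    X = vectorizer.fit_transform(["Pete found a dog", "Patricia finds dogs"])
    print(vectorizer.get_feature_names_out())  # stem vocabulary, e.g. 'dog', 'find'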

This thread did a fantastic job of enumerating all the related techniques, but unfortunately stopped just when it got to what I wanted. Any suggestions on the latest state of the art in this area?

Answer

Latent Semantic Modeling could be useful. It's basically just another application of the singular value decomposition. SVDLIBC is a pretty nice C implementation of this approach, an oldie but a goodie, and there are even Python bindings in the form of sparsesvd.
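
As a sketch of what this suggestion looks like end to end: factor a TF-IDF matrix with a truncated SVD and compare sentences in the reduced latent space. scikit-learn's TruncatedSVD is used below as a stand-in for SVDLIBC/sparsesvd (my substitution, for the sake of a self-contained example; the decomposition is the same idea), reusing the example sentences from the question.

    # LSA sketch: TF-IDF matrix factored with a truncated SVD, then compared
    # in the reduced latent space.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    sentences = [
        "Pete and Rob have found a dog near the station.",
        "Pete and Rob have never found a dog near the station.",
        "Pete and Rob both like programming a lot.",
        "Patricia found a dog near the station.",
        "It was a dog who found Pete and Rob under the snow.",
    ]

    tfidf = TfidfVectorizer().fit_transform(sentences)  # sparse doc-term matrix
    lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

    # Pairwise cosine similarity in the latent space; values near 1.0 mark
    # sentences the model treats as "approximately" similar.
    print(cosine_similarity(lsa).round(2))

The reduced lsa vectors can then be fed directly into the k-means step described in the question.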
