What are some good ways of estimating 'approximate' semantic similarity between sentences?

Question

I have been looking at the nlp tag on SO for the past couple of hours and am confident I did not miss anything, but if I did, please do point me to the question.

In the meantime, though, I will describe what I am trying to do. A common notion I have observed in many posts is that semantic similarity is difficult. For instance, from this post, the accepted solution suggests the following:

First of all, neither from the perspective of computational 
linguistics nor of theoretical linguistics is it clear what 
the term 'semantic similarity' means exactly. .... 
Consider these examples:

Pete and Rob have found a dog near the station.
Pete and Rob have never found a dog near the station.
Pete and Rob both like programming a lot.
Patricia found a dog near the station.
It was a dog who found Pete and Rob under the snow.

Which of the sentences 2-4 are similar to 1? 2 is the exact 
opposite of 1, still it is about Pete and Rob (not) finding a 
dog.

My high-level requirement is to use k-means clustering to categorize the text based on semantic similarity, so all I need to know is whether two pieces of text are an approximate match. For instance, in the above example, I am OK with classifying 1, 2, 4, 5 into one category and 3 into another (of course, 3 would be backed up by some more similar sentences). Something like "find related articles", but they don't have to be 100% related.
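
For concreteness, here is a minimal sketch of the clustering step, assuming scikit-learn is available and using a random placeholder matrix in place of whatever sentence vectors end up being chosen (building those vectors is the open question below):

    # Minimal k-means sketch; sentence_vectors is a hypothetical placeholder
    # of shape (n_sentences, n_features) for the real sentence fingerprints.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    sentence_vectors = rng.random((5, 100))  # stand-in for real vectors

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(sentence_vectors)
    # With real vectors, the hope is that sentences 1, 2, 4, 5 land in one
    # cluster and sentence 3 in the other.
    print(labels)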

I am thinking I ultimately need to construct a vector representation of each sentence, sort of like its fingerprint, but exactly what this vector should contain is still an open question for me. Is it n-grams, something from WordNet, just the individual stemmed words, or something else altogether?
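
As one concrete illustration of the "stemmed words" and "n-grams" options, here is a sketch using scikit-learn's TfidfVectorizer together with NLTK's PorterStemmer; the tokenizer and parameter choices are illustrative assumptions, not anything prescribed in the original post:

    # Sketch: TF-IDF over stemmed unigrams and bigrams as one candidate
    # sentence "fingerprint". Assumes scikit-learn and NLTK are installed.
    import re

    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer

    stemmer = PorterStemmer()

    def stem_tokens(text):
        # Lowercase, keep alphabetic tokens, then Porter-stem each one.
        return [stemmer.stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]

    sentences = [
        "Pete and Rob have found a dog near the station.",
        "Pete and Rob have never found a dog near the station.",
        "Pete and Rob both like programming a lot.",
        "Patricia found a dog near the station.",
        "It was a dog who found Pete and Rob under the snow.",
    ]

    vectorizer = TfidfVectorizer(tokenizer=stem_tokens, ngram_range=(1, 2))
    X = vectorizer.fit_transform(sentences)  # sparse (5, n_features) matrix

Rows of X (or a dimensionality-reduced version of them, as in the answer below) could then replace the placeholder vectors in the k-means sketch above.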

This thread did a fantastic job of enumerating all the related techniques, but unfortunately it stopped just when it got to what I wanted. Any suggestions on the latest state of the art in this area?

Answer

Latent Semantic Modeling could be useful. It's basically just yet another application of the singular value decomposition. SVDLIBC is a pretty nice C implementation of this approach (an oldie but a goodie), and there are even Python bindings in the form of sparsesvd.
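
sparsesvd exposes the decomposition itself; purely as an illustration of the latent semantic idea, the sketch below uses scikit-learn's TruncatedSVD (a different implementation of the same truncated SVD) to project TF-IDF vectors into a low-rank space, where cosine similarity tends to behave more "semantically" than in raw term space:

    # Sketch: LSA as TF-IDF followed by a truncated SVD. TruncatedSVD stands
    # in for SVDLIBC/sparsesvd here; a rank of 2 is an arbitrary choice for
    # a toy corpus this small.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    sentences = [
        "Pete and Rob have found a dog near the station.",
        "Pete and Rob have never found a dog near the station.",
        "Pete and Rob both like programming a lot.",
        "Patricia found a dog near the station.",
        "It was a dog who found Pete and Rob under the snow.",
    ]

    X = TfidfVectorizer().fit_transform(sentences)
    lsa = TruncatedSVD(n_components=2, random_state=0)
    Z = lsa.fit_transform(X)  # dense (5, 2) latent representation

    # Pairwise cosine similarities in the latent space; the rows of Z can
    # also be fed to k-means directly.
    print(cosine_similarity(Z))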
