What are some good ways of estimating 'approximate' semantic similarity between sentences?
Question
I have been looking at the nlp tag on SO for the past couple of hours and am confident I did not miss anything, but if I did, please do point me to the question.
In the meantime, though, I will describe what I am trying to do. A common notion I have observed in many posts is that semantic similarity is difficult. For instance, from this post, the accepted solution suggests the following:
First of all, neither from the perspective of computational
linguistics nor of theoretical linguistics is it clear what
the term 'semantic similarity' means exactly. ....
Consider these examples:
Pete and Rob have found a dog near the station.
Pete and Rob have never found a dog near the station.
Pete and Rob both like programming a lot.
Patricia found a dog near the station.
It was a dog who found Pete and Rob under the snow.
Which of the sentences 2-4 are similar to 1? 2 is the exact
opposite of 1, still it is about Pete and Rob (not) finding a
dog.
My high-level requirement is to use k-means clustering and categorize the text based on semantic similarity, so all I need to know is whether the sentences are an approximate match. For instance, in the above example, I am OK with classifying 1, 2, 4, 5 into one category and 3 into another (of course, 3 would be backed up by some more similar sentences). Something like "find related articles", but they don't have to be 100% related.
I am thinking I ultimately need to construct a vector representation of each sentence, sort of like its fingerprint, but exactly what this vector should contain is still an open question for me. Should it be n-grams, something from WordNet, just the individual stemmed words, or something else altogether?
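As one concrete baseline for such a fingerprint (my suggestion, not something prescribed in the post), each sentence can be turned into a TF-IDF bag-of-words vector; scikit-learn is an assumed dependency here:

```python
# Minimal sketch of a sentence "fingerprint": TF-IDF weighted bag-of-words.
# scikit-learn is an assumption of this sketch, not part of the original post.
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Pete and Rob have found a dog near the station.",
    "Pete and Rob have never found a dog near the station.",
    "Pete and Rob both like programming a lot.",
    "Patricia found a dog near the station.",
    "It was a dog who found Pete and Rob under the snow.",
]

# One sparse row vector per sentence; stop words dropped so the vectors
# are dominated by content words (Pete, Rob, dog, station, ...).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)
print(X.shape)  # (5, vocabulary size)
```

Note that such a representation happily puts sentences 1 and 2 close together (they share almost all content words), which matches the "approximate match" requirement above but ignores the negation.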
This thread did a fantastic job of enumerating all the related techniques, but unfortunately stopped just when it got to what I wanted. Any suggestions on the latest state of the art in this area?
Answer
Latent Semantic Modeling could be useful. It's basically just yet another application of the singular value decomposition (SVD). SVDLIBC is a pretty nice C implementation of this approach, an oldie but a goodie, and there are even Python bindings in the form of sparsesvd.
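To make the idea concrete, here is a sketch of the full pipeline: TF-IDF vectors, a truncated SVD to get low-rank "semantic" coordinates, then k-means on those coordinates. I am substituting scikit-learn's TruncatedSVD for SVDLIBC/sparsesvd purely for brevity; the answer itself only names the latter two:

```python
# Sketch of latent semantic analysis + k-means, per the question's requirement.
# TruncatedSVD stands in for SVDLIBC/sparsesvd here; that swap is mine.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

sentences = [
    "Pete and Rob have found a dog near the station.",
    "Pete and Rob have never found a dog near the station.",
    "Pete and Rob both like programming a lot.",
    "Patricia found a dog near the station.",
    "It was a dog who found Pete and Rob under the snow.",
]

# Sparse term-document matrix, then a rank-2 SVD projection.
X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Cluster the sentences in the reduced "semantic" space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
print(labels)
```

On real data you would use many more components (hundreds is typical) and pick the number of clusters to suit the corpus; with this toy input the hope is simply that the dog/station sentences land in one cluster and the programming sentence in the other.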