How to detect that two sentences are similar?

Problem description

I want to compute how similar two arbitrary sentences are to each other. For example:

  1. A mathematician found a solution to the problem.
  2. The problem was solved by a young mathematician.

I can use a tagger, a stemmer, and a parser, but I don’t know how to detect that these sentences are similar.

Solution

These two sentences are not just similar, they are almost paraphrases, i.e., two alternative ways of expressing the same meaning. It is also a very simple case of paraphrase, in which both utterances use largely the same words, the main difference being that one is in the active voice while the other is in the passive. (The two sentences are not exactly paraphrases because in the second sentence the mathematician is "young". This additional information makes the semantic relation between the two sentences asymmetric. In these cases, you would say that the second utterance "entails" the first one, or in other words that the first can be inferred from the second.)

From the example it is not possible to tell whether you are actually interested in paraphrase detection, in textual entailment, or in sentence similarity in general, which is an even broader and fuzzier problem. For example, is "people eat food" more similar to "people eat bread" or to "men eat food"?

Both paraphrase detection and text similarity are complex, open research problems in Natural Language Processing, with a large and active community of researchers working on them. It is not clear how deep your interest in this topic is, but consider that even though many brilliant researchers have spent, and still spend, their whole careers trying to crack these problems, we are still very far from finding sound solutions that just work in general.

Unless you are interested in a very superficial solution that would only work in specific cases and that would not capture syntactic alternation (as in this case), I would suggest that you look into the problem of text similarity in more depth. A good starting point would be the book "Foundations of Statistical Natural Language Processing", which provides a very well organised presentation of most statistical natural language processing topics. Once you have clarified your requirements (e.g., under what conditions is your method supposed to work? what levels of precision/recall are you after? what kind of phenomena can you safely ignore, and which ones do you need to account for?), you can start looking into specific approaches by diving into recent research work. Here, a good place to start would be the online archives of the Association for Computational Linguistics (ACL), which is the publisher of most research results in the field.

Just to give you something practical to work with, a very rough baseline for sentence similarity would be the cosine similarity between two binary vectors representing the sentences as bags of words. A bag of words is a very simplified representation of text, commonly used in information retrieval, in which you completely disregard syntax and represent a sentence only as a vector whose size is the size of the vocabulary (i.e., the number of words in the language) and whose component "i" has the value "1" if the word at position "i" in the vocabulary appears in the sentence, and "0" otherwise.
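
As a rough illustration of this baseline, here is a minimal Python sketch. It makes a few assumptions that are not part of the answer above: the `tokenize` helper and its regular expression are hypothetical stand-ins for a real tokenizer, no stemming or stop-word removal is applied, and the vocabulary is restricted to the words of the two sentences being compared (words absent from both would only add zeros to both vectors, so they do not change the cosine).

```python
import math
import re


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens.

    Hypothetical helper: a real system would use a proper tokenizer.
    """
    return re.findall(r"[a-z0-9']+", sentence.lower())


def binary_cosine_similarity(sentence_a, sentence_b):
    """Cosine similarity between binary bag-of-words vectors."""
    words_a = set(tokenize(sentence_a))
    words_b = set(tokenize(sentence_b))
    if not words_a or not words_b:
        return 0.0  # convention chosen here for empty input

    # Vocabulary limited to the two sentences (an assumption, see above).
    vocabulary = sorted(words_a | words_b)

    # Component i is 1 if vocabulary[i] occurs in the sentence, else 0.
    vector_a = [1 if word in words_a else 0 for word in vocabulary]
    vector_b = [1 if word in words_b else 0 for word in vocabulary]

    dot = sum(x * y for x, y in zip(vector_a, vector_b))
    squared_norm_a = sum(x * x for x in vector_a)
    squared_norm_b = sum(y * y for y in vector_b)
    return dot / math.sqrt(squared_norm_a * squared_norm_b)


s1 = "A mathematician found a solution to the problem."
s2 = "The problem was solved by a young mathematician."
print(round(binary_cosine_similarity(s1, s2), 2))  # 0.53
```

With this sketch, the two example sentences share only four distinct words ("a", "mathematician", "the", "problem") and score about 0.53, because "found a solution" and "was solved" do not match at the word level. Likewise, "people eat food" gets the same score (2/3) against both "people eat bread" and "men eat food", which shows how superficial the baseline is.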
