Best way to rank sentences based on similarity from a set of documents


Problem description

I want to know the best way to rank sentences based on similarity across a set of documents.

For example:

1. There are 5 documents.
2. Each document contains many sentences.
3. Take Document 1 as the primary document, i.e. the output will contain sentences from this document.
4. The output should be a list of sentences ranked so that the first-ranked sentence is the one most similar across all 5 documents, followed by the 2nd, the 3rd, and so on.

Recommended answer

I'll cover the basics of textual document matching...

Most document similarity measures work on a word basis rather than on sentence structure. The first step is usually stemming: words are reduced to their root form so that different forms of the same word, e.g. "swimming" and "swims", match.
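As a rough illustration of stemming, here is a toy suffix stripper. It is not the Porter algorithm or any library's stemmer; the suffix list and the `crude_stem` name are assumptions made up for this sketch, and it will mangle plenty of real words.

```python
import re

# Toy suffix list (an assumption for this sketch); real stemmers have many more rules.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip one common English suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant left behind: "swimm" -> "swim".
            if (suffix in ("ing", "ed") and len(stem) > 3
                    and stem[-1] == stem[-2] and stem[-1] not in "aeiou"):
                stem = stem[:-1]
            return stem
    return word

def stemmed_tokens(text):
    """Lowercase the text, split it into words, and stem each word."""
    return [crude_stem(w) for w in re.findall(r"[a-z']+", text.lower())]
```

In practice you would swap `crude_stem` for an established stemmer; the point is only that both forms end up as the same token before any counting happens.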

Additionally, you may wish to filter the words you match to avoid noise. In particular, you may wish to ignore occurrences of "the" and "a". In fact, there are many conjunctions and pronouns that you may wish to omit, so you will usually have a long list of such words. This is called a "stop list".

Furthermore, there may be bad words you wish to avoid matching, such as swear words or racial slurs. So you may have another exclusion list containing such words, a "bad list".
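Both exclusion lists can be applied in a single pass over the words. The sketch below assumes deliberately tiny lists; `STOP_LIST`, `BAD_LIST`, and `content_words` are hypothetical names for illustration, and the "bad" entry is just a placeholder.

```python
import re

# Illustrative lists only (assumptions); real ones are far longer.
STOP_LIST = {"the", "a", "an", "and", "or", "but", "he", "she", "it", "they"}
BAD_LIST = {"darn"}  # placeholder for words you never want to match on

def content_words(text, stop_list=STOP_LIST, bad_list=BAD_LIST):
    """Lowercase, split into words, and drop anything on either exclusion list."""
    words = re.findall(r"[a-z']+", text.lower())
    excluded = stop_list | bad_list
    return [w for w in words if w not in excluded]
```

Keeping the two lists separate makes it easy to maintain them independently, even though they are applied together.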

So now you can count similar words in documents. The question becomes how to measure total document similarity. You need to create a score function that takes the matching words as input and returns a "similarity" value. Such a function should give a high value if the same word appears multiple times in both documents. Additionally, matches are weighted by how frequent the word is overall, so that when uncommon words match, they are given more statistical weight.
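One common way to build such a score function is TF-IDF weighting with cosine similarity. The answer does not name a specific formula, so treat the following as a minimal sketch under that assumption; the smoothed IDF expression and all function names are choices made for this example.

```python
import math
from collections import Counter

def idf_weights(docs):
    """docs: list of token lists. Terms found in fewer documents get a larger weight."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (one of several common variants).
    return {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    """Weight each term count by its IDF; terms unknown to idf get weight 0."""
    counts = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Repeated words raise the term counts, and rare words carry a larger IDF, so both properties described above show up in the final score.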

Apache Lucene is an open-source search engine written in Java, and its documentation gives practical detail about these steps. For example, here is how it weights query similarity:

http://lucene.apache.org/java/2_9_0/api/all/org/apache/lucene/search/Similarity.html


Lucene combines Boolean model (BM) of Information Retrieval with Vector Space Model (VSM) of Information Retrieval - documents "approved" by BM are scored by VSM.
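The two-stage idea in that quote can be sketched in a few lines. This is not Lucene's code or API, only a plain-Python illustration: the Boolean stage here is assumed to be a simple OR query, the vector-space stage uses raw term counts, and all function names are hypothetical.

```python
import math
from collections import Counter

def boolean_match(query_terms, doc_terms):
    """Boolean stage (an OR query here): does the document contain any query term?"""
    return bool(set(query_terms) & set(doc_terms))

def vsm_score(query_terms, doc_terms):
    """Vector-space stage: cosine similarity between raw term-count vectors."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(c * c for c in q.values()))
            * math.sqrt(sum(c * c for c in d.values())))
    return dot / norm if norm else 0.0

def search(query_terms, docs):
    """Return (score, doc_index) pairs for approved documents, best first."""
    hits = [(vsm_score(query_terms, doc), i)
            for i, doc in enumerate(docs)
            if boolean_match(query_terms, doc)]
    return sorted(hits, key=lambda hit: -hit[0])
```

Documents that fail the Boolean test never reach the scoring step, which keeps the more expensive similarity computation limited to plausible candidates.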

All of this is really just about matching words in documents. You did specify matching sentences. For most people's purposes, matching words is more useful, because a huge variety of sentence structures can mean the same thing. The most useful similarity information is in the words themselves. I've talked about document matching, but for your purposes a sentence is just a very small document.
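Treating each sentence as a tiny document, the ranking asked for in the question could look roughly like this. Several things here are assumptions rather than part of the answer: the naive sentence splitter, plain word counts without stemming or stop lists, and the aggregation rule (best match in each other document, averaged over documents).

```python
import math
import re
from collections import Counter

def sentences(text):
    """Naive splitter on ., ! and ?; real text needs a proper sentence tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def bag(text):
    """Bag of lowercase words for one sentence."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(u, v):
    dot = sum(c * v[t] for t, c in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def rank_sentences(primary, others):
    """Score each primary sentence by its best match in every other document, averaged."""
    other_bags = [[bag(s) for s in sentences(doc)] for doc in others]
    scored = []
    for sent in sentences(primary):
        vec = bag(sent)
        best_per_doc = [max((cosine(vec, b) for b in bags), default=0.0)
                        for bags in other_bags]
        score = sum(best_per_doc) / len(best_per_doc) if best_per_doc else 0.0
        scored.append((score, sent))
    return sorted(scored, key=lambda pair: -pair[0])

ranked = rank_sentences(
    "Cats chase mice. Rain is wet.",
    ["Cats chase mice often.", "My cats chase mice."],
)
```

Other aggregation rules (sum, minimum across documents) are equally defensible; which one fits depends on what "most similar in all 5 documents" should mean for your data.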

Now, as an aside, if you don't care about the actual nouns and verbs in the sentence and only care about grammatical composition, you need a different approach...

First you need a link grammar parser to interpret the language and build a data structure (usually a tree) that represents the sentence. Then you have to perform inexact graph matching. This is a hard problem in general, but there are algorithms that do it on trees in polynomial time.
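To give a feel for polynomial-time inexact matching on trees, here is a simplified top-down tree edit distance. It is a restricted variant, not the general tree edit distance, and the answer does not say which algorithm it has in mind. The trees are hand-written `(label, children)` tuples standing in for parser output; no real parser is involved.

```python
def size(tree):
    """Number of nodes in a (label, children) tree."""
    label, children = tree
    return 1 + sum(size(child) for child in children)

def tree_distance(a, b):
    """Top-down edit distance: relabel the roots, then align the child sequences,
    where inserting or deleting a child costs its whole subtree size."""
    (label_a, kids_a), (label_b, kids_b) = a, b
    cost = 0 if label_a == label_b else 1
    m, n = len(kids_a), len(kids_b)
    # dp[i][j] = cheapest alignment of the first i children of a with the first j of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(kids_a[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(kids_b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(kids_a[i - 1]),   # delete a child of a
                dp[i][j - 1] + size(kids_b[j - 1]),   # insert a child of b
                dp[i - 1][j - 1] + tree_distance(kids_a[i - 1], kids_b[j - 1]),
            )
    return cost + dp[m][n]

# Two hypothetical parse trees that differ only by a determiner.
with_det = ("S", [("NP", [("Det", []), ("N", [])]), ("VP", [("V", [])])])
without_det = ("S", [("NP", [("N", [])]), ("VP", [("V", [])])])
```

A smaller distance means the two sentences have a more similar grammatical shape, regardless of which words fill the slots.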
