How to detect duplicates among text documents and return the duplicates' similarity?

Question

I'm writing a crawler to get content from a website, but the content can be duplicated and I want to avoid that. So I need a function that returns the percentage of text two documents have in common, so that I can detect content that may be duplicated. Example:

  • Text 1: "I'm write a crawler to"
  • Text 2: "I'm write a some text crawler to get"

The compare function should report that text 2 matches text 1 by 5/8, where 5 is the number of words in text 2 that also appear in text 1 (compared in word order) and 8 is the total number of words in text 2. If "some text" is removed, text 2 is the same as text 1 (this is the situation I need to detect). How can I do that?

Recommended answer

You are facing a problem known in the field of Information Retrieval as near-duplicate detection.

One known solution is to use Jaccard similarity to measure how similar two documents are.

Jaccard similarity works as follows: take the set of words from each document, call these sets s1 and s2, and the Jaccard similarity is |s1 [intersection] s2| / |s1 [union] s2|.
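The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `jaccard`, the whitespace tokenization, and the choice to treat two empty sets as identical are assumptions, not part of the answer.

```python
def jaccard(s1: set, s2: set) -> float:
    """Return |s1 intersection s2| / |s1 union s2|."""
    union = s1 | s2
    if not union:
        # Assumption: two empty documents are considered identical.
        return 1.0
    return len(s1 & s2) / len(union)


text1 = "I'm write a crawler to"
text2 = "I'm write a some text crawler to get"

# Split on whitespace to get the set of words of each document.
words1 = set(text1.split())
words2 = set(text2.split())

print(jaccard(words1, words2))  # 0.625, i.e. 5/8
```

Real crawled pages would likely need normalization first (lowercasing, stripping punctuation and markup), which this sketch leaves out.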

When dealing with near duplicates, however, the order of words usually has some importance. To account for it, when generating the sets s1 and s2 you generate sets of k-shingles (sequences of k consecutive words) instead of sets of single words.
In your example, with k=2, the sets will be:

s1 = { I'm write, write a, a crawler, crawler to }
s2 = { I'm write, write a, a some, some text, text crawler, crawler to, to get }
s1 [union] s2 = { I'm write, write a, a crawler, crawler to, a some, some text, text crawler, to get } 
s1 [intersection] s2 = { I'm write, write a, crawler to }

In the above, the Jaccard similarity is 3/8. If you apply the same approach to single words (k=1 shingles), you get your desired 5/8, but in my opinion (and that of most IR experts) this is a worse solution.
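The worked example can be reproduced with a short Python sketch. The helper names `shingles` and `jaccard` and the whitespace tokenization are assumptions made for illustration.

```python
def shingles(text: str, k: int) -> set:
    """Return the set of k-shingles (tuples of k consecutive words) of text."""
    words = text.split()
    # A text with fewer than k words yields an empty set.
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(s1: set, s2: set) -> float:
    """Return |s1 intersection s2| / |s1 union s2|."""
    union = s1 | s2
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(s1 & s2) / len(union)


text1 = "I'm write a crawler to"
text2 = "I'm write a some text crawler to get"

# k=2: 3 shared shingles out of 8 distinct ones.
print(jaccard(shingles(text1, 2), shingles(text2, 2)))  # 0.375, i.e. 3/8

# k=1: single words, 5 shared out of 8 distinct ones.
print(jaccard(shingles(text1, 1), shingles(text2, 1)))  # 0.625, i.e. 5/8
```

Larger values of k make the comparison more sensitive to word order, at the cost of lower scores for texts that differ by small insertions.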

This procedure scales well to huge collections: it can be made very efficient without checking all pairs of documents or creating huge numbers of sets. More details can be found in these lecture notes (I gave this lecture a few months ago, based on the author's notes).
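The answer does not say which technique the lecture notes describe. One commonly used way to avoid keeping full shingle sets is MinHash, where each document is reduced to a short signature and the fraction of matching signature positions estimates the Jaccard similarity. The sketch below is an assumption about how that could look, not the answer's method; `minhash_signature`, `estimate_jaccard`, the SHA-1 based hash family, and `num_hashes=64` are all illustrative choices.

```python
import hashlib


def shingles(text: str, k: int) -> set:
    """Return the set of k-shingles of text, each joined into one string."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set, num_hashes: int = 64) -> list:
    """Return a MinHash signature: one minimum hash value per seeded hash function.

    Assumes items is non-empty; min() raises ValueError on an empty set.
    """
    signature = []
    for seed in range(num_hashes):
        # Prefixing the seed simulates a family of different hash functions.
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big"
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig1: list, sig2: list) -> float:
    """Estimate Jaccard similarity as the fraction of matching positions."""
    matches = sum(1 for a, b in zip(sig1, sig2) if a == b)
    return matches / len(sig1)


sig1 = minhash_signature(shingles("I'm write a crawler to", 2))
sig2 = minhash_signature(shingles("I'm write a some text crawler to get", 2))

# An estimate of the exact value 3/8; it varies with the hash functions used.
print(estimate_jaccard(sig1, sig2))
```

Signatures have a fixed length regardless of document size, so they are cheap to store and compare. Avoiding the comparison of all pairs would need an additional indexing step (for example, locality-sensitive hashing over the signatures), which is not shown here.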
