How to detect duplicates among text documents and return the duplicates' similarity?

Question

I'm writing a crawler to fetch content from some websites, but the content can be duplicated and I want to avoid that. So I need a function that returns the percentage of overlap between two texts, so I can detect that two pieces of content may be duplicates. Example:

  • Text 1: I'm write a crawler to
  • Text 2: I'm write a some text crawler to get

The compare function should report that text 2 matches text 1 with a score of 5/8, where 5 is the number of words in text 2 that also appear in text 1 (compared in word order) and 8 is the total number of words in text 2. If "some text" is removed, text 2 is essentially the same as text 1, and that is the situation I need to detect. How can I do that?

Recommended answer

You are facing a problem known in the field of Information Retrieval as Near Duplicates Detection.

One known solution is to use Jaccard similarity to measure how similar two documents are.

Jaccard similarity works as follows: take the set of words from each document, call these sets s1 and s2, and the Jaccard similarity is |s1 ∩ s2| / |s1 ∪ s2|, that is, the size of the intersection divided by the size of the union.

When dealing with near duplicates, however, the order of words usually matters to some degree. To account for it, build s1 and s2 as sets of k-shingles (runs of k consecutive words) instead of sets of individual words.
In your example, with k=2, the sets will be:

s1 = { I'm write, write a, a crawler, crawler to }
s2 = { I'm write, write a, a some, some text, text crawler, crawler to, to get }
s1 ∪ s2 = { I'm write, write a, a crawler, crawler to, a some, some text, text crawler, to get }
s1 ∩ s2 = { I'm write, write a, crawler to }

For the sets above, the Jaccard similarity is 3/8. If you apply the same approach to single words (k=1 shingles), you get your desired 5/8, but in my opinion (and that of most IR experts) that is a worse solution.
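
The computation above can be sketched in a few lines of Python. This is a minimal illustration, not code from the answer: the function names `shingles` and `jaccard` are made up for this example, tokenization is a plain whitespace split, and the two example texts are reconstructed from the shingle sets listed above.

```python
def shingles(text, k):
    """Return the set of k-shingles (runs of k consecutive words) in text."""
    words = text.split()  # assumption: whitespace tokenization is good enough
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b|; defined here as 1.0 for two empty sets."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Example texts reconstructed from the shingle sets in the answer.
text1 = "I'm write a crawler to"
text2 = "I'm write a some text crawler to get"

for k in (1, 2):
    s1, s2 = shingles(text1, k), shingles(text2, k)
    print(f"k={k}: {len(s1 & s2)}/{len(s1 | s2)} = {jaccard(s1, s2)}")
# k=1: 5/8 = 0.625
# k=2: 3/8 = 0.375
```

A crawler could treat a page as a near duplicate when this score exceeds some threshold; the right threshold and the right k depend on the content and have to be tuned.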

This procedure scales well and can handle huge collections very efficiently, without checking all pairs and without creating huge numbers of sets. More details can be found in these lecture notes: http://webcourse.cs.technion.ac.il/236375/Winter2013-2014/ho/WCFiles/tutorial_8_near_duplicates_detection.pdf (I gave this lecture a few months ago, based on the author's notes).
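
The answer does not name the technique behind this scaling claim. One common approach, assumed here rather than taken from the answer, is MinHash: each shingle set is reduced to a short fixed-length signature, and the fraction of positions where two signatures agree estimates the Jaccard similarity. The sketch below is only an illustration; the names `minhash_signature` and `estimate_jaccard` and the choice of 128 hash functions are hypothetical.

```python
import hashlib


def minhash_signature(items, num_hashes=128):
    """Build a MinHash signature for a non-empty set of strings.

    For each of num_hashes salted hash functions, keep the smallest
    64-bit hash value seen over all items.
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# The k=2 shingle sets from the example in the answer (exact Jaccard: 3/8).
s1 = {"I'm write", "write a", "a crawler", "crawler to"}
s2 = {"I'm write", "write a", "a some", "some text",
      "text crawler", "crawler to", "to get"}

estimate = estimate_jaccard(minhash_signature(s1), minhash_signature(s2))
print(estimate)  # an approximation of 3/8, not the exact value
```

Signatures alone still require comparing pairs. Avoiding the all-pairs comparison typically also needs an indexing step such as locality-sensitive hashing over the signatures, which is not shown here.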
