Given billions of URLs, how to determine duplicate content


Problem description

I was asked this question in a programming interview. I have described the question in detail below. It was an open-ended question.

Given billions of URLs(deep links), how do I classify that which URLs point to the duplicate content. The question was further extended to finding out that in cases of duplicate pages, which of them was authentic. This was the first part. My approach (with valid assumptions) was to classify them on the basis of domains and then match the contents of URLs in the same bucket.
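The domain-bucketing approach described above can be sketched as follows. This is a minimal illustration, not the interview answer itself: the function names are hypothetical, and the SHA-256 content hash only catches exact duplicates (near-duplicate detection needs the sequence-based methods discussed in the answer below).

```python
from collections import defaultdict
from hashlib import sha256
from urllib.parse import urlparse

def find_duplicates(pages):
    """pages: iterable of (url, html_content) pairs.

    Groups URLs by domain, then flags pages within the same
    bucket whose content hashes match (exact duplicates only).
    """
    buckets = defaultdict(dict)  # domain -> {content_hash: first_url_seen}
    duplicates = []
    for url, content in pages:
        domain = urlparse(url).netloc
        digest = sha256(content.encode("utf-8")).hexdigest()
        if digest in buckets[domain]:
            # Same domain, same content hash: report as a duplicate pair.
            duplicates.append((url, buckets[domain][digest]))
        else:
            buckets[domain][digest] = url
    return duplicates

# Illustrative data: the b.com copy is NOT flagged, because this
# sketch only compares pages inside the same domain bucket.
pages = [
    ("https://a.com/x", "<p>Brad Pitt bio</p>"),
    ("https://a.com/y", "<p>Brad Pitt bio</p>"),
    ("https://b.com/z", "<p>Brad Pitt bio</p>"),
]
print(find_duplicates(pages))
```

Bucketing first keeps the pairwise comparisons tractable at billions of URLs, at the cost of missing cross-domain duplicates, which is one of the trade-offs the interviewer can probe.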

In the second part, the interviewer narrowed down the question stating that: Given just two URLs, URL1 is a wiki page about a celebrity, (eg: Brad Pitt) and URL2 contains information about many celebrities including Brad Pitt. How do we identify which one is authentic and which is duplicate ? My answer was based on comparing the two pages on the basis of their citations.

The interviewer asked me to build the answer from scratch, and wanted me to assume that we don't have any prior information about duplicate content on the URLs. Since its an open-ended question, any lead would prove helpful.

Answer

You might find this paper to be helpful: "Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms" by Monika Henzinger at Google, as this problem has attracted a fair amount of research. From the paper:

A naive solution is to compare all pairs of documents. Since this is prohibitively expensive on large datasets, Manber [11] and Heintze [9] proposed the first algorithms for detecting near-duplicate documents with a reduced number of comparisons. Both algorithms work on sequences of adjacent characters. Brin et al. [1] started to use word sequences to detect copyright violations. Shivakumar and Garcia-Molina [13, 14] continued this research and focused on scaling it up to multi-gigabyte databases [15]. Broder et al. [3] also used word sequences to efficiently find near-duplicate web pages. Later, Charikar [4] developed an approach based on random projections of the words in a document. Recently Hoad and Zobel [10] developed and compared methods for identifying versioned and plagiarised documents.

In other words, it's a complex problem with a variety of solutions of varying success, and not something with a 'right' answer. Most of the answers involve checking word or character sequences.
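To make the word-sequence idea concrete, here is a minimal sketch of shingling with Jaccard similarity, the building block behind several of the surveyed algorithms. The 3-word shingle size is an illustrative assumption, and production systems (e.g. MinHash or simhash) would compare compact sketches rather than full shingle sets:

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word sequences)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two near-duplicate snippets share most of their word sequences,
# so their Jaccard similarity is high without being exactly 1.0.
doc1 = "brad pitt is an american actor and film producer"
doc2 = "brad pitt is an american actor and producer of films"
sim = jaccard(shingles(doc1), shingles(doc2))
print(sim)
```

A threshold on this similarity (chosen empirically) then decides whether two pages count as near-duplicates; at web scale the shingle sets are hashed and sampled so that only candidate pairs are ever compared.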
