A Novel and Efficient Approach for Near Duplicate Page Detection in Web Crawling

Views: 242 Published: 2019/6/21 16:53:36 Tags: C# ASP.NET .NET Internet web-dev


Problem description

Can anyone help me with this?
I want to do this as an academic project.
Can someone give me an idea of how to do this?
Abstract:
The rapid growth of the World Wide Web in recent times has given Web crawling remarkable significance. The voluminous amounts of web documents on the web pose huge challenges to web search engines, making their results less relevant to users. The abundance of duplicate and near duplicate web documents creates additional overhead for search engines, critically affecting their performance and quality. The need to detect duplicate and near duplicate web pages has long been recognized in the web crawling research community. It is an important requirement for search engines to provide users with relevant results for their queries on the first page, without duplicate and redundant results. In this paper, we present a novel and efficient approach for the detection of near duplicate web pages in web crawling. Detection of near duplicate web pages is carried out before the crawled web pages are stored in repositories. First, keywords are extracted from the crawled pages, and the similarity score between two pages is calculated based on the extracted keywords. Documents with similarity scores greater than a threshold value are considered near duplicates. The detection results in reduced memory for repositories and improved search engine quality.
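
As a starting point, the pipeline the abstract describes (extract keywords, score two pages, compare against a threshold before storing) could be sketched roughly as below. This is only a minimal sketch, not the paper's actual method: the abstract does not say how keywords are extracted or how the score is computed, so the stop-word list, the use of Jaccard similarity over keyword sets, the `Repository` class, and the 0.8 threshold are all assumptions made for illustration. It is written in Python for brevity; the same structure should carry over to C#.

```python
import re

# Hypothetical stop-word list; the abstract does not say how keywords are chosen.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with"}


def extract_keywords(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def similarity(keywords_a, keywords_b):
    """Jaccard similarity (an assumed measure): shared keywords / all distinct keywords."""
    if not keywords_a and not keywords_b:
        return 1.0  # two empty pages are treated as identical
    return len(keywords_a & keywords_b) / len(keywords_a | keywords_b)


class Repository:
    """Hypothetical in-memory page store that rejects near duplicates."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed value; tune on real data
        self.pages = []  # list of (url, keyword set)

    def add_if_new(self, url, text):
        """Store the page unless it is a near duplicate; return True if stored."""
        keywords = extract_keywords(text)
        for _, stored_keywords in self.pages:
            if similarity(keywords, stored_keywords) > self.threshold:
                return False  # near duplicate of a stored page, skip it
        self.pages.append((url, keywords))
        return True


repo = Repository(threshold=0.8)
repo.add_if_new("page1", "The quick brown fox jumps over the lazy dog")
repo.add_if_new("page2", "A quick brown fox jumps over the lazy dog today")
repo.add_if_new("page3", "Stock markets closed higher on Friday")
```

Each new page is compared against every stored page, so the cost grows linearly with the repository size. For a large crawl, a fingerprinting scheme such as MinHash or SimHash is the usual way to avoid the pairwise scan, but that goes beyond what the abstract describes.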
