A Novel and Efficient Approach For Near Duplicate Page Detection in Web Crawling


Problem description

Can anyone help me with this?
I want to do this as an academic project.
Can someone give me an idea of how to approach it?
Abstract:
The rapid growth of the World Wide Web in recent times has given the concept of web crawling remarkable significance. The voluminous amounts of web documents swarming the web pose huge challenges to web search engines, making their results less relevant to users. The abundance of duplicate and near duplicate web documents creates additional overhead for search engines, critically affecting their performance and quality. The need to detect duplicate and near duplicate web pages has long been recognized in the web crawling research community. It is an important requirement for search engines to provide users with relevant results for their queries on the first page, without duplicate and redundant results. In this paper, we present a novel and efficient approach for the detection of near duplicate web pages in web crawling. Detection of near duplicate web pages is carried out before the crawled web pages are stored in repositories. First, keywords are extracted from the crawled pages, and the similarity score between two pages is calculated based on the extracted keywords. Documents with similarity scores greater than a threshold value are considered near duplicates. The detection has resulted in reduced memory for repositories and improved search engine quality.
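
The steps the abstract describes (extract keywords, score two pages against each other, compare the score to a threshold before storing) can be sketched roughly as below. This is only a minimal illustration, not the paper's method: the abstract does not say how keywords are chosen or how the score is computed, so the frequency-based keyword extraction, the Jaccard similarity, the stopword list, the 0.8 threshold, and all function names here are assumptions.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; a real crawler would use a fuller one.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "in", "to", "is", "for", "on", "with"})


def extract_keywords(text, top_n=20):
    """Return the top_n most frequent non-stopword terms as a set (assumed heuristic)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return {word for word, _ in counts.most_common(top_n)}


def similarity(keywords_a, keywords_b):
    """Jaccard similarity of two keyword sets (assumed measure, not from the paper)."""
    if not keywords_a and not keywords_b:
        return 1.0  # two empty pages are treated as identical
    return len(keywords_a & keywords_b) / len(keywords_a | keywords_b)


def is_near_duplicate(page_a, page_b, threshold=0.8):
    """Pages whose similarity score exceeds the threshold count as near duplicates."""
    return similarity(extract_keywords(page_a), extract_keywords(page_b)) > threshold


def crawl_filter(pages, threshold=0.8):
    """Keep a page only if it is not a near duplicate of any page already stored."""
    stored = []  # list of (page_text, keyword_set) pairs, standing in for the repository
    for page in pages:
        keywords = extract_keywords(page)
        if all(similarity(keywords, kept) <= threshold for _, kept in stored):
            stored.append((page, keywords))
    return [page for page, _ in stored]
```

Comparing every new page against every stored page is quadratic in the number of pages, which is fine for a class project but would need an index or hashing scheme at real crawl scale.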

Recommended answer

If you are waiting for permission, feel free to get started. I don't mind, even though I think it is a bit dull, and about as useful in the real world as a chocolate fire-guard.

If you are waiting for volunteers to write the code for you, then that would be cheating. And you wouldn't do that, would you?

