C#, Data Structure algorithm


Problem description

Hello guys! I am going to write a website crawler that starts from a root address and then crawls every link it finds (internal links only). The problem I face is this: the crawler must start from the root, parse the root page, and collect all of its links. While following links, it should not crawl the same page twice. Is there a good data structure for this, or do I need to use SQL or some other indexing structure?

Solution

The simplest indexing structure you can use is a dictionary, System.Collections.Generic.Dictionary. If you use the URI as the key, you will never process the same URI twice.
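A minimal sketch of that idea, assuming the crawler keeps a queue of pages still to fetch next to the dictionary of pages already seen. The class name VisitedPages, its members, and the choice of storing the crawl depth as the dictionary value are all hypothetical, made up for illustration; they are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not from the original answer): remembers every URI
// that was ever offered and hands out each one for crawling at most once.
public class VisitedPages
{
    // Key: the URI string. Value: the depth at which the page was first
    // found (an illustrative choice; any per-page data could go here).
    private readonly Dictionary<string, int> seen = new Dictionary<string, int>();

    // URIs that were accepted but not crawled yet.
    private readonly Queue<string> pending = new Queue<string>();

    // Returns true and queues the URI only the first time it is offered.
    public bool TryEnqueue(string uri, int depth)
    {
        if (seen.ContainsKey(uri))
        {
            return false; // already known, do not crawl it again
        }
        seen.Add(uri, depth);
        pending.Enqueue(uri);
        return true;
    }

    public bool HasPending
    {
        get { return pending.Count > 0; }
    }

    // Call only while HasPending is true.
    public string Dequeue()
    {
        return pending.Dequeue();
    }

    public int SeenCount
    {
        get { return seen.Count; }
    }
}
```

The crawl loop would then dequeue a URI, download and parse the page, and call TryEnqueue for each internal link found. If no per-page value is needed, System.Collections.Generic.HashSet<string> does the same job with less ceremony.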

There are problems, though. First, the same URI can be written in several different forms, for example "http://www.codeproject.com", "http://www.codeproject.com" and "http://www.CodeProject.com". You will need to define one unified representation and reduce every URI to it before using it as a key. This is not so simple. Take the casing. The domain name is always case-insensitive, but the part of the URI after the domain name depends on the server. On Windows-based HTTP servers that part is typically case-insensitive, so URIs that differ only in case refer to the same page. On Unix-like systems it is case-sensitive. If you reduce this part of the URI to a single case there, you can end up with a link that does not work.
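One possible way to build such a unified representation is sketched below, using System.Uri. The class name UriNormalizer and the particular rules are assumptions for illustration only: the host is lower-cased (System.Uri does this when it parses the string), the fragment is dropped, and the casing of the path is deliberately left alone for the reason given above.

```csharp
using System;

// Hypothetical helper (not from the original answer): reduces a URI to the
// string form that is used as the dictionary key.
public static class UriNormalizer
{
    public static string Normalize(string raw)
    {
        // Parsing canonicalizes the scheme and the host name to lower case.
        // Throws UriFormatException if raw is not a valid absolute URI.
        var uri = new Uri(raw, UriKind.Absolute);

        // Keep scheme, authority, path and query; drop the "#fragment",
        // which only addresses a position inside the same page.
        // The path keeps its original casing on purpose.
        return uri.GetLeftPart(UriPartial.Query);
    }
}
```

With this sketch, "http://www.CodeProject.com" and "http://www.codeproject.com/" produce the same key, while two paths that differ only in case still count as different pages. Whether to merge those as well is a per-site decision.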

And, finally, the same page can have several different URIs. I don't think you can do anything about that, because it depends on the site internals. Your crawler will simply treat them as different pages.

—SA

