A very simple C++ web crawler/spider?


Question

    I am trying to do a very simple web crawler/spider app in C++. I have searched Google for a simple example to understand the concept, and I found this:

    http://www.example-code.com/vcpp/spider.asp

    But it's kinda complicated/hard for me to digest.

    What I am trying to do is just, for example:

    enter the url: www.example.com (I will use bash->wget to get the contents/source code)

    then it will look for, maybe, an "a href" link, and then store it in some data file.

    Any simple tutorial, or guidelines for me?

    I have just started learning C++ (1 month).

Solution

    All right, I'll try to point you in the right direction. Conceptually, a webcrawler is pretty simple. It revolves around a FIFO queue data structure which stores pending URLs. C++ has a built-in queue structure in the standard library, std::queue, which you can use to store URLs as strings.
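For example, the pending-URL queue might look like this (a minimal sketch; the URLs are just placeholders):

```cpp
#include <iostream>
#include <queue>
#include <string>

int main() {
    std::queue<std::string> pending;              // FIFO of URLs still to be crawled

    pending.push("http://www.example.com/");      // seed URL
    pending.push("http://www.example.com/about"); // a link discovered later

    while (!pending.empty()) {
        std::string url = pending.front();        // oldest URL comes out first
        pending.pop();
        std::cout << "would crawl: " << url << '\n';
    }
    return 0;
}
```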

    The basic algorithm is pretty straightforward (a rough C++ sketch of the loop follows the list):

    1. Begin with a base URL that you select, and place it on the top of your queue
    2. Pop the URL at the top of the queue and download it
    3. Parse the downloaded HTML file and extract all links
    4. Insert each extracted link into the queue
    5. Goto step 2, or stop once you reach some specified limit
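
Here is a rough sketch of that loop, under a few assumptions rather than a definitive implementation: it shells out to wget via popen (so it assumes a POSIX system with wget on the PATH, as the question suggests), the helper names download and extract_links are my own, and link extraction is the naive href="..." string search discussed in the next paragraph.

```cpp
#include <cstdio>
#include <queue>
#include <string>
#include <vector>

// Fetch a page by shelling out to wget (as suggested in the question) and
// return the raw HTML. Toy approach only: assumes a POSIX system with wget
// on the PATH, does no error handling, and pastes the URL into a shell command.
std::string download(const std::string& url) {
    std::string cmd = "wget -q -O - \"" + url + "\"";
    std::string html;
    if (FILE* pipe = popen(cmd.c_str(), "r")) {
        char buf[4096];
        std::size_t n;
        while ((n = fread(buf, 1, sizeof buf, pipe)) > 0)
            html.append(buf, n);
        pclose(pipe);
    }
    return html;
}

// Very naive link extraction: scan for href="..." (see the caveats below).
std::vector<std::string> extract_links(const std::string& html) {
    std::vector<std::string> links;
    const std::string marker = "href=\"";
    for (std::string::size_type pos = html.find(marker);
         pos != std::string::npos;
         pos = html.find(marker, pos + marker.size())) {
        std::string::size_type start = pos + marker.size();
        std::string::size_type end = html.find('"', start);
        if (end == std::string::npos) break;
        links.push_back(html.substr(start, end - start));
    }
    return links;
}

int main() {
    std::queue<std::string> pending;
    pending.push("http://www.example.com/");      // step 1: seed the queue

    const std::size_t limit = 20;                 // step 5: stop after N pages
    std::size_t crawled = 0;

    while (!pending.empty() && crawled < limit) {
        std::string url = pending.front();        // step 2: pop the next URL...
        pending.pop();
        std::string html = download(url);         // ...and download it
        ++crawled;

        for (const std::string& link : extract_links(html))  // step 3: extract links
            pending.push(link);                   // step 4: enqueue each one
    }
    return 0;
}
```

Note that most links in real pages are relative (e.g. /about or page.html), so before enqueueing them you would also need to resolve them against the URL of the page they came from; the sketch skips that, along with all error handling.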

    Now, I said that a webcrawler is conceptually simple, but implementing it is not so simple. As you can see from the above algorithm, you'll need: an HTTP networking library to allow you to download URLs, and a good HTML parser that will let you extract links. You mentioned you could use wget to download pages. That simplifies things somewhat, but you still need to actually parse the downloaded HTML docs. Parsing HTML correctly is a non-trivial task. A simple string search for <a href= will only work sometimes. However, if this is just a toy program that you're using to familiarize yourself with C++, a simple string search may suffice for your purposes. Otherwise, you need to use a serious HTML parsing library.
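To make the "only work sometimes" point concrete, here is that naive string search run on a small, made-up HTML fragment: it finds the double-quoted lowercase href but silently misses the single-quoted and uppercase variants (and it would just as happily pick up links inside comments or scripts).

```cpp
#include <iostream>
#include <string>

int main() {
    // Hypothetical HTML fragment: only the first anchor matches the naive search.
    const std::string html =
        "<a href=\"http://www.example.com/found\">double quotes, lowercase</a>\n"
        "<a href='http://www.example.com/missed'>single quotes</a>\n"
        "<A HREF=\"http://www.example.com/also-missed\">uppercase</A>\n";

    const std::string marker = "<a href=\"";
    for (std::string::size_type pos = html.find(marker);
         pos != std::string::npos;
         pos = html.find(marker, pos + marker.size())) {
        std::string::size_type start = pos + marker.size();
        std::string::size_type end = html.find('"', start);
        if (end == std::string::npos) break;
        std::cout << html.substr(start, end - start) << '\n';  // prints only .../found
    }
    return 0;
}
```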

    There are also other considerations you need to take into account when writing a webcrawler, such as politeness. People will be pissed and possibly ban your IP if you attempt to download too many pages, too quickly, from the same host. So you may need to implement some sort of policy where your webcrawler waits for a short period before downloading each site. You also need some mechanism to avoid downloading the same URL again, obey the robots exclusion protocol, avoid crawler traps, etc... All these details add up to make actually implementing a robust webcrawler not such a simple thing.
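As a rough illustration of two of those points, a visited set and a fixed delay between requests could be bolted onto the earlier loop like this (a sketch only: download and extract_links are stubbed out, the one-second delay is an arbitrary choice, and robots.txt, per-host rate limiting and crawler traps are still not handled):

```cpp
#include <chrono>
#include <queue>
#include <string>
#include <thread>
#include <unordered_set>
#include <vector>

// Stubs standing in for the download/extract_links helpers sketched earlier.
std::string download(const std::string&) { return ""; }
std::vector<std::string> extract_links(const std::string&) { return {}; }

int main() {
    std::queue<std::string> pending;
    std::unordered_set<std::string> visited;      // URLs already fetched or queued

    pending.push("http://www.example.com/");
    visited.insert("http://www.example.com/");

    while (!pending.empty()) {
        std::string url = pending.front();
        pending.pop();

        std::string html = download(url);
        for (const std::string& link : extract_links(html)) {
            // insert().second is false when the URL was seen before,
            // so each page is enqueued (and later downloaded) at most once.
            if (visited.insert(link).second)
                pending.push(link);
        }

        // Crude politeness: pause between requests so no host is hammered.
        // A real crawler would track delays per host and honour robots.txt.
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return 0;
}
```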

    That said, I agree with larsmans in the comments. A webcrawler isn't the greatest way to learn C++. Also, C++ isn't the greatest language to write a webcrawler in. The raw-performance and low-level access you get in C++ is useless when writing a program like a webcrawler, which spends most of its time waiting for URLs to resolve and download. A higher-level scripting language like Python or something is better suited for this task, in my opinion.
