How do web spiders differ from Wget's spider?

Question

The following sentence in Wget's manual caught my eye:

wget --spider --force-html -i bookmarks.html

This feature needs much more work for Wget to get close to the functionality of real web spiders.

I found the following lines of code relevant to the spider option in wget:

src/ftp.c
780:      /* If we're in spider mode, don't really retrieve anything.  The
784:      if (opt.spider)
889:  if (!(cmd & (DO_LIST | DO_RETR)) || (opt.spider && !(cmd & DO_LIST)))
1227:      if (!opt.spider)
1239:      if (!opt.spider)
1268:      else if (!opt.spider)
1827:          if (opt.htmlify && !opt.spider)

src/http.c
64:#include "spider.h"
2405:  /* Skip preliminary HEAD request if we're not in spider mode AND
2407:  if (!opt.spider
2428:      if (opt.spider && !got_head)
2456:      /* Default document type is empty.  However, if spider mode is
2570:           * spider mode.  */
2571:          else if (opt.spider)
2661:              if (opt.spider)

src/res.c
543:  int saved_sp_val = opt.spider;
548:  opt.spider       = false;
551:  opt.spider       = saved_sp_val;  

src/spider.c
1:/* Keep track of visited URLs in spider mode.
37:#include "spider.h"
49:spider_cleanup (void)

src/spider.h
1:/* Declarations for spider.c

src/recur.c
52:#include "spider.h"
279:      if (opt.spider)
366:              || opt.spider /* opt.recursive is implicitely true */
370:             (otherwise unneeded because of --spider or rejected by -R) 
375:                   (opt.spider ? "--spider" : 
378:                     (opt.delete_after || opt.spider
440:      if (opt.spider) 

src/options.h
62:  bool spider;           /* Is Wget in spider mode? */

src/init.c
238:  { "spider",           &opt.spider,            cmd_boolean },

src/main.c
56:#include "spider.h"
238:    { "spider", 0, OPT_BOOLEAN, "spider", -1 },
435:       --spider                  don't download anything.\n"),
1045:  if (opt.recursive && opt.spider)

I would like to see the differences in code, not abstractly. I love code examples.

How do web spiders differ from Wget's spider, in terms of code?

Answer

A real spider is a lot of work

Writing a spider for the whole WWW is quite a task; you have to take care of many "little details" such as:

  • Each spider computer should receive data from a few thousand servers in parallel in order to make efficient use of the connection bandwidth (asynchronous socket I/O).
  • You need several computers that spider in parallel in order to cover the vast amount of information on the WWW (clustering; partitioning the work).
  • You need to be polite to the spidered web sites (see the sketch after this list):
    • Respect the robots.txt files.
    • Don't fetch a lot of information too quickly: this overloads the servers.
    • Don't fetch files that you really don't need (e.g. ISO disk images; tgz packages for software download...).
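
To make the politeness rules above concrete, here is a minimal sketch (not wget's code; the class name, the two-second delay, the user-agent string, and the list of skipped file suffixes are my own assumptions) of checking robots.txt, throttling per host, and skipping large binaries:

import time
import urllib.robotparser
from urllib.parse import urlparse

class PoliteFetcher:
    def __init__(self, user_agent="toy-spider/0.1", min_delay=2.0):
        self.user_agent = user_agent
        self.min_delay = min_delay          # seconds between two hits on the same host
        self._robots = {}                   # host -> parsed robots.txt
        self._last_hit = {}                 # host -> timestamp of last request
        self.skip_suffixes = (".iso", ".tgz", ".tar.gz", ".zip")

    def allowed(self, url):
        host = urlparse(url).netloc
        if host not in self._robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"http://{host}/robots.txt")
            rp.read()                       # fetch and parse robots.txt once per host
            self._robots[host] = rp
        if not self._robots[host].can_fetch(self.user_agent, url):
            return False                    # disallowed by robots.txt
        if url.lower().endswith(self.skip_suffixes):
            return False                    # big binary we don't need
        return True

    def wait_turn(self, url):
        host = urlparse(url).netloc
        elapsed = time.time() - self._last_hit.get(host, 0.0)
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)   # don't hammer a single server
        self._last_hit[host] = time.time()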

This is a lot of work. But if your target is more modest than reading the whole WWW, you may skip some of the parts. If you just want to download a copy of a wiki etc., you get down to the specs of wget.

Note: If you don't believe that it's so much work, you may want to read up on how Google re-invented most of the computing wheels (on top of the basic Linux kernel) to build good spiders. Even if you cut a lot of corners, it's a lot of work.

Let me add a few more technical remarks on three points.

Parallel connections / asynchronous socket communication

You could run several spider programs in parallel processes or threads. But you need about 5000-10000 parallel connections in order to make good use of your network connection, and that many parallel processes or threads produces too much overhead.

A better solution is asynchronous input/output: process about 1000 parallel connections in a single thread by opening the sockets in non-blocking mode and using epoll or select to handle just those connections that have received data. Since kernel 2.4, Linux has had excellent support for scalability (I also recommend that you study memory-mapped files), continuously improved in later versions.
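
To illustrate the idea, here is a hedged sketch (not wget's code; it uses Python's standard selectors module, which wraps epoll/select, plain HTTP/1.0 requests, and a toy five-second timeout; error handling and DNS are simplified, and all names are my own assumptions):

import selectors
import socket

def fetch_all(hosts, path="/"):
    sel = selectors.DefaultSelector()          # epoll on Linux, select elsewhere
    responses = {}                             # socket -> bytes received so far

    for host in hosts:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setblocking(False)
        sock.connect_ex((host, 80))            # non-blocking connect returns immediately
        request = (f"GET {path} HTTP/1.0\r\nHost: {host}\r\n"
                   "Connection: close\r\n\r\n").encode("ascii")
        sel.register(sock, selectors.EVENT_WRITE, {"host": host, "request": request})
        responses[sock] = b""

    while responses:
        ready = sel.select(timeout=5)
        if not ready:                          # nothing moved for 5 s: give up (toy timeout)
            break
        for key, events in ready:
            sock, data = key.fileobj, key.data
            if events & selectors.EVENT_WRITE:
                sock.send(data["request"])     # small request fits in one send here
                sel.modify(sock, selectors.EVENT_READ, data)
            elif events & selectors.EVENT_READ:
                chunk = sock.recv(65536)
                if chunk:
                    responses[sock] += chunk
                else:                          # server closed: HTTP/1.0 response is complete
                    sel.unregister(sock)
                    raw = responses.pop(sock)
                    sock.close()
                    yield data["host"], raw

# Usage: for host, raw in fetch_all(["example.com", "example.org"]):
#            print(host, raw.split(b"\r\n", 1)[0])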

Note: Using asynchronous I/O helps much more than using a "fast language": it's better to write one epoll-driven process handling 1000 connections in Perl than to run 1000 processes written in C. If you do it right, you can saturate a 100 Mb connection with processes written in Perl.

From the original answer: The downside of this approach is that you will have to implement the HTTP specification yourself in an asynchronous form (I am not aware of a reusable library that does this). It's much easier to do this with the simpler HTTP/1.0 protocol than with the modern HTTP/1.1 protocol. You probably would not benefit from the advantages of HTTP/1.1 anyhow, so this may be a good place to save some development costs.
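
As a small sketch of the framing difference (the function name is my own): with HTTP/1.0 and "Connection: close", the body simply ends when the server closes the socket, so no chunked-transfer decoding (an HTTP/1.1 feature) is needed:

def parse_http10_response(raw: bytes) -> tuple[int, dict, bytes]:
    # Headers end at the first blank line; everything after it, up to EOF, is the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.split(b"\r\n")
    status = int(status_line.split(b" ", 2)[1])    # b"HTTP/1.0 200 OK" -> 200
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()
    return status, headers, body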

Edit five years later: Today, there is a lot of free/open-source technology available to help you with this work. I personally like the asynchronous HTTP implementation of node.js: it saves you all the work mentioned in the original paragraph above. Of course, today there are also a lot of modules readily available for the other components that you need in your spider. Note, however, that the quality of third-party modules may vary considerably; you have to check out whatever you use. [Aging info:] Recently, I wrote a spider using node.js and found the reliability of npm modules for HTML processing (link and data extraction) insufficient, so I "outsourced" that processing to a process written in another programming language. But things are changing quickly, and by the time you read this comment, this problem may already be a thing of the past...

Partition the work over several servers

One computer can't keep up with spidering the whole WWW. You need to distribute your work over several servers and exchange information between them. I suggest assigning certain "ranges of domain names" to each server: keep a central database of domain names, each with a reference to a spider computer.
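
A minimal sketch of such a central assignment, assuming a fixed list of spider machines and a stable hash as the fallback for domains that have not been assigned yet (both are my own assumptions, not part of the answer):

import hashlib

SPIDER_MACHINES = ["spider-01", "spider-02", "spider-03"]   # hypothetical hosts
ASSIGNMENTS = {}   # central database: domain -> machine, filled in as domains appear

def machine_for(domain: str) -> str:
    if domain not in ASSIGNMENTS:
        # Stable hash so the same domain always lands on the same machine.
        digest = hashlib.sha1(domain.encode("utf-8")).digest()
        ASSIGNMENTS[domain] = SPIDER_MACHINES[digest[0] % len(SPIDER_MACHINES)]
    return ASSIGNMENTS[domain]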

Extract URLs from received web pages in batches: sort them according to their domain names, remove duplicates, and send them to the responsible spider computer. On that computer, keep an index of the URLs that have already been fetched and fetch the remaining ones.
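
In code, that batching step might look like this sketch (machine_for and send_batch are caller-supplied stubs here, not real APIs):

from collections import defaultdict
from urllib.parse import urlparse

def route_extracted_urls(urls, already_fetched, machine_for, send_batch):
    per_machine = defaultdict(set)                  # a set drops duplicate URLs
    for url in urls:
        if url in already_fetched:
            continue                                # this URL was fetched before
        domain = urlparse(url).netloc
        per_machine[machine_for(domain)].add(url)
    for machine, batch in per_machine.items():
        send_batch(machine, sorted(batch))          # sorting groups one domain's URLs together

# Example: route_extracted_urls(found_urls, fetched_index, machine_for,
#                               lambda m, b: print(m, len(b), "URLs"))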

If you keep a queue of URLs waiting to be fetched on each spider computer, you will have no performance bottlenecks. But it's quite a lot of programming to implement this.

Read the standards

I mentioned several standards (HTTP/1.x, robots.txt, cookies). Take your time to read and implement them. If you just follow examples from sites that you know, you will make mistakes (forgetting the parts of the standard that are not relevant to your samples) and cause trouble for the sites that use those additional features.

Reading the HTTP/1.1 standard document is a pain. But every little detail got added to it because somebody really needed that little detail and now uses it.
