Web crawler performance


Question

I am interested to know, in a very general situation (a home-brew amateur web crawler), what the performance of such a thing would be. More specifically, how many pages can a crawler process?

When I say home-brew, take that in all senses: a 2.4 GHz Core 2 processor, written in Java, a 50 Mbit internet connection, etc.

Any resources you can share in this regard would be greatly appreciated.

Many thanks,

Carlos

Answer

First of all, the speed of your computer won't be the limiting factor; as for the connection, you should artificially limit your crawler's speed, because most sites will ban your IP address if you start hammering them. In other words, don't crawl a single site too quickly (10+ seconds between requests should be fine for 99.99% of sites, but go below that at your own peril).
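One way to enforce such a per-site delay is to track the last request time for each host. The sketch below is a hypothetical illustration (the class and method names are my own, not from the answer): it tells the caller how long to sleep before the next request to a given host.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-host politeness tracker: callers ask how long to wait
// before hitting a host so that requests to the same host are spaced at
// least `delayMs` apart.
class PolitenessTracker {
    private final long delayMs;
    private final Map<String, Long> lastFetch = new HashMap<>();

    PolitenessTracker(long delayMs) {
        this.delayMs = delayMs;
    }

    // Returns the milliseconds the caller should sleep before requesting
    // from `host` at time `nowMs`, and records the fetch as happening
    // right after that wait elapses.
    synchronized long millisToWait(String host, long nowMs) {
        Long last = lastFetch.get(host);
        long wait = (last == null) ? 0 : Math.max(0, last + delayMs - nowMs);
        lastFetch.put(host, nowMs + wait);
        return wait;
    }
}
```

A crawler thread would call `millisToWait` with the target host and `System.currentTimeMillis()`, then `Thread.sleep` for the returned amount before issuing the request.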

So, while you could crawl a single site in multiple threads, I'd suggest that each thread crawl a different site (check that they aren't on a shared IP address, either); that way, you can saturate your connection with a lower chance of getting banned from the spidered sites.
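A minimal way to realize "one thread per site" is to partition the URL frontier by host and hand each host's queue to its own worker. This is a sketch under my own naming (`HostPartitioner` and `crawlSequentially` are hypothetical, not from the answer):

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: group a frontier of URLs by host so that the host, not the page,
// becomes the unit of parallelism; each worker owns exactly one host queue.
class HostPartitioner {
    static Map<String, List<String>> byHost(List<String> urls) {
        Map<String, List<String>> queues = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            queues.computeIfAbsent(host, h -> new ArrayList<>()).add(url);
        }
        return queues;
    }
}

// Usage sketch: submit one task per host to a thread pool; because each
// task crawls its queue sequentially, no site ever sees more than one
// in-flight request at a time (crawlSequentially is a placeholder).
//
//   ExecutorService pool = Executors.newFixedThreadPool(queues.size());
//   queues.forEach((host, urls) -> pool.submit(() -> crawlSequentially(urls)));
```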

Some sites don't want you to crawl parts of the site, and there's a commonly used mechanism you should honor: the robots.txt file. Read up on the standard and implement it.
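To give a flavor of what "implement it" means, here is a deliberately simplified robots.txt check of my own devising. It honors only `User-agent: *` groups and plain `Disallow:` prefix rules; a real implementation also needs `Allow`, per-agent groups, and wildcard handling, so treat this as a sketch, not a compliant parser:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified robots.txt checker (illustrative only): collects the
// Disallow prefixes from "User-agent: *" groups and rejects any path
// that starts with one of them.
class RobotsTxt {
    private final List<String> disallowed = new ArrayList<>();

    RobotsTxt(String robotsTxtBody) {
        boolean inStarGroup = false;
        for (String line : robotsTxtBody.split("\n")) {
            String l = line.trim();
            if (l.toLowerCase().startsWith("user-agent:")) {
                inStarGroup = l.substring(11).trim().equals("*");
            } else if (inStarGroup && l.toLowerCase().startsWith("disallow:")) {
                String path = l.substring(9).trim();
                if (!path.isEmpty()) disallowed.add(path);
            }
        }
    }

    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```

The crawler would fetch `/robots.txt` from each host once, build one of these per host, and consult `isAllowed` before enqueueing any URL on that host.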

Note also that some sites prohibit any automated crawling at all; depending on the site's jurisdiction (yours may also apply), breaking this may be illegal (you are responsible for what your script does; "the robot did it" is not even an excuse, much less a defense).

