Get past request limit in crawling a web site


Problem description


I'm working on a web crawler that indexes sites that don't want to be indexed.


My first attempt: I wrote a C# crawler that goes through each and every page and downloads them. This resulted in my IP being blocked by their servers within 10 minutes.


I moved it to Amazon EC2 and wrote a distributed Python script that runs about 50 instances. This stays just above their threshold for booting me. This also costs about $1,900 a month...


I moved back to my initial idea and put it behind a shortened version of the Tor network. This worked, but was very slow.


I'm out of ideas. How can I get past their blocking me for repeated requests?


When I say "block", they are actually giving me random 404 Not Found errors on pages that definitely exist. It's random and only starts happening after I pass about 300 requests in an hour.

Answer

First and foremost: if a website doesn't want you to crawl it too often, then you shouldn't! It's basic politeness, and you should always stick to it.
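On the politeness point, the two standard mechanisms are honoring `robots.txt` and spacing requests out. A minimal sketch using only the standard library; the threshold of 250 requests/hour is an assumption derived from the ~300/hour figure in the question, and `fetch` is a placeholder for whatever download function is in use:

```python
import time
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Check a URL against the site's robots.txt rules before fetching it."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

def pace(max_per_hour: float) -> float:
    """Minimum delay in seconds between requests to stay under
    max_per_hour requests in any one-hour window."""
    return 3600.0 / max_per_hour

def polite_crawl(urls, fetch, robots_txt, agent="mybot", max_per_hour=250):
    """Fetch only allowed URLs, sleeping between requests so the
    hourly rate stays below the (assumed) blocking threshold."""
    delay = pace(max_per_hour)  # ~14.4 s per request at 250 req/hour
    pages = {}
    for url in urls:
        if not allowed(robots_txt, agent, url):
            continue
        pages[url] = fetch(url)
        time.sleep(delay)
    return pages
```

If the random 404s really do start around 300 requests/hour, pacing just below that point may make them stop without any further machinery.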


However, I do understand that there are some websites, like Google, who make their money by crawling your website all day long and when you try to crawl Google, then they block you.


In any case, the alternative to getting a bunch of EC2 machines is to get proxy servers. Proxy servers are MUCH cheaper than EC2, case in point: http://5socks.net/en_proxy_socks_tarifs.htm


Of course, proxy servers are not as fast as EC2 (bandwidth-wise), but you should be able to strike a balance where you get similar or higher throughput than your 50 EC2 instances for substantially less than what you're paying now. This involves searching for affordable proxies and finding ones that give you similar results. One thing to note: just like you, other people may be using the proxy service to crawl the website you're crawling, and they may not be as smart about how they crawl it, so the whole proxy service can get blocked due to the activity of some other client of the proxy service (I've personally seen it).
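If you go the proxy route, the usual pattern is to rotate requests across the pool so that no single exit IP crosses the site's per-IP limit. A sketch using only the standard library; the proxy addresses are placeholders, not real endpoints, and note that stdlib `ProxyHandler` handles HTTP proxies (SOCKS proxies like the ones linked above need a third-party library):

```python
from itertools import cycle
from urllib.request import ProxyHandler, build_opener

class ProxyRotator:
    """Round-robin over a pool of HTTP proxy URLs so requests are
    spread across many exit IPs."""
    def __init__(self, proxies):
        self._pool = cycle(proxies)

    def next_proxy(self):
        return next(self._pool)

    def fetch(self, url):
        """Fetch a URL through the next proxy in the pool."""
        proxy = self.next_proxy()
        opener = build_opener(ProxyHandler({"http": proxy, "https": proxy}))
        return opener.open(url, timeout=30).read()

# Placeholder proxy addresses; substitute the ones you actually rent.
rotator = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080",
                        "http://proxy3:8080"])
```

Given the shared-pool caveat above, it is also worth tracking failure rates per proxy so you can drop exits that another client has already gotten blocked.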


This is a little crazy and I haven't done the math behind it, but you could start a proxy service yourself and sell proxy services to others. You can't use all of your EC2 machines' bandwidth anyway, so the best way for you to cut costs is to do what Amazon does: sub-lease the hardware.

