Avoid IP blocking with web scraping


Question

Hi all,

I developed a web scraper (using C#) that should be able to make thousands of requests each time.

The problem is that the website's server will block my IP after a number of requests.

Questions:

1- How can I prevent being blocked?
2- How can I know when the website's server will block my IP? I mean, how do I know whether my limit is a certain amount of traffic or a certain number of requests?

Thanks.

Solution

There's no way to do this. One way would be to limit the scraping to a very slow rate, which kind of defeats the purpose of scraping.
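As a rough illustration of the "very slow rate" idea, here is a minimal sketch of a throttle that enforces a minimum gap between consecutive requests to one server. It is written in Python for brevity (the same logic carries over to C#). The `Throttle` class, its parameter names, and the 2-second interval in the usage comment are all assumptions made for the example, not values any particular site is known to accept.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests to one server."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock   # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval_s has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage sketch (fetch() is a hypothetical download function):
#   throttle = Throttle(min_interval_s=2.0)  # 2 s is an arbitrary placeholder
#   for url in urls:
#       throttle.wait()
#       fetch(url)
```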

Alternatively, spread the scraping out over multiple domains. For example, pick 100 domains, get 1 page from domain-1, then the next from domain-2, and so on up to domain-100, then get the 2nd page from domain-1, then from domain-2, and so on. The trick here is that, from each server's perspective, this artificially slows your scraping down to 1/100 of its former speed, but you don't actually lose any overall scraping speed because you are scraping from multiple sites. Make sense?
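The interleaving described above can be sketched as a round-robin over per-domain page lists. This is only an illustration in Python. The `interleave` function and the `domain-N.example` URLs in the usage comment are hypothetical names introduced here, and the sketch only decides the order of requests; it does not download anything.

```python
from itertools import zip_longest


def interleave(pages_by_domain):
    """Yield URLs round-robin: page 1 of every domain, then page 2, and so on.

    pages_by_domain maps a domain name to the ordered list of URLs to fetch
    from it. Domains with fewer pages simply drop out of later rounds.
    """
    missing = object()  # sentinel marking "this domain has no more pages"
    for batch in zip_longest(*pages_by_domain.values(), fillvalue=missing):
        for url in batch:
            if url is not missing:
                yield url


# Usage sketch with placeholder URLs:
#   queues = {
#       "domain-1.example": ["http://domain-1.example/p1", "http://domain-1.example/p2"],
#       "domain-2.example": ["http://domain-2.example/p1", "http://domain-2.example/p2"],
#   }
#   for url in interleave(queues):
#       fetch(url)  # hypothetical download function
```

With N domains in the rotation, each individual server sees only every N-th request, which is the 1/100 slowdown the answer describes for 100 domains.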


1) That's very easy. Don't do anything that might lead to being considered a threat. Bombarding a server with 'thousands of requests each time' could be considered to be hostile.

2) Now that's a good idea. Each server should post information on how much abuse its owner will tolerate. Seriously, it's very much like when you misbehave at somebody's house. When the owner grabs you by the collar and shows you the door, then you know how far you could go.
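Since servers generally do not publish their limits, about the only way to learn one is to notice when the door is shown. The following Python sketch counts how many responses arrived before the first block-like status code. It assumes the site signals a block with HTTP 403 or 429; that is an assumption, since many sites drop connections or serve a CAPTCHA page instead. The function name `requests_before_block` is hypothetical.

```python
# Assumption: these status codes indicate the server is refusing or rate-limiting us.
BLOCK_STATUSES = {403, 429}


def requests_before_block(status_codes):
    """Return how many responses preceded the first block-like status.

    status_codes is the sequence of HTTP status codes in the order received.
    Returns None if no block-like status was seen.
    """
    for count, status in enumerate(status_codes):
        if status in BLOCK_STATUSES:
            return count
    return None
```

A number observed this way is only a data point for one site at one moment, and the sensible reaction to it is to stop and slow down rather than to keep probing.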


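It's not possible to get blocked if you are able to do it my way.

Use an ADSL modem such as an AirTies, have your server use that internet connection, and send the modem a reset command on a schedule. (On a connection with a dynamic IP, resetting the modem typically gets you a new address from the ISP.)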

That works like a charm. :)

