Prevent Custom Web Crawler from being blocked


Problem Description

I am creating a new web crawler in C# to crawl some specific websites. Everything goes fine, but the problem is that some websites block my crawler's IP address after a certain number of requests. I tried adding delays between my crawl requests, but that did not work.

Is there any way to prevent websites from blocking my crawler? Solutions like the following would help (but I need to know how to apply them):


  • simulating Googlebot or Yahoo Slurp (see the User-Agent sketch below)
  • using multiple IP addresses (even fake IP addresses) as the crawler's client IP

Any solution would help.
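Regarding the first idea, the simplest version is to send a search engine's User-Agent header with each request. Below is a minimal C# sketch using HttpClient; the target URL is a placeholder, and note that sites which verify real Googlebot traffic via reverse DNS will still detect the spoofed header.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class CrawlerUserAgent
{
    static async Task Main()
    {
        using var client = new HttpClient();

        // Googlebot's published desktop User-Agent string. Many sites verify
        // real Googlebot traffic via reverse DNS, so spoofing the header
        // alone may not be enough (and may violate a site's terms of service).
        client.DefaultRequestHeaders.UserAgent.ParseAdd(
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");

        // Placeholder target URL.
        string html = await client.GetStringAsync("https://example.com/");
        Console.WriteLine($"Fetched {html.Length} characters");
    }
}
```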

Recommended Answer

If speed/throughput is not a huge concern, then probably the best solution is to install Tor and Privoxy and route your crawler through them. Your crawler will then have a randomly changing IP address.
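A minimal C# sketch of that setup, assuming Tor and Privoxy are running locally on their default ports (Privoxy listening on 8118 and forwarding to Tor's SOCKS port 9050):

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class TorCrawler
{
    static async Task Main()
    {
        // Privoxy listens on 127.0.0.1:8118 by default; it forwards to Tor
        // when its config contains a line like:
        //   forward-socks5 / 127.0.0.1:9050 .
        var handler = new HttpClientHandler
        {
            Proxy = new WebProxy("http://127.0.0.1:8118"),
            UseProxy = true
        };

        using var client = new HttpClient(handler);

        // check.torproject.org reports whether the request arrived via Tor.
        string html = await client.GetStringAsync("https://check.torproject.org/");
        Console.WriteLine(html.Contains("Congratulations")
            ? "Traffic is routed through Tor"
            : "Traffic is NOT routed through Tor");
    }
}
```

Tor builds new circuits periodically, so the exit IP changes over time; a new circuit (and thus a new exit IP) can also be requested explicitly via Tor's control port with the NEWNYM signal.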

This is a very effective technique if you need to crawl sites that do not want you crawling them. It also provides a layer of protection/anonymity by making the activities of your crawler very difficult to trace back to you.

Of course, if sites are blocking your crawler because it is just going too fast, then perhaps you should just rate-limit it a bit.
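A simple rate-limiting sketch in the same vein; the delay value and URLs below are placeholders, and the delay should be tuned to whatever the target site tolerates:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class RateLimitedCrawler
{
    static readonly HttpClient Client = new HttpClient();

    // Hypothetical delay; tune per target site.
    static readonly TimeSpan DelayBetweenRequests = TimeSpan.FromSeconds(5);

    static async Task Main()
    {
        // Placeholder URLs standing in for a real crawl frontier.
        string[] urls = { "https://example.com/page1", "https://example.com/page2" };

        foreach (var url in urls)
        {
            string html = await Client.GetStringAsync(url);
            Console.WriteLine($"{url}: {html.Length} characters");

            // Pause before the next request so the site is not hammered.
            await Task.Delay(DelayBetweenRequests);
        }
    }
}
```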

