Rotating Proxies for web scraping


Question

I've got a python web crawler and I want to distribute the download requests among many different proxy servers, probably running squid (though I'm open to alternatives). For example, it could work in a round-robin fashion, where request1 goes to proxy1, request2 to proxy2, and eventually looping back around. Any idea how to set this up?

To make it harder, I'd also like to be able to dynamically change the list of available proxies, bring some down, and add others.

If it matters, IP addresses are assigned dynamically.

Thanks :)

Answer

Make your crawler keep a list of proxies and, for each HTTP request, use the next proxy from the list in round-robin fashion. Note that this prevents you from reusing HTTP/1.1 persistent connections, since consecutive requests go through different proxies. When you modify the proxy list, subsequent requests simply pick up the new proxies (or stop using the removed ones).
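A minimal sketch of that round-robin idea, with thread-safe add/remove so the list can change while the crawler runs. The proxy URLs shown are hypothetical placeholders, not real servers:

```python
import threading


class ProxyRotator:
    """Hands out proxies in round-robin order; the list can change at runtime."""

    def __init__(self, proxies):
        self._lock = threading.Lock()
        self._proxies = list(proxies)
        self._index = 0

    def next_proxy(self):
        """Return the next proxy in rotation, or None if the list is empty."""
        with self._lock:
            if not self._proxies:
                return None
            proxy = self._proxies[self._index % len(self._proxies)]
            self._index += 1
            return proxy

    def add(self, proxy):
        with self._lock:
            self._proxies.append(proxy)

    def remove(self, proxy):
        with self._lock:
            self._proxies.remove(proxy)


rotator = ProxyRotator(["http://proxy1:3128", "http://proxy2:3128"])
```

Each request would then fetch a proxy from the rotator and pass it to the HTTP layer, e.g. with the standard library's `urllib.request.ProxyHandler({"http": proxy, "https": proxy})` wrapped in `build_opener`, constructing a fresh opener per request.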

Or keep several connections open in parallel, one to each proxy, and distribute your crawling requests across the open connections. Dynamic proxy changes can then be implemented by having each connector register itself with (or deregister from) the request dispatcher.
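The dispatcher approach can be sketched as one worker per proxy pulling URLs from a shared queue. The `fetch(url, proxy)` callable is a hypothetical stand-in for your actual download function; since each worker sticks to a single proxy, persistent connections remain usable inside a worker:

```python
import queue
import threading


def run_dispatcher(urls, proxies, fetch):
    """Distribute crawl requests across one worker thread per proxy.

    fetch(url, proxy) is caller-supplied; results are collected in any order.
    """
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = []
    results_lock = threading.Lock()

    def worker(proxy):
        # Each worker drains the shared queue, always using its own proxy.
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return
            result = fetch(url, proxy)
            with results_lock:
                results.append(result)

    threads = [threading.Thread(target=worker, args=(p,)) for p in proxies]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Adding a proxy means starting another worker; removing one means letting its worker exit, while the shared queue keeps the remaining workers fed.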
