How to write a DownloadHandler for scrapy that makes requests through socksipy?
Question
I'm trying to use scrapy over Tor. I've been trying to get my head around how to write a DownloadHandler for scrapy that uses socksipy connections.
Scrapy's HTTP11DownloadHandler is here: https://github.com/scrapy/scrapy/blob/master/scrapy/core/downloader/handlers/http11.py
Here is an example for creating a custom download handler: https://github.com/scrapinghub/scrapyjs/blob/master/scrapyjs/dhandler.py
Here's the code for creating a SocksiPyConnection class: http://blog.databigbang.com/distributed-scraping-with-multiple-tor-circuits/
import httplib  # Python 2; renamed http.client in Python 3
import socks    # SocksiPy

class SocksiPyConnection(httplib.HTTPConnection):
    def __init__(self, proxytype, proxyaddr, proxyport=None, rdns=True,
                 username=None, password=None, *args, **kwargs):
        self.proxyargs = (proxytype, proxyaddr, proxyport, rdns, username, password)
        httplib.HTTPConnection.__init__(self, *args, **kwargs)

    def connect(self):
        self.sock = socks.socksocket()
        self.sock.setproxy(*self.proxyargs)
        if isinstance(self.timeout, float):
            self.sock.settimeout(self.timeout)
        self.sock.connect((self.host, self.port))
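For readers on Python 3, the same idea might look like the sketch below, where httplib has become http.client and the socks module comes from the PySocks package (a successor to SocksiPy). This is an untested port, not code from the original answer; the class name SocksHTTPConnection is made up here.

```python
import http.client

class SocksHTTPConnection(http.client.HTTPConnection):
    """Sketch of a Python 3 port of SocksiPyConnection.
    Assumes the PySocks package provides the `socks` module."""

    def __init__(self, proxytype, proxyaddr, proxyport=None, rdns=True,
                 username=None, password=None, *args, **kwargs):
        self.proxyargs = (proxytype, proxyaddr, proxyport, rdns, username, password)
        super().__init__(*args, **kwargs)

    def connect(self):
        # Imported lazily so the class can be defined without PySocks installed.
        import socks
        self.sock = socks.socksocket()
        self.sock.setproxy(*self.proxyargs)
        if isinstance(self.timeout, float):
            self.sock.settimeout(self.timeout)
        self.sock.connect((self.host, self.port))

# Constructing the object does not open a socket, so this is safe to run without
# a proxy listening; 2 is the PROXY_TYPE_SOCKS5 value in SocksiPy/PySocks.
conn = SocksHTTPConnection(2, "127.0.0.1", 9050, host="example.com", port=80)
```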
With the complexity of twisted reactors in the scrapy code, I can't figure out how to plug socksipy into it. Any thoughts?
Please do not answer with privoxy-like alternatives or post answers saying "scrapy doesn't work with socks proxies" - I know that, which is why I'm trying to write a custom Downloader that makes requests using socksipy.
Answer
I was able to get this working using https://github.com/habnabit/txsocksx. After doing a pip install txsocksx, I needed to replace scrapy's ScrapyAgent with txsocksx.http.SOCKS5Agent.
I simply copied the code for HTTP11DownloadHandler and ScrapyAgent from scrapy/core/downloader/handlers/http.py, subclassed them and wrote this code:
from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ClientEndpoint
from txsocksx.http import SOCKS5Agent
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler, ScrapyAgent
from scrapy.core.downloader.webclient import _parse

class TorProxyDownloadHandler(HTTP11DownloadHandler):

    def download_request(self, request, spider):
        """Return a deferred for the HTTP download"""
        agent = ScrapyTorAgent(contextFactory=self._contextFactory, pool=self._pool)
        return agent.download_request(request)


class ScrapyTorAgent(ScrapyAgent):

    def _get_agent(self, request, timeout):
        bindaddress = request.meta.get('bindaddress') or self._bindAddress
        proxy = request.meta.get('proxy')
        if proxy:
            _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
            scheme = _parse(request.url)[0]
            omitConnectTunnel = proxyParams.find('noconnect') >= 0
            if scheme == 'https' and not omitConnectTunnel:
                proxyConf = (proxyHost, proxyPort,
                             request.headers.get('Proxy-Authorization', None))
                return self._TunnelingAgent(reactor, proxyConf,
                    contextFactory=self._contextFactory, connectTimeout=timeout,
                    bindAddress=bindaddress, pool=self._pool)
            else:
                _, _, host, port, proxyParams = _parse(request.url)
                proxyEndpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
                    timeout=timeout, bindAddress=bindaddress)
                agent = SOCKS5Agent(reactor, proxyEndpoint=proxyEndpoint)
                return agent

        return self._Agent(reactor, contextFactory=self._contextFactory,
            connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)
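Note that the SOCKS5Agent branch only fires when request.meta['proxy'] is set: the handler parses that value to get the proxy's host and port. As a rough stdlib illustration of that extraction (split_proxy is a hypothetical helper written here for clarity, not part of scrapy; scrapy's own _parse returns a different tuple shape):

```python
from urllib.parse import urlparse

def split_proxy(proxy_url):
    """Hypothetical helper: pull the proxy host and port out of a
    request.meta['proxy'] value such as 'socks5://127.0.0.1:9050'."""
    parsed = urlparse(proxy_url)
    return parsed.hostname, parsed.port

host, port = split_proxy("socks5://127.0.0.1:9050")
# host == "127.0.0.1", port == 9050
```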
In settings.py, something like this is needed:
DOWNLOAD_HANDLERS = {
    'http': 'crawler.http.TorProxyDownloadHandler'
}
Now proxying with Scrapy will work through a SOCKS proxy like Tor.
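A request routed through Tor would then carry the proxy in its meta, which is what the _get_agent branch above keys off. The sketch below is illustrative: tor_request_meta is a made-up helper, and 9050 is Tor's default SOCKS port (the Tor Browser bundle listens on 9150 instead).

```python
def tor_request_meta(socks_port=9050):
    """Hypothetical helper: build the request.meta dict that makes the
    custom download handler route the request through a local SOCKS proxy."""
    return {"proxy": "socks5://127.0.0.1:%d" % socks_port}

meta = tor_request_meta()
# meta == {"proxy": "socks5://127.0.0.1:9050"}
```

In a spider, this dict would be passed as the meta argument when constructing a Request.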