How to write a DownloadHandler for scrapy that makes requests through socksipy?

Question

I'm trying to use scrapy over Tor. I've been trying to get my head around how to write a DownloadHandler for scrapy that uses socksipy connections.

Scrapy's HTTP11DownloadHandler is here: https://github.com/scrapy/scrapy/blob/master/scrapy/core/downloader/handlers/http11.py

Here is an example for creating a custom download handler: https://github.com/scrapinghub/scrapyjs/blob/master/scrapyjs/dhandler.py

Here's the code for creating a SocksiPyConnection class: http://blog.databigbang.com/distributed-scraping-with-multiple-tor-circuits/

import httplib  # Python 2 stdlib (http.client in Python 3); this recipe targets Python 2
import socks    # SocksiPy

class SocksiPyConnection(httplib.HTTPConnection):
    def __init__(self, proxytype, proxyaddr, proxyport=None, rdns=True, username=None, password=None, *args, **kwargs):
        self.proxyargs = (proxytype, proxyaddr, proxyport, rdns, username, password)
        httplib.HTTPConnection.__init__(self, *args, **kwargs)

    def connect(self):
        # Open a SocksiPy socket through the configured proxy instead of a plain TCP socket
        self.sock = socks.socksocket()
        self.sock.setproxy(*self.proxyargs)
        if isinstance(self.timeout, float):
            self.sock.settimeout(self.timeout)
        self.sock.connect((self.host, self.port))

With the complexity of twisted reactors in the scrapy code, I can't figure out how to plug socksipy into it. Any thoughts?

Please do not answer with privoxy-like alternatives or post answers saying "scrapy doesn't work with socks proxies" - I know that, which is why I'm trying to write a custom Downloader that makes requests using socksipy.

Answer

I was able to get this working using https://github.com/habnabit/txsocksx.

After doing a pip install txsocksx, I needed to replace scrapy's ScrapyAgent with txsocksx.http.SOCKS5Agent.

I simply copied the code for HTTP11DownloadHandler and ScrapyAgent from scrapy/core/downloader/handlers/http11.py, subclassed them and wrote this code:

# Import paths assume the Scrapy internals this answer was written against:
from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ClientEndpoint
from txsocksx.http import SOCKS5Agent
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler, ScrapyAgent
from scrapy.core.downloader.webclient import _parse


class TorProxyDownloadHandler(HTTP11DownloadHandler):

    def download_request(self, request, spider):
        """Return a deferred for the HTTP download"""
        agent = ScrapyTorAgent(contextFactory=self._contextFactory, pool=self._pool)
        return agent.download_request(request)


class ScrapyTorAgent(ScrapyAgent):
    def _get_agent(self, request, timeout):
        bindaddress = request.meta.get('bindaddress') or self._bindAddress
        proxy = request.meta.get('proxy')
        if proxy:
            _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
            scheme = _parse(request.url)[0]
            omitConnectTunnel = proxyParams.find('noconnect') >= 0
            if scheme == 'https' and not omitConnectTunnel:
                # HTTPS without 'noconnect' still goes through a CONNECT tunnel
                proxyConf = (proxyHost, proxyPort,
                             request.headers.get('Proxy-Authorization', None))
                return self._TunnelingAgent(reactor, proxyConf,
                    contextFactory=self._contextFactory, connectTimeout=timeout,
                    bindAddress=bindaddress, pool=self._pool)
            else:
                # Everything else is routed through the SOCKS5 proxy
                _, _, host, port, proxyParams = _parse(request.url)
                proxyEndpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
                    timeout=timeout, bindAddress=bindaddress)
                agent = SOCKS5Agent(reactor, proxyEndpoint=proxyEndpoint)
                return agent

        return self._Agent(reactor, contextFactory=self._contextFactory,
            connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)
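The branching above mirrors Scrapy's own proxy handling: an HTTPS request whose proxy URL does not carry a 'noconnect' flag gets a CONNECT tunnel, and everything else goes to the SOCKS5 endpoint. A standalone sketch of that decision (using the standard library's urlparse instead of Scrapy's internal _parse, and simplifying the 'noconnect' check to a substring test, both assumptions):

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2, as in the original code

def choose_agent(request_url, proxy_url):
    """Return which agent _get_agent would pick for this request/proxy pair."""
    scheme = urlparse(request_url).scheme
    # assumption: 'noconnect' appears somewhere in the proxy URL, as with the
    # proxyParams.find('noconnect') check above
    omit_connect_tunnel = 'noconnect' in proxy_url
    if scheme == 'https' and not omit_connect_tunnel:
        return 'tunneling'  # CONNECT through the proxy for TLS
    return 'socks5'         # SOCKS5Agent via a TCP endpoint to the proxy

choose_agent('http://example.com/', 'http://127.0.0.1:9050')  # -> 'socks5'
```

Note the consequence: plain HTTPS requests take the tunneling branch, so only the non-tunneled path actually uses SOCKS5Agent.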

In settings.py, something like this is needed:

DOWNLOAD_HANDLERS = {
    'http': 'crawler.http.TorProxyDownloadHandler'
}
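Since the _get_agent code above also branches on the https scheme, HTTPS pages presumably need the handler registered for that scheme too. A sketch, assuming the same 'crawler.http' module path as above:

```python
# settings.py -- assumption: the handler lives at crawler/http.py,
# matching the 'crawler.http' path used for the http scheme.
DOWNLOAD_HANDLERS = {
    'http': 'crawler.http.TorProxyDownloadHandler',
    'https': 'crawler.http.TorProxyDownloadHandler',
}
```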

Now proxying with Scrapy will work through a SOCKS proxy like Tor.
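One detail worth spelling out: ScrapyTorAgent only takes the SOCKS route when request.meta['proxy'] is set, so every request needs that key pointing at the Tor SOCKS port. A minimal downloader-middleware sketch (assuming Tor's default SOCKS port 9050, and that you register the class in DOWNLOADER_MIDDLEWARES yourself):

```python
# Assumption: Tor is listening on its default SOCKS port, 9050.
TOR_PROXY = 'http://127.0.0.1:9050'

class TorProxyMiddleware(object):
    """Downloader middleware that tags every request with the Tor proxy,
    so that ScrapyTorAgent._get_agent takes its SOCKS5 branch."""

    def process_request(self, request, spider):
        # Only set the proxy if the request doesn't already carry one
        request.meta.setdefault('proxy', TOR_PROXY)
```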
