How to use Scrapy with both Splash and Tor over Privoxy in Docker Compose

Problem description

I'm trying to run a Scrapy spider with two 'extensions':

  1. Splash for rendering JavaScript,
  2. Tor-Privoxy to provide anonymity.

As an example, I'm using the scraper of quotes.toscrape.com in https://github.com/scrapy-plugins/scrapy-splash/tree/master/example. Here is my directory structure:

.
├── docker-compose.yml
└── example
    ├── Dockerfile
    ├── scrapy.cfg
    └── scrashtest
        ├── __init__.py
        ├── settings.py
        └── spiders
            ├── __init__.py
            └── quotes.py

where the example directory is cloned from the scrapy-splash repository. I've added the following docker-compose.yml file:

version: '3'

services:
  scraper:
    build: ./example
    environment:
      - http_proxy=http://tor-privoxy:8118
    links:
      - tor-privoxy
      - splash

  tor-privoxy:
    image: rdsubhas/tor-privoxy-alpine

  splash:
    image: scrapinghub/splash

where in the settings.py file I've changed the SPLASH_URL:

# SPLASH_URL = 'http://127.0.0.1:8050/'
SPLASH_URL = 'http://splash:8050'

because Splash is running not on localhost but in a separate linked container named splash. The Dockerfile for the scraper is

FROM python:alpine
RUN apk --update add libxml2-dev libxslt-dev libffi-dev gcc musl-dev libgcc openssl-dev curl bash
RUN pip install scrapy scrapy-splash
COPY . /scraper
WORKDIR /scraper
CMD ["scrapy", "crawl", "quotes"]

The problem is that when I run this using docker-compose build and docker-compose up, I get the following logs:

Starting examplecompose_tor-privoxy_1
Starting examplecompose_splash_1
Recreating examplecompose_scraper_1
Attaching to examplecompose_splash_1, examplecompose_tor-privoxy_1, examplecompose_scraper_1
splash_1       | 2017-07-11 16:10:13+0000 [-] Log opened.
splash_1       | 2017-07-11 16:10:13.794595 [-] Splash version: 3.0
tor-privoxy_1  | 2017-07-11 16:10:13.568 7f08e999eee8 Info: Privoxy version 3.0.23
tor-privoxy_1  | 2017-07-11 16:10:13.568 7f08e999eee8 Info: Program name: privoxy
tor-privoxy_1  | Jul 11 16:10:13.578 [notice] Tor v0.2.6.10 (git-58c51dc6087b0936) running on Linux with Libevent 2.0.22-stable, OpenSSL 1.0.2d and Zlib 1.2.8.
tor-privoxy_1  | Jul 11 16:10:13.578 [notice] Tor can't help you if you use it wrong! Learn how to be safe at https://www.torproject.org/download/download#warning
splash_1       | 2017-07-11 16:10:13.795925 [-] Qt 5.9.1, PyQt 5.9, WebKit 602.1, sip 4.19.3, Twisted 16.1.1, Lua 5.2
splash_1       | 2017-07-11 16:10:13.796204 [-] Python 3.5.2 (default, Nov 17 2016, 17:05:23) [GCC 5.4.0 20160609]
tor-privoxy_1  | Jul 11 16:10:13.578 [notice] Configuration file "/etc/tor/torrc" not present, using reasonable defaults.
tor-privoxy_1  | Jul 11 16:10:13.581 [notice] Opening Socks listener on 127.0.0.1:9050
splash_1       | 2017-07-11 16:10:13.796541 [-] Open files limit: 1048576
tor-privoxy_1  | Jul 11 16:10:13.000 [notice] Parsing GEOIP IPv4 file /usr/share/tor/geoip.
splash_1       | 2017-07-11 16:10:13.796706 [-] Can't bump open files limit
tor-privoxy_1  | Jul 11 16:10:13.000 [notice] Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
splash_1       | 2017-07-11 16:10:13.903844 [-] Xvfb is started: ['Xvfb', ':1896918638', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
splash_1       | QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root'
tor-privoxy_1  | Jul 11 16:10:13.000 [warn] You are running Tor as root. You don't need to, and you probably shouldn't.
splash_1       | 2017-07-11 16:10:13.984515 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
tor-privoxy_1  | Jul 11 16:10:13.000 [notice] Bootstrapped 0%: Starting
splash_1       | 2017-07-11 16:10:14.041562 [-] verbosity=1
splash_1       | 2017-07-11 16:10:14.041732 [-] slots=50
tor-privoxy_1  | Jul 11 16:10:13.000 [notice] Bootstrapped 5%: Connecting to directory server
splash_1       | 2017-07-11 16:10:14.041806 [-] argument_cache_max_entries=500
tor-privoxy_1  | Jul 11 16:10:13.000 [notice] Bootstrapped 80%: Connecting to the Tor network
splash_1       | 2017-07-11 16:10:14.043083 [-] Web UI: enabled, Lua: enabled (sandbox: enabled)
splash_1       | 2017-07-11 16:10:14.044088 [-] Site starting on 8050
splash_1       | 2017-07-11 16:10:14.044240 [-] Starting factory <twisted.web.server.Site object at 0x7f73a4e4b3c8>
tor-privoxy_1  | Jul 11 16:10:14.000 [notice] Bootstrapped 85%: Finishing handshake with first hop
scraper_1      | 2017-07-11 16:10:15 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrashtest)
scraper_1      | 2017-07-11 16:10:15 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'scrashtest', 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage', 'NEWSPIDER_MODULE': 'scrashtest.spiders', 'SPIDER_MODULES': ['scrashtest.spiders']}
scraper_1      | 2017-07-11 16:10:15 [scrapy.middleware] INFO: Enabled extensions:
scraper_1      | ['scrapy.extensions.corestats.CoreStats',
scraper_1      |  'scrapy.extensions.telnet.TelnetConsole',
scraper_1      |  'scrapy.extensions.memusage.MemoryUsage',
scraper_1      |  'scrapy.extensions.logstats.LogStats']
scraper_1      | 2017-07-11 16:10:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
scraper_1      | ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.retry.RetryMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
scraper_1      |  'scrapy_splash.SplashCookiesMiddleware',
scraper_1      |  'scrapy_splash.SplashMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.stats.DownloaderStats']
scraper_1      | 2017-07-11 16:10:15 [scrapy.middleware] INFO: Enabled spider middlewares:
scraper_1      | ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
scraper_1      |  'scrapy_splash.SplashDeduplicateArgsMiddleware',
scraper_1      |  'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
scraper_1      |  'scrapy.spidermiddlewares.referer.RefererMiddleware',
scraper_1      |  'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
scraper_1      |  'scrapy.spidermiddlewares.depth.DepthMiddleware']
scraper_1      | 2017-07-11 16:10:15 [scrapy.middleware] INFO: Enabled item pipelines:
scraper_1      | []
scraper_1      | 2017-07-11 16:10:15 [scrapy.core.engine] INFO: Spider opened
scraper_1      | 2017-07-11 16:10:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
scraper_1      | 2017-07-11 16:10:15 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
tor-privoxy_1  | Jul 11 16:10:16.000 [notice] Bootstrapped 90%: Establishing a Tor circuit
tor-privoxy_1  | Jul 11 16:10:17.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
tor-privoxy_1  | Jul 11 16:10:17.000 [notice] Bootstrapped 100%: Done
tor-privoxy_1  | Jul 11 16:10:17.000 [warn] Received http status code 404 ("Not found") from server '216.218.222.10:443' while fetching "/tor/keys/fp/585769C78764D58426B8B52B6651A5A71137189A+80550987E1D626E3EBA5E5E75A458DE0626D088C".
scraper_1      | 2017-07-11 16:10:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
scraper_1      | 2017-07-11 16:10:29 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.goodreads.com': <GET https://www.goodreads.com/quotes>
scraper_1      | 2017-07-11 16:10:29 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'scrapinghub.com': <GET https://scrapinghub.com>
tor-privoxy_1  | Jul 11 16:10:44.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
tor-privoxy_1  | Jul 11 16:10:44.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
scraper_1      | 2017-07-11 16:10:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/adulthood/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
scraper_1      | 2017-07-11 16:10:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/be-yourself/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
tor-privoxy_1  | Jul 11 16:10:55.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
tor-privoxy_1  | Jul 11 16:10:55.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
scraper_1      | 2017-07-11 16:10:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/success/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
scraper_1      | 2017-07-11 16:10:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/books/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
tor-privoxy_1  | Jul 11 16:10:56.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
scraper_1      | 2017-07-11 16:10:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
tor-privoxy_1  | Jul 11 16:10:57.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
tor-privoxy_1  | Jul 11 16:10:57.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
scraper_1      | 2017-07-11 16:10:57 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/classic/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
scraper_1      | 2017-07-11 16:10:57 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/aliteracy/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error

where I've interrupted the process for brevity. It seems like the scraper and tor-privoxy services are alternately complaining about a 500 Internal Server Error and not being able to 'resolve or connect to address', respectively.

I'm struggling to figure out why the http_proxy and Splash don't 'work together'. Can anyone point me in the right direction?

Solution

Following the Aquarium template project (https://github.com/TeamHG-Memex/aquarium), I found that the trick is to make Splash use Tor, not the spider directly.
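
A likely reason the original setup fails (this is my reading of the logs, not something the Aquarium project states): Scrapy's HttpProxyMiddleware picks up http_proxy from the environment, so the requests to http://splash:8050 are themselves sent through Privoxy and Tor, which cannot resolve the Compose-internal hostname splash. The sketch below uses only the Python standard library to mimic that decision. goes_through_proxy is a hypothetical helper, and the "no" entry stands in for a no_proxy variable that the original setup did not set.

```python
from urllib.parse import urlsplit
from urllib.request import proxy_bypass_environment


def goes_through_proxy(url, proxies):
    """Return True if a request to `url` would be sent via the configured proxy.

    `proxies` maps a scheme to a proxy URL, plus an optional "no" entry
    holding the comma-separated no_proxy host list.
    """
    parts = urlsplit(url)
    if parts.scheme not in proxies:
        return False  # no proxy configured for this scheme
    # A host listed under "no" bypasses the proxy; everything else is proxied.
    return not proxy_bypass_environment(parts.hostname, proxies)


# What the scraper container sees with http_proxy=http://tor-privoxy:8118
original = {"http": "http://tor-privoxy:8118"}

# Hypothetical variant with no_proxy=splash (not part of the original setup)
with_bypass = dict(original, no="splash")

if __name__ == "__main__":
    splash_call = "http://splash:8050/render.json"
    print(goes_through_proxy(splash_call, original))     # Splash call is proxied too
    print(goes_through_proxy(splash_call, with_bypass))  # Splash call goes direct
```

Exempting splash via no_proxy would make the render endpoint reachable again, but Splash would then fetch the pages directly rather than over Tor, which is why the proxy is configured inside Splash instead.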

My adapted project has the following structure:

.
├── docker-compose.yml
├── example
│   ├── Dockerfile
│   ├── scrapy.cfg
│   └── scrashtest
│       ├── __init__.py
│       ├── settings.py
│       └── spiders
│           ├── __init__.py
│           └── quotes.py
└── splash
    └── proxy-profiles
        └── default.ini

and the docker-compose.yml is

version: '3'

services:
  scraper:
    build: ./example
    links:
      - splash

  tor-privoxy:
    image: rdsubhas/tor-privoxy-alpine

  splash:
    image: scrapinghub/splash
    volumes:
      - ./splash/proxy-profiles:/etc/splash/proxy-profiles:ro
    links:
      - tor-privoxy

where I've mounted the proxy-profiles directory as a volume into the splash container following http://splash.readthedocs.io/en/stable/api.html#proxy-profiles. The default.ini reads

[proxy]

host=tor-privoxy
port=8118

(I also noticed it is essential to call it default.ini).
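
If the profile needs to be generated rather than written by hand, a minimal sketch with Python's configparser could look like this. render_proxy_profile is a hypothetical helper name; the section and key names are the ones from the default.ini above.

```python
import configparser
import io


def render_proxy_profile(host, port):
    """Return the text of a Splash proxy profile with a single [proxy] section."""
    config = configparser.ConfigParser()
    config["proxy"] = {"host": host, "port": str(port)}
    buffer = io.StringIO()
    # Write "key=value" without spaces, matching the hand-written file.
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


if __name__ == "__main__":
    # Save this output as splash/proxy-profiles/default.ini
    print(render_proxy_profile("tor-privoxy", 8118))
```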

With this setup, upon docker-compose build and docker-compose up the scraper runs successfully using Splash.
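
To spot-check that Splash really goes out through Tor, one option is to ask Splash to render a page that reports whether the visitor arrives via Tor. The sketch below only builds the request URL for Splash's render.html endpoint. splash_render_url is a hypothetical helper, and https://check.torproject.org/ as the target and the 30-second timeout are my assumptions. The URL has to be fetched from a container that can resolve splash, such as the scraper, because the compose file above does not publish port 8050 to the host.

```python
from urllib.parse import urlencode


def splash_render_url(splash_url, target_url, timeout=30):
    """Build a render.html URL that asks Splash to load `target_url`."""
    query = urlencode({"url": target_url, "timeout": timeout})
    return splash_url.rstrip("/") + "/render.html?" + query


if __name__ == "__main__":
    # Fetch the printed URL from inside the scraper container and inspect the HTML.
    print(splash_render_url("http://splash:8050", "https://check.torproject.org/"))
```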
