How to connect to an https site with Scrapy via Polipo over TOR?

Question

I'm not entirely sure what the problem is here.

I'm running Python 2.7.3 and Scrapy 0.16.5.

I've created a very simple Scrapy spider to test connecting to my local Polipo proxy so I can send requests out via TOR. The basic code of my spider is as follows:

from scrapy.spider import BaseSpider

class TorSpider(BaseSpider):
    name = "tor"
    allowed_domains = ["check.torproject.org"]
    start_urls = [
        "https://check.torproject.org"
    ]

    def parse(self, response):
        print response.body

For my proxy middleware, I've defined:

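from scrapy.conf import settings  # project settings singleton (Scrapy 0.16-era import)

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route every request through the proxy configured in settings.py
        request.meta['proxy'] = settings.get('HTTP_PROXY')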

My HTTP_PROXY in my settings file is defined as HTTP_PROXY = 'http://localhost:8123'.
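For context, the relevant part of such a settings file might look roughly like the sketch below. The module path `arachnid.middlewares.ProxyMiddleware` and the order value 410 are assumptions made for illustration (the bot name `arachnid` appears in the log output of this question, but the actual project layout is not shown).

```python
# settings.py (sketch; only the proxy-related entries are shown)

# Local Polipo proxy, which forwards traffic to TOR.
HTTP_PROXY = 'http://localhost:8123'

# Register the custom proxy middleware.
# Assumption: the class lives in arachnid/middlewares.py and the order
# value 410 is arbitrary; neither is given in the question.
DOWNLOADER_MIDDLEWARES = {
    'arachnid.middlewares.ProxyMiddleware': 410,
}
```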

Now, if I change my start URL to http://check.torproject.org, everything works fine, no problems.

If I attempt to run against https://check.torproject.org, I get a 400 Bad Request error every time (I've also tried different https:// sites, and all of them have the same problem):

2013-07-23 21:36:18+0100 [scrapy] INFO: Scrapy 0.16.5 started (bot: arachnid)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, ProxyMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-23 21:36:18+0100 [tor] INFO: Spider opened
2013-07-23 21:36:18+0100 [tor] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 1 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 2 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Gave up retrying <GET https://check.torproject.org> (failed 3 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Crawled (400) <GET https://check.torproject.org> (referer: None)
2013-07-23 21:36:18+0100 [tor] INFO: Closing spider (finished)

And just to double-check that nothing is wrong with my TOR/Polipo setup: I can run the following curl command in a terminal and it connects fine: curl --proxy localhost:8123 https://check.torproject.org/

Any suggestions as to what's wrong here?

Recommended answer

Try setting the proxy directly, using 127.0.0.1 instead of localhost:

rq.meta['proxy'] = 'http://127.0.0.1:8123'

It works in my case.
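Applied to the middleware from the question, the suggestion might look like the minimal sketch below. It assumes Polipo is listening on 127.0.0.1:8123 and hard-codes that address in a class attribute named `PROXY_URL`, which is a name chosen here for illustration and not part of Scrapy. Whether this alone resolves the 400 response may depend on the Scrapy version in use.

```python
class ProxyMiddleware(object):
    """Downloader middleware that sends every request through a local proxy."""

    # Assumption: Polipo listens on 127.0.0.1:8123, as in the answer above.
    # PROXY_URL is a hypothetical attribute name, not a Scrapy setting.
    PROXY_URL = 'http://127.0.0.1:8123'

    def process_request(self, request, spider):
        # The 'proxy' meta key is what the question's middleware also sets.
        request.meta['proxy'] = self.PROXY_URL
        # Returning None lets the request continue through the remaining
        # middlewares and on to the downloader.
        return None
```

The only change from the question's version is that the proxy address is written as an IP literal instead of being read from `HTTP_PROXY` in the settings file.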
