scrapy proxy middleware without project


Question

I am using scrapy's runspider method to run a spider that I've set up and defined without a project. I am setting my custom settings and downloader middlewares to enable the HTTP proxy middleware as follows:

custom_settings = {
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    }
}
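
For context, a minimal sketch of how such a standalone spider file might look (the class name, spider name, and start URL are placeholders, not from the question); it can be run directly with scrapy runspider myspider.py:

import scrapy

class ProxySpider(scrapy.Spider):
    # Hypothetical single-file spider; no Scrapy project is needed.
    name = "proxy_spider"
    start_urls = ["https://example.com"]

    # Per-spider settings take the place of a project's settings.py
    # when the spider is run via `scrapy runspider`.
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
        },
    }

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)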

and then using it in my request with

request.meta['proxy'] = "proxy-ip:proxy-port"

yield request
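
Continuing the sketch above, a start_requests method would attach the proxy to each request roughly like this (again a sketch; the proxy address is the question's placeholder, and note it carries no scheme, which is what the answer below addresses):

    def start_requests(self):
        for url in self.start_urls:
            request = scrapy.Request(url, callback=self.parse)
            # The form described in the question: host and port only.
            request.meta['proxy'] = "proxy-ip:proxy-port"
            yield request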

but the spider does not run and says:

File "/usr/lib/python2.7/dist-packages/twisted/internet/abstract.py", line 522, in isIPv6Address if '%' in addr: TypeError: argument of type 'NoneType' is not iterable

What am I doing wrong?

Answer

After a lot of digging (not much logging going on in Scrapy, I'm afraid), I found that this problem can be caused by not specifying the scheme in the proxy address; i.e., Scrapy expects the proxy to be passed as a URI, so in your case, instead of:

request.meta['proxy'] = "proxy-ip:proxy-port"  # doesn't work

you want this:

request.meta['proxy'] = "http://proxy-ip:proxy-port"  # does work

(As far as I can make out, the http is just ignored, but without it the rest can't be parsed by urlparse).
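
A quick way to see what goes wrong, as a sketch using Python 3's urllib.parse (the question's traceback is Python 2, where the module is called urlparse, but the result is the same for this input as far as I can tell; the proxy address here is made up for illustration): without a scheme there is no netloc, so no hostname can be extracted from the proxy URI, which is presumably how Twisted ends up being handed the None seen in the traceback above.

from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

# Hypothetical proxy address, for illustration only.
print(urlparse("10.0.0.1:3128").hostname)         # None -> no host to connect to
print(urlparse("http://10.0.0.1:3128").hostname)  # '10.0.0.1'
print(urlparse("http://10.0.0.1:3128").port)      # 3128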
