Unable to use proxies in Scrapy project
Question
I have been trying to crawl a website that has seemingly identified and blocked my IP, and is throwing a 429 Too Many Requests response.
I installed scrapy-proxies from this link: https://github.com/aivarsk/scrapy-proxies and followed the given instructions. I got a list of proxies from here: http://www.gatherproxy.com/ and here is how my settings.py and proxylist.txt look:
settings.py
BOT_NAME = 'project'
SPIDER_MODULES = ['project.spiders']
NEWSPIDER_MODULE = 'project.spiders'
# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [429, 500, 503, 504, 400, 403, 404, 408]
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
PROXY_LIST = "filepath\proxylist.txt"
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 2
PROXY_MODE = 0
DOWNLOAD_HANDLERS = {'s3': None}
EXTENSIONS = {
    'scrapy.telnet.TelnetConsole': None,
}
proxylist.txt
http://195.208.172.20:8080
http://154.119.56.179:9999
http://124.12.50.43:8088
http://61.7.168.232:52136
http://122.193.188.236:8118
Yet when I run my crawler, I get the following error:
[scrapy.proxies] DEBUG: Proxy user pass not found
I tried searching for this specific error on Google but could not find any solutions.
Help will be highly appreciated. Thanks a lot in advance.
Answer
I suggest you create your own middleware to specify the IP:PORT like this, and place this proxies.py middleware file inside your project's middleware folder:
class ProxiesMiddleware(object):
    def __init__(self, settings):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        request.meta['proxy'] = "http://IP:PORT"
Add the ProxiesMiddleware middleware line to your settings.py:
DOWNLOADER_MIDDLEWARES = {
    'yourproject.middleware.proxies.ProxiesMiddleware': 400,
}
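Since the question involves a whole list of proxies rather than a single address, the same middleware pattern can be extended to rotate through them. The sketch below is a minimal illustration of that idea, not part of the original answer: the `ROTATING_PROXY_LIST` setting name and the proxy addresses used are assumptions you would replace with your own.

```python
import random

class RotatingProxiesMiddleware(object):
    """Sketch: pick a random proxy from a settings list per request.

    The ROTATING_PROXY_LIST setting name is an assumption for this
    example; define it in settings.py with your own proxy URLs.
    """

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy settings expose getlist() for list-valued settings.
        return cls(crawler.settings.getlist('ROTATING_PROXY_LIST'))

    def process_request(self, request, spider):
        if self.proxies:
            # Assign a randomly chosen proxy to each outgoing request.
            request.meta['proxy'] = random.choice(self.proxies)
```

You would then register this class in `DOWNLOADER_MIDDLEWARES` the same way as `ProxiesMiddleware` above, and set `ROTATING_PROXY_LIST = [...]` in settings.py.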