Scrapy+Splash returns 403 for any site


Problem description

For some reason, I get a 403 for every request when using Splash. What am I doing wrong?

Following https://github.com/scrapy-plugins/scrapy-splash, I set up all the settings:

SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Started Splash with Docker:

sudo docker run -p 8050:8050 scrapinghub/splash
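
A quick check that the container actually answers on port 8050 can rule out one failure mode before pointing Scrapy at it; a minimal sketch, assuming the requests package is installed and using example.com as a stand-in target:

# sanity_check.py -- hit Splash's render.html endpoint directly.
import requests

resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://example.com", "wait": 0.5},
)
# Expect 200, with the rendered HTML in resp.text.
print(resp.status_code)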

Spider code:

import scrapy

from scrapy import Selector
from scrapy_splash import SplashRequest


class VestiaireSpider(scrapy.Spider):
    name = "vestiaire"
    base_url = "https://www.vestiairecollective.com"
    rotate_user_agent = True

    def start_requests(self):
        urls = ["https://www.vestiairecollective.com/men-clothing/jeans/"]
        for url in urls:
            # Splash arguments belong in the args= parameter, not in meta
            yield SplashRequest(url=url, callback=self.parse, args={"wait": 0.5})

    def parse(self, response):
        data = Selector(response)
        category_name = data.xpath('//h1[@class="campaign campaign-title clearfix"]/text()').extract_first().strip()
        self.log(category_name)

Then I run the spider:

scrapy crawl test

And I get back a 403 for the requested URL:


2017-12-19 22:55:17 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: crawlers)
2017-12-19 22:55:17 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'CONCURRENT_REQUESTS': 10, 'NEWSPIDER_MODULE': 'crawlers.spiders', 'SPIDER_MODULES': ['crawlers.spiders'], 'ROBOTSTXT_OBEY': True, 'COOKIES_ENABLED': False, 'BOT_NAME': 'crawlers', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage'}
2017-12-19 22:55:17 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.corestats.CoreStats']
2017-12-19 22:55:17 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy_splash.SplashCookiesMiddleware', 'scrapy_splash.SplashMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-19 22:55:17 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy_splash.SplashDeduplicateArgsMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-19 22:55:17 [scrapy.middleware] INFO: Enabled item pipelines: ['scrapy.pipelines.images.ImagesPipeline']
2017-12-19 22:55:17 [scrapy.core.engine] INFO: Spider opened
2017-12-19 22:55:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-19 22:55:17 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-19 22:55:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vestiairecollective.com/robots.txt> (referer: None)
2017-12-19 22:55:22 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://localhost:8050/robots.txt> (referer: None)
2017-12-19 22:55:23 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.vestiairecollective.com/men-clothing/jeans/ via http://localhost:8050/render.html> (referer: None)
2017-12-19 22:55:23 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.vestiairecollective.com/men-clothing/jeans/>: HTTP status code is not handled or not allowed
2017-12-19 22:55:23 [scrapy.core.engine] INFO: Closing spider (finished)
2017-12-19 22:55:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 1254, 'downloader/request_count': 3, 'downloader/request_method_count/GET': 2, 'downloader/request_method_count/POST': 1, 'downloader/response_bytes': 2793, 'downloader/response_count': 3, 'downloader/response_status_count/200': 1, 'downloader/response_status_count/403': 2, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 12, 19, 20, 55, 23, 440598), 'httperror/response_ignored_count': 1, 'httperror/response_ignored_status_count/403': 1, 'log_count/DEBUG': 4, 'log_count/INFO': 8, 'memusage/max': 53850112, 'memusage/startup': 53850112, 'response_received_count': 3, 'scheduler/dequeued': 2, 'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2, 'scheduler/enqueued/memory': 2, 'splash/render.html/request_count': 1, 'splash/render.html/response_count/403': 1, 'start_time': datetime.datetime(2017, 12, 19, 20, 55, 17, 372080)}
2017-12-19 22:55:23 [scrapy.core.engine] INFO: Spider closed (finished)

Recommended answer

The problem was the User-Agent header: many sites require a browser-like one for access. The easiest way to access the site and avoid a ban is to randomize the User-Agent, e.g. with this library: https://github.com/cnu/scrapy-random-useragent
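
A minimal sketch of that setup, assuming the middleware path and the USER_AGENT_LIST setting still match the scrapy-random-useragent README (worth re-checking against the current version of the library):

# settings.py -- hand User-Agent selection to a randomizing middleware.
DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in middleware so it doesn't overwrite the header.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # Path and priority as documented by scrapy-random-useragent (assumption).
    'random_useragent.RandomUserAgentMiddleware': 400,
}
# Plain-text file with one User-Agent string per line (hypothetical path).
USER_AGENT_LIST = "/path/to/useragents.txt"

Note that this dict must be merged with the scrapy-splash middlewares shown above, not swapped in for them. If adding a dependency is undesirable, a single hard-coded browser User-Agent often clears the 403 as well; a sketch of the spider's start_requests, using an example UA string and assuming scrapy-splash forwards request headers to Splash as its README describes:

    def start_requests(self):
        for url in ["https://www.vestiairecollective.com/men-clothing/jeans/"]:
            # Send a browser-like User-Agent along with the Splash request.
            yield SplashRequest(
                url=url,
                callback=self.parse,
                args={"wait": 0.5},
                headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) "
                                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                                       "Chrome/63.0.3239.84 Safari/537.36"},
            )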
