webpage returns 405 status code error when accessed with scrapy


Question


I am trying to scrape the below URL with scrapy -

https://www.realtor.ca/Residential/Single-Family/18279532/78-80-BURNDEAN-Court-Richmond-Hill-Ontario-L4C0K1-Westbrook#v=n

But it always ends up giving a status 405 error. I have searched this topic, and the usual explanation is that the error occurs when the request method is incorrect, e.g. POST in place of GET. But that is surely not the case here.

Here is my spider code -

import scrapy

class sampleSpider(scrapy.Spider):
    AUTOTHROTTLE_ENABLED = True
    name = 'test'
    start_urls = ['https://www.realtor.ca/Residential/Single-Family/18279532/78-80-BURNDEAN-Court-Richmond-Hill-Ontario-L4C0K1-Westbrook#v=n']

    def parse(self, response):
        yield {
            'response': response.body_as_unicode(),
        }

And here is the log I get when I run the scraper -

PS D:\> scrapy runspider tst.py -o tst.csv
2017-06-26 19:20:49 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: scrapybot)
2017-06-26 19:20:49 [scrapy.utils.log] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'tst.csv'}
2017-06-26 19:20:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-06-26 19:20:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-26 19:20:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-26 19:20:50 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-06-26 19:20:50 [scrapy.core.engine] INFO: Spider opened
2017-06-26 19:20:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min
)
2017-06-26 19:20:50 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-06-26 19:20:51 [scrapy.core.engine] DEBUG: Crawled (405) <GET https://www.realtor.ca/Residential/Single-Family/1827
9532/78-80-BURNDEAN-Court-Richmond-Hill-Ontario-L4C0K1-Westbrook#v=n> (referer: None)
2017-06-26 19:20:51 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 https://www.realtor.ca/Residential
/Single-Family/18279532/78-80-BURNDEAN-Court-Richmond-Hill-Ontario-L4C0K1-Westbrook>: HTTP status code is not handled or
 not allowed
2017-06-26 19:20:51 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-26 19:20:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 306,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 9360,
 'downloader/response_count': 1,
 'downloader/response_status_count/405': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 6, 26, 13, 50, 51, 432000),
 'log_count/DEBUG': 2,
 'log_count/INFO': 8,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 6, 26, 13, 50, 50, 104000)}
2017-06-26 19:20:51 [scrapy.core.engine] INFO: Spider closed (finished)

Any help will be very much appreciated. Thank you in advance.

Solution

I encountered a similar problem trying to scrape www.funda.nl and solved it by

  1. changing the user agent (using https://pypi.org/project/scrapy-random-useragent/),
  2. using Scrapy Splash.

This may work for the website you're trying to scrape as well (although I haven't tested this).
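For reference, the user-agent part of the fix can also be done without extra middleware, by overriding the `USER_AGENT` setting per spider via `custom_settings`. A minimal sketch (the User-Agent string below is just an illustrative browser-like value, not a required one, and I haven't verified it against this particular site):

```python
import scrapy

class SampleSpider(scrapy.Spider):
    name = 'test'
    start_urls = [
        'https://www.realtor.ca/Residential/Single-Family/18279532/'
        '78-80-BURNDEAN-Court-Richmond-Hill-Ontario-L4C0K1-Westbrook#v=n'
    ]
    # Per-spider settings override: send a browser-like User-Agent instead
    # of Scrapy's default "Scrapy/x.y (+https://scrapy.org)" header, which
    # some servers reject outright.
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/58.0.3029.110 Safari/537.36'),
        'AUTOTHROTTLE_ENABLED': True,
    }

    def parse(self, response):
        # response.text is the newer spelling of response.body_as_unicode()
        yield {'response': response.text}
```

Note that `AUTOTHROTTLE_ENABLED` set as a bare class attribute (as in the question's code) has no effect; settings overrides belong in `custom_settings` or the project settings file. If the site also renders content with JavaScript, the Splash route is still needed on top of this.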
