Error 403: HTTP status code is not handled or not allowed in Scrapy
Question
This is the code I have written to scrape the justdial website.
import scrapy
from scrapy.http.request import Request

class JustdialSpider(scrapy.Spider):
    name = 'justdial'
    # handle_httpstatus_list = [400]
    # headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
    # handle_httpstatus_list = [403, 404]
    allowed_domains = ['justdial.com']
    start_urls = ['https://www.justdial.com/Delhi-NCR/Chemists/page-1']

    # def start_requests(self):
    #     headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
    #     for url in self.start_urls:
    #         self.log("I just visited :---------------------------------- " + url)
    #         yield Request(url, headers=headers)

    def parse(self, response):
        self.log("I just visited the site:---------------------------------------------- " + response.url)
        urls = response.xpath('//a/@href').extract()
        self.log("Urls-------: " + str(urls))
This is the error shown in the terminal:
2017-08-18 18:32:25 [scrapy.core.engine] INFO: Spider opened
2017-08-18 18:32:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-18 18:32:25 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in D:scrapyjustdial.scrapyhttpcache
2017-08-18 18:32:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/robots.txt> (referer: None) ['cached']
2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/Delhi-NCR/Chemists/page-1> (referer: None) ['cached']
2017-08-18 18:32:25 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.justdial.com/Delhi-NCR/Chemists/page-1>: HTTP status code is not handled or not allowed
I have seen similar questions on Stack Overflow and tried everything suggested; you can see what I tried in the commented-out parts of the code:

changed the User-Agent

set handle_httpstatus_list = [400]
Note: This website (https://www.justdial.com/Delhi-NCR/Chemists/page-1) is not blocked on my system; when I open it in Chrome/Mozilla, it loads fine. I get the same error with the (https://www.practo.com/bangalore#doctor-search) site as well.
Answer
When you set the user agent using the user_agent spider attribute, it starts to work. Setting the request headers alone is probably not enough, because they get overridden by the default user-agent string. So set the spider attribute

user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"

(the same way you set start_urls) and try it.