HTTP 403 Responses when using Python Scrapy


Question

I am using the Python.org build of Python 2.7 (64-bit) on Windows Vista 64-bit. I have been testing the following Scrapy code to recursively scrape all the pages of www.whoscored.com, a football statistics site:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags


class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/"]
    rules = [Rule(SgmlLinkExtractor(allow=()), 
                  follow=True),
             Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
    ]
    def parse_item(self,response):
        self.log('A response from %s just arrived!' % response.url)
        scripts = response.selector.xpath("normalize-space(//title)")
        for scripts in scripts:
            body = response.xpath('//p').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')  


execute(['scrapy','crawl','goal3'])

The code executes without any errors. However, of the 4623 pages scraped, 217 got an HTTP response code of 200, 2 got a 302 and 4404 got a 403. Can anyone see anything immediately obvious in the code that would explain this? Could this be an anti-scraping measure from the site? Is it usual practice to slow down the rate of requests to stop this happening?

Thanks

Recommended answer

HTTP status code 403 means Forbidden / Access Denied: the server understood the request but refuses to serve it.

HTTP status code 302 is a redirect. There is no need to worry about those responses.

Nothing seems to be wrong in your code.
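
To make the breakdown in the question concrete, here is a small sketch that maps each reported status code to its standard reason phrase and shows its share of the crawl. It is written for Python 3 (the question runs on Python 2.7, where `http.HTTPStatus` is not available), and the `counts` dictionary simply restates the numbers from the question.

```python
# Python 3 sketch; the question itself runs on Python 2.7.
from http import HTTPStatus

# Response counts reported in the question, keyed by HTTP status code.
counts = {200: 217, 302: 2, 403: 4404}
total = sum(counts.values())


def describe(code):
    """Return the standard reason phrase for an HTTP status code."""
    return HTTPStatus(code).phrase


# Print one summary line per status code, with its share of all responses.
for code, n in sorted(counts.items()):
    share = 100.0 * n / total
    print("%d %s: %d responses (%.1f%%)" % (code, describe(code), n, share))
```

The overwhelming share of 403 responses is what points to the server refusing the crawler rather than to a problem in the spider's logic.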

Yes, this is definitely an anti-scraping measure implemented by the site.

Refer to these guidelines from the Scrapy docs: Avoid Getting Banned
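
As a rough illustration of what those guidelines look like in practice, the following is a minimal sketch of a project `settings.py`. The setting names are standard Scrapy settings, but every value here (the delays, the User-Agent string) is an illustrative assumption rather than something tuned for whoscored.com, and none of it guarantees the 403 responses will stop.

```python
# settings.py (sketch): slow the crawl down and look less like a default bot.
# All values below are illustrative assumptions, not recommendations for this site.

# Replace Scrapy's default User-Agent with a browser-like string (example value).
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
)

# Wait between consecutive requests to the same site (seconds).
DOWNLOAD_DELAY = 2
# Vary the actual wait around DOWNLOAD_DELAY so requests are not evenly spaced.
RANDOMIZE_DOWNLOAD_DELAY = True

# Send only one request at a time to each domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Some sites use cookies to spot crawler behaviour; turn them off.
COOKIES_ENABLED = False

# Let the AutoThrottle extension adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
```

Starting with a conservative delay and a single concurrent request, then loosening the limits only if the 403 rate drops, is usually safer than the reverse.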

Also, you should consider pausing and resuming the crawl.
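
Scrapy supports pausing and resuming through its `JOBDIR` setting, which persists the crawl state in a directory. The sketch below builds the argument list for the same `execute(...)` call the question already uses; the helper `crawl_args` and the directory name `crawls/goal3-1` are hypothetical names chosen for illustration.

```python
def crawl_args(spider_name, job_dir):
    """Build the argv for `scrapy crawl` with a persistent job directory.

    `-s JOBDIR=<dir>` tells Scrapy to store the crawl state in <dir>,
    so a crawl that was stopped can be resumed by running the same command.
    """
    return ["scrapy", "crawl", spider_name, "-s", "JOBDIR=%s" % job_dir]


# Hypothetical job directory; reuse the same path to resume the same crawl.
args = crawl_args("goal3", "crawls/goal3-1")

# In the question's script this list would replace the existing call:
#   from scrapy.cmdline import execute
#   execute(args)
```

To pause, stop the crawl gracefully (a single Ctrl-C) and later run it again with the same `JOBDIR`; using a different directory starts a fresh crawl.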
