Scrapy and response status code: how to check against it?


Problem description

I'm using Scrapy to crawl my sitemap, to check for 404, 302 and 200 pages, but I can't seem to get the response code. This is my code so far:

from scrapy.contrib.spiders import SitemapSpider


class TothegoSitemapHomesSpider(SitemapSpider):
    name = 'tothego_homes_spider'

    ## stuff we need for tothego ##
    sitemap_urls = []
    ok_log_file = '/opt/Workspace/myapp/crawler/valid_output/ok_homes'
    bad_log_file = '/opt/Workspace/myapp/crawler/bad_homes'
    fourohfour = '/opt/Workspace/myapp/crawler/404/404_homes'

    def __init__(self, **kwargs):
        SitemapSpider.__init__(self)

        if len(kwargs) > 1:
            if 'domain' in kwargs:
                self.sitemap_urls = ['http://url_to_sitemap%s/sitemap.xml' % kwargs['domain']]

            if 'country' in kwargs:
                self.ok_log_file += "_%s.txt" % kwargs['country']
                self.bad_log_file += "_%s.txt" % kwargs['country']
                self.fourohfour += "_%s.txt" % kwargs['country']

        else:
            print """USAGE: scrapy [crawler_name] -a country=[country] -a domain=[domain]
With [crawler_name]:
- tothego_homes_spider
- tothego_cars_spider
- tothego_jobs_spider
"""
            exit(1)

    def parse(self, response):
        try:
            if response.status == 404:
                ## 404s are also tracked separately
                self.append(self.bad_log_file, response.url)
                self.append(self.fourohfour, response.url)

            elif response.status == 200:
                ## write to ok_log_file
                self.append(self.ok_log_file, response.url)
            else:
                self.append(self.bad_log_file, response.url)

        except Exception, e:
            self.log('[exception]: %s' % e)

    def append(self, file, string):
        f = open(file, 'a')
        f.write(string + "\n")
        f.close()

From Scrapy's docs, the response.status attribute is an integer corresponding to the status code of the response. So far it logs only the 200-status URLs, while the 302s aren't written to the output file (although I can see the redirects in crawl.log). So, what do I have to do to "trap" the 302 requests and save those URLs?

Recommended answer

http://readthedocs.org/docs/scrapy/en/latest/topics/spider-middleware.html#module-scrapy.contrib.spidermiddleware.httperror

Assuming the default spider middleware is enabled, response codes outside the 200-300 range are filtered out by HttpErrorMiddleware. You can tell the middleware that you want to handle 404s by setting the handle_httpstatus_list attribute on your spider.

class TothegoSitemapHomesSpider(SitemapSpider):
    handle_httpstatus_list = [404]
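
Since the question is also about 302s, here is a minimal sketch that lets both 404 and 302 responses through to the callback. It assumes the same old scrapy.contrib layout as the code above; the spider name and sitemap URL are placeholders:

from scrapy.contrib.spiders import SitemapSpider


class StatusCheckSpider(SitemapSpider):
    name = 'status_check_spider'                       # placeholder name
    sitemap_urls = ['http://example.com/sitemap.xml']  # placeholder URL

    # Statuses listed here are passed to the callback instead of
    # being dropped by HttpErrorMiddleware.
    handle_httpstatus_list = [404, 302]

    def parse(self, response):
        # response.status is a plain integer: 200, 302, 404, ...
        self.log('%d %s' % (response.status, response.url))

As far as I can tell, the redirect middleware also leaves a status alone when it appears in handle_httpstatus_list, so a listed 302 is delivered to parse() as-is instead of being followed; if that doesn't hold in your Scrapy version, setting dont_redirect in the request meta is the usual way to stop redirects from being consumed before they reach your spider.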

