Checking a URL for a 404 error in Scrapy


Problem Description

I'm going through a set of pages and I'm not certain how many there are, but the current page is represented by a simple number in the URL (e.g. "http://www.website.com/page/1").

I would like to use a loop in Scrapy to increment the current guess at the page number and stop when it reaches a 404. I know the response returned from the request contains this information, but I'm not sure how to automatically get a response from a request.

Any ideas on how to do this?

Currently my code is something along the lines of:

from scrapy import Request  # import needed for Request below

def start_requests(self):
    baseUrl = "http://website.com/page/"
    currentPage = 0
    stillExists = True
    while stillExists:
        currentUrl = baseUrl + str(currentPage)
        test = Request(currentUrl)
        if test.response.status != 404:  # This is what I'm not sure of
            yield test
            currentPage += 1
        else:
            stillExists = False

Recommended Answer

You can do something like the following:

from __future__ import print_function
import urllib2

baseURL = "http://www.website.com/page/"

for n in xrange(100):
    fullURL = baseURL + str(n)
    try:
        req = urllib2.Request(fullURL)
        resp = urllib2.urlopen(req)
        # Page was found; do your normal processing here.
        print("URL: {0} Response: {1}".format(fullURL, resp.getcode()))
    except urllib2.HTTPError as e:
        # urlopen raises HTTPError for non-2xx statuses, so a 404
        # surfaces here rather than in resp.getcode().
        if e.code == 404:
            # Do whatever you want if a 404 is found.
            print("404 Found!")
        else:
            print("URL: {0} Response: {1}".format(fullURL, e.code))
    except urllib2.URLError:
        # Could not reach the server at all.
        print("Could not connect to URL: {0}".format(fullURL))

This iterates through the range and attempts to connect to each URL via urllib2. I don't know Scrapy or how your example function opens the URL, but this is an example of how to do it via urllib2. Note that urlopen raises HTTPError for a 404 rather than returning it, which is why the status check lives in the except block.
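
If you would rather stay inside Scrapy itself, a minimal sketch of the same idea is shown below. It is an untested outline, not the definitive answer: the spider name and starting URL are assumptions, and it relies on Scrapy's handle_httpstatus_list attribute so that 404 responses reach the callback instead of being dropped by the HttpErrorMiddleware:

import scrapy

class PageSpider(scrapy.Spider):
    name = "pages"  # hypothetical spider name
    base_url = "http://www.website.com/page/"
    # Allow 404 responses through to the callback instead of
    # letting Scrapy's HttpErrorMiddleware filter them out.
    handle_httpstatus_list = [404]

    def start_requests(self):
        # Start at page 1; later pages are requested from parse().
        yield scrapy.Request(self.base_url + "1", meta={"page": 1})

    def parse(self, response):
        if response.status == 404:
            # First missing page reached; stop issuing requests.
            self.logger.info("Got 404 at %s, stopping", response.url)
            return
        # ... scrape whatever you need from `response` here ...
        next_page = response.meta["page"] + 1
        yield scrapy.Request(self.base_url + str(next_page),
                             meta={"page": next_page})

Because each request is only issued after the previous response comes back, the spider naturally stops at the first 404 instead of needing an upper bound guessed up front.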

Note that most sites that use this type of URL format are running a CMS that automatically redirects non-existent pages to a custom "404 Not Found" page, which is still served with an HTTP status code of 200. In that case, the best way to detect a page that loads but is really just the custom 404 page is to do some screen scraping: look for content that would not appear on a "normal" page, such as text saying "Page not found" or anything else unique to the custom 404 page.
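
As a rough illustration of that screen-scraping check, a helper like the one below could be dropped into the Scrapy callback. The marker text "Page not found" is purely an assumption; inspect the site's actual custom error page and pick a string unique to it:

def looks_like_soft_404(response):
    # A "soft 404": the server answers 200 OK, but the body is the
    # site's custom error page. The marker below is an assumption;
    # replace it with text unique to the real error page.
    return b"Page not found" in response.body

In the Scrapy sketch above you would then treat looks_like_soft_404(response) the same way as response.status == 404 and stop paginating.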
