Checking a URL for a 404 error (Scrapy)
Question
I'm going through a set of pages, and I'm not certain how many there are, but the current page is represented by a simple number in the URL (e.g. "http://www.website.com/page/1").
I would like to use a for loop in Scrapy to increment the current page guess and stop when it reaches a 404. I know the response returned from the request contains this information, but I'm not sure how to automatically get a response from a request.
Any ideas on how to do this?
Currently my code is something along the lines of:
def start_requests(self):
    baseUrl = "http://website.com/page/"
    currentPage = 0
    stillExists = True
    while(stillExists):
        currentUrl = baseUrl + str(currentPage)
        test = Request(currentUrl)
        if test.response.status != 404:  # This is what I'm not sure of
            yield test
            currentPage += 1
        else:
            stillExists = False
Answer
You can do something like the following:
from __future__ import print_function
import urllib2

baseURL = "http://www.website.com/page/"

for n in xrange(100):
    fullURL = baseURL + str(n)
    try:
        resp = urllib2.urlopen(urllib2.Request(fullURL))
        # Page exists -- do your normal processing here.
        print("URL: {0} Response: {1}".format(fullURL, resp.getcode()))
    except urllib2.HTTPError as e:
        # urllib2.urlopen raises HTTPError for 4xx/5xx responses,
        # so a 404 must be caught here rather than read from getcode().
        if e.code == 404:
            # Do whatever you want when a 404 is found.
            print("404 Found!")
        else:
            print("HTTP error {0} for URL: {1}".format(e.code, fullURL))
    except urllib2.URLError:
        print("Could not connect to URL: {0}".format(fullURL))
This iterates through the range and attempts to connect to each URL via urllib2. I don't know Scrapy or how your example function opens the URL, but this is an example of how to do it via urllib2.
Note that most sites that use this type of URL format are typically running a CMS that automatically redirects non-existent pages to a custom "404 - Not Found" page, which will still show up with an HTTP status code of 200. In that case, the best way to detect a page that loads but is really just the custom 404 page is to do some screen scraping: look for text that would not appear during a "normal" page response, such as "Page not found" or anything else unique to the custom 404 page.
这篇关于检查网址是否出现404错误(scrapy)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!