Checking a URL for a 404 error (Scrapy)
Question
I'm going through a set of pages, and I'm not certain how many there are, but the current page is represented by a simple number in the URL (e.g. "http://www.website.com/page/1").
I would like to use a for loop in Scrapy to increment the current page guess and stop when it reaches a 404. I know the response returned from the request contains this information, but I'm not sure how to automatically get a response from a request.
Any ideas on how to do this?
Currently my code is something along the lines of:
def start_requests(self):
    baseUrl = "http://website.com/page/"
    currentPage = 0
    stillExists = True
    while(stillExists):
        currentUrl = baseUrl + str(currentPage)
        test = Request(currentUrl)
        if test.response.status != 404:  # This is what I'm not sure of
            yield test
            currentPage += 1
        else:
            stillExists = False
Answer
You can do something like the following:
from __future__ import print_function
import urllib2

baseURL = "http://www.website.com/page/"

for n in xrange(100):
    fullURL = baseURL + str(n)
    try:
        resp = urllib2.urlopen(urllib2.Request(fullURL))
        # Page exists -- do your normal processing here.
        print("URL: {0} Response: {1}".format(fullURL, resp.getcode()))
    except urllib2.HTTPError as e:
        # urllib2.urlopen raises HTTPError for 4xx/5xx responses,
        # so a 404 must be caught here rather than read from getcode().
        if e.code == 404:
            # Do whatever you want when a 404 is found.
            print("404 Found!")
        else:
            print("HTTP error {0} for URL: {1}".format(e.code, fullURL))
    except urllib2.URLError:
        print("Could not connect to URL: {0}".format(fullURL))
This iterates through the range and attempts to connect to each URL via urllib2. I don't know Scrapy or how your example function opens the URL, but this is an example of how to do it via urllib2.
Note that most sites that use this type of URL format are typically running a CMS that automatically redirects non-existent pages to a custom "404 - Not Found" page, which will still show up with an HTTP status code of 200. In that case, the best way to detect a page that loads but is really just the custom 404 page is to do some screen scraping: look for text that would not appear during a "normal" page response, such as "Page not found" or anything else unique to the custom 404 page.
这篇关于检查网址是否出现404错误(scrapy)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!