检测网页是否被更改 [英] detect if a web page is changed

查看:63
本文介绍了检测网页是否被更改的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在我的 python 应用程序中,我必须阅读许多网页来收集数据.为了减少 http 调用,我只想获取更改的页面.我的问题是我的代码总是告诉我页面已更改(代码 200),但实际上并没有.

In my python application I have to read many web pages to collect data. To decrease the http calls I would like to fetch only changed pages. My problem is that my code always tells me that the pages have been changed (code 200) but in reality it is not.

这是我的代码:

from models import mytab
import re
import urllib2
from wsgiref.handlers import format_date_time
from datetime import datetime
from time import mktime

def url_change():
    urls = mytab.objects.all()
    # this is some urls:
    # http://www.venere.com/it/pensioni/venezia/pensione-palazzo-guardi/#reviews
    # http://www.zoover.it/italia/sardegna/cala-gonone/san-francisco/hotel
    # http://www.orbitz.com/hotel/Italy/Venice/Palazzo_Guardi.h161844/#reviews
    # http://it.hotels.com/ho292636/casa-del-miele-susegana-italia/
    # http://www.expedia.it/Venezia-Hotel-Palazzo-Guardi.h1040663.Hotel-Information#reviews
    # ...

    for url in urls:
        request = urllib2.Request(url.url)
        if url.last_date == None:
            now = datetime.now()
            stamp = mktime(now.timetuple())
            url.last_date = format_date_time(stamp)
            url.save()

        request.add_header("If-Modified-Since", url.last_date)

        try:
            response = urllib2.urlopen(request) # Make the request
            # some actions
            now = datetime.now()
            stamp = mktime(now.timetuple())
            url.last_date = format_date_time(stamp)
            url.save()
        except urllib2.HTTPError, err:
            if err.code == 304:
                print "nothing...."
            else:
                print "Error code:", err.code 
                pass

我不明白出了什么问题.有人可以帮我吗?

I do not understand what has gone wrong. Can anyone help me?

推荐答案

当您发送If-Modified-Since"标头时,Web 服务器不需要发送 304 标头作为响应.他们可以随意发送 HTTP 200 并再次发送整个页面.

Web servers aren't required to send a 304 header as the response when you send an 'If-Modified-Since' header. They're free to send a HTTP 200 and send the entire page again.

发送If-Modified-Since"或If-None-Since"会提醒服务器您需要缓存响应(如果可用).这就像发送一个Accept-Encoding: gzip, deflate"标头——你只是告诉服务器你会接受某些东西,而不是要求它.

Sending a 'If-Modified-Since' or 'If-None-Since' alerts the server that you'd like a cached response if available. It's like sending an 'Accept-Encoding: gzip, deflate' header -- you're just telling the server you'll accept something, not requiring it.

这篇关于检测网页是否被更改的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆