How to check if the value on a website has changed

Question

Basically I'm trying to run some code (Python 3.2) if a value on a website changes, otherwise wait for a bit and check it later.

First I thought I could just save the value in a variable and compare it to the new value that was fetched the next time the script would run. But that quickly ran into problems as the value was overwritten when the script would run again and initialize that variable.

So then I tried just saving the html of the webpage as a file and then comparing it to the html that would be called on the next time the script ran. No luck there either as it kept coming up False even when there were no changes.

Next up was pickling the webpage and then trying to compare it with the html. Interestingly, that didn't work within the script either. BUT, if I type file = pickle.load( open( 'D:\Download\htmlString.p', 'rb')) at the prompt after the script has run and then file == html, it shows True when there haven't been any changes.

I'm a bit confused as to why it won't work when the script runs but if I do the above it shows the correct answer.

Edit: Thanks for the responses so far, guys. My question wasn't really about other ways to go about this (although it's always good to learn more ways to accomplish a task!) but rather why the code below doesn't work when it's run as a script, whereas if I reload the pickle object at the prompt after the script has run and then test it against the html, it returns True if there haven't been any changes.

try: 
    file = pickle.load( open( 'D:\\Download\\htmlString.p', 'rb'))
    if pickle.load( open( 'D:\\Download\\htmlString.p', 'rb')) == htmlString:
        print("Values haven't changed!")
        sys.exit(0)
    else:
        pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )  
        print('Saving')
except: 
    pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )
    print('ERROR')


Recommended answer

Edit: I hadn't realised you were just looking for the problem with your script. Here's what I think is the problem, followed by my original answer which addresses another approach to the bigger problem you're trying to solve.

Your script is a great example of the dangers of using a blanket except statement: you catch everything. Including, in this case, your sys.exit(0).
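
To see the effect in isolation, here is a minimal sketch (the two function names are made up for illustration). sys.exit() works by raising SystemExit, so a bare except swallows it, while except IOError lets it propagate and end the script.

```python
import sys

def exit_inside_bare_except():
    """Hypothetical helper: call sys.exit() under a bare except."""
    try:
        sys.exit(0)          # raises SystemExit
    except:                  # a bare except catches SystemExit as well
        return 'caught'

def exit_inside_specific_except():
    """Hypothetical helper: same call, but only IOError is caught."""
    try:
        sys.exit(0)          # SystemExit is not an IOError, so it propagates
    except IOError:
        return 'caught'
```

In the script from the question, this is why the "Values haven't changed!" branch never actually exits: the SystemExit lands in the except block, which rewrites the pickle file and prints ERROR.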

I'm assuming your try block is there to catch the case where D:\Download\htmlString.p doesn't exist yet. That error is called IOError, and you can catch it specifically with except IOError.

Here is your script, plus a bit of code before it to make it run, with the except issue fixed:

import sys
import pickle
import urllib.request  # Python 3 name for urllib2

request = urllib.request.Request('http://www.iana.org/domains/example/')
response = urllib.request.urlopen(request) # Make the request
htmlString = response.read()

try:
    # Load the previously saved page once and close the file again
    with open('D:\\Download\\htmlString.p', 'rb') as f:
        saved = pickle.load(f)
    if saved == htmlString:
        print("Values haven't changed!")
        sys.exit(0)
    else:
        with open('D:\\Download\\htmlString.p', 'wb') as f:
            pickle.dump(htmlString, f)
        print('Saving')
except IOError:
    # The pickle file doesn't exist yet, so create it
    with open('D:\\Download\\htmlString.p', 'wb') as f:
        pickle.dump(htmlString, f)
    print('Created new file.')

As a side note, you might consider using os.path for your file paths -- it will help anyone later who wants to use your script on another platform, and it saves you the ugly double back-slashes.
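
As a rough illustration of that side note, the path could be built like this. The "Download" folder under the user's home directory is an assumed location, not the one from the question.

```python
import os.path

# Assumed location: a "Download" folder under the user's home directory.
download_dir = os.path.join(os.path.expanduser('~'), 'Download')

# os.path.join inserts the right separator for the current platform,
# so no hand-written back-slashes are needed.
pickle_path = os.path.join(download_dir, 'htmlString.p')
print(pickle_path)
```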

Edit 2: Adapted for your specific URL.

There is a dynamically-generated number for the ads on that page which changes with each page-load. It's right near the end after all the content, so we can just split the HTML string at that point and take the first half, discarding the part with the dynamic number.

import sys
import pickle
import urllib.request  # Python 3 name for urllib2

request = urllib.request.Request('http://ecal.forexpros.com/e_cal.php?duration=weekly')
response = urllib.request.urlopen(request) # Make the request
# Grab everything before the dynamic double-click link
# (read() returns bytes in Python 3, so the separator is a bytes literal)
htmlString = response.read().split(b'<iframe src="http://fls.doubleclick')[0]

try:
    # Pickle files must be opened in binary mode ('rb' / 'wb')
    with open('D:\\Download\\htmlString.p', 'rb') as f:
        saved = pickle.load(f)
    if saved == htmlString:
        print("Values haven't changed!")
        sys.exit(0)
    else:
        with open('D:\\Download\\htmlString.p', 'wb') as f:
            pickle.dump(htmlString, f)
        print('Saving')
except IOError:
    # The pickle file doesn't exist yet, so create it
    with open('D:\\Download\\htmlString.p', 'wb') as f:
        pickle.dump(htmlString, f)
    print('Created new file.')

Your string is no longer a valid html document, if that matters. If it does, you could remove just that one line instead. There is probably a more elegant way of doing this, perhaps deleting the number with a regex, but this at least answers your question.
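
A sketch of the regex idea, assuming the dynamic number shows up as an ord=<digits> parameter in the doubleclick URL. That pattern and the two sample strings are guesses for illustration only; check the page's actual HTML before relying on them.

```python
import re

# Assumed pattern: the changing number looks like "ord=1234567890".
DYNAMIC_NUMBER = re.compile(r'ord=\d+')

def strip_dynamic_number(html):
    """Replace the per-load number with a fixed placeholder so that two
    downloads of an otherwise unchanged page compare equal."""
    return DYNAMIC_NUMBER.sub('ord=0', html)

# Made-up sample pages that differ only in the dynamic number.
first = '<p>content</p><iframe src="http://fls.doubleclick.example/x;ord=111?"></iframe>'
second = '<p>content</p><iframe src="http://fls.doubleclick.example/x;ord=222?"></iframe>'
print(strip_dynamic_number(first) == strip_dynamic_number(second))
```

This keeps the document intact instead of cutting it off. If the page is handled as bytes rather than str, the pattern and the replacement would need to be bytes literals as well.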

Original Answer -- an alternate approach to your problem.

What do the response headers look like from the web server? HTTP specifies a Last-Modified property that you could use to check if the content has changed (assuming the server tells the truth). Use this one with a HEAD request as Uku showed in his answer, if you'd like to conserve bandwidth and be nice to the server you're polling.
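
Uku's answer isn't reproduced here, so the following is only a sketch of what a HEAD request might look like in Python 3. The function names are made up, and the method argument to Request needs Python 3.3 or later.

```python
import urllib.request

def build_head_request(url):
    """Return a request that asks the server for headers only (no body)."""
    return urllib.request.Request(url, method='HEAD')

def fetch_last_modified(url):
    """Send the HEAD request and return the Last-Modified header, or None
    if the server doesn't provide one."""
    response = urllib.request.urlopen(build_head_request(url))
    try:
        return response.headers.get('Last-Modified')
    finally:
        response.close()
```

Calling fetch_last_modified(url) on each run and comparing the result with the value stored from the previous run would tell you whether the server claims the page has changed, without downloading the page itself.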

And there is also an If-Modified-Since header, which sounds like what you might be looking for.

If we combine them, you might come up with something like this:

import sys
import os.path
import urllib.request  # Python 3 name for urllib2
import urllib.error

url = 'http://www.iana.org/domains/example/'
saved_time_file = 'last time check.txt'

request = urllib.request.Request(url)
if os.path.exists(saved_time_file):
    # If we've previously stored a time, get it and add it to the request
    with open(saved_time_file, 'r') as f:
        last_time = f.read()
    request.add_header("If-Modified-Since", last_time)

try:
    response = urllib.request.urlopen(request) # Make the request
except urllib.error.HTTPError as err:
    if err.code == 304:
        print("Nothing new.")
        sys.exit(0)
    raise   # some other http error (like 404 not found etc); re-raise it.

last_modified = response.info().get('Last-Modified', False)
if last_modified:
    with open(saved_time_file, 'w') as f:
        f.write(last_modified)
else:
    print("Server did not provide a last-modified property. Continuing...")
    # Alternately, you could save the current time in HTTP-date format here:
    # http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.3
    # This might work for some servers that don't provide Last-Modified, but do
    # respect If-Modified-Since.

# You should get here if the server won't confirm the content is old.
# Hopefully that means it's new.
# HTML should be in response.read().

Also check out this blog post by Stii which may provide some inspiration. I don't know enough about ETags to have put them in my example, but his code checks for them as well.
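
This is not Stii's code, just a rough sketch of how an ETag check might be wired up. HTTP pairs the ETag response header with the If-None-Match request header; the function name and the idea of storing the ETag between runs are assumptions for illustration.

```python
import urllib.request

def build_etag_request(url, saved_etag=None):
    """Build a GET request that sends a previously saved ETag, if any."""
    request = urllib.request.Request(url)
    if saved_etag:
        # A server that supports ETags should answer 304 Not Modified
        # when the saved value still matches the current content.
        request.add_header('If-None-Match', saved_etag)
    return request
```

As with If-Modified-Since, urlopen raises an HTTPError with code 304 when nothing has changed. On a normal response, the new value (if the server sends one) is available from response.info().get('ETag') and can be saved for the next run.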
