How to tell when a web page has changed by x% in VB.net?


Question

I'm trying to write a little utility that periodically checks a web page (could be any URL) and tells me if and when its content has changed. I've read the other postings, but as far as I can tell they don't really answer my question.

I know that static pages have a Last-Modified header. But what about dynamic pages? I got Oli's comment that storing a hash of the contents works, but that's not really ideal, because the page might simply have a timestamp on it (the date and time the page was produced). In that case the content would be different on every single request even though nothing significant has changed.
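To make the hashing approach concrete, here is a minimal sketch, assuming SHA-256 over the UTF-8 bytes of the page source. The `PageHash` class and `Sha256Hex` method are illustrative names, not something from the original comment. Any difference in the input, including a single timestamp, yields a different digest, which is exactly the weakness described above.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class PageHash
{
    // Hypothetical helper: returns the SHA-256 digest of the page source
    // as a lowercase hex string.
    public static string Sha256Hex(string content)
    {
        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(content));
            // BitConverter yields "BA-78-..."; strip the dashes and lowercase it.
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

Storing this string per URL and comparing it on the next fetch detects any change at all, but it cannot tell a meaningful change from a trivial one.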

So now I'm thinking of tying it to a percentage of 'changedness': something like, more than 5% changed will cause the 'changed' logic to run.
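One way to pin down that threshold, as a rough sketch: assume some line-based diff has already produced counts of inserted, deleted, and modified lines plus a total line count. The `ChangeThreshold` class, its method names, and the 5% default are assumptions for illustration only.

```csharp
using System;

public static class ChangeThreshold
{
    // Percentage of lines touched by the diff, relative to the total line count.
    // Returns 0 when there are no lines, to avoid dividing by zero.
    public static double PercentChanged(int inserted, int deleted, int modified, int totalLines)
    {
        if (totalLines <= 0) return 0.0;
        return 100.0 * (inserted + deleted + modified) / totalLines;
    }

    // True when strictly more than thresholdPercent of the lines changed.
    // The 5% default mirrors the figure in the question and should be tuned per page.
    public static bool ExceedsThreshold(int inserted, int deleted, int modified,
                                        int totalLines, double thresholdPercent = 5.0)
    {
        return PercentChanged(inserted, deleted, modified, totalLines) > thresholdPercent;
    }
}
```

With 3 changed lines out of 60 the result is exactly 5%, which does not trip a strict "more than 5%" check; 4 changed lines out of 60 does.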

I'd love to hear any ideas on how I can reliably tell when a page has changed in a meaningful way.

Recommended answer

One solution is to identify the parts of a dynamic page that are normally static, so that an update to any of them counts as a 'change'. Then use a diff tool (example below) to compare the original page source with the updated page source. However, identifying these parts manually for every page will not necessarily scale well once you have more than a few dozen pages.

Two ideas:

1) Use HTMLAgilityPack (.NET library, http://htmlagilitypack.codeplex.com/) to parse the page DOM and count the distinct page elements in both the stored, previously scanned page and the most recently scanned version of it. Use a formula that you deem satisfactory to flag a 'change'. A very simple example: the old copy has 8 anchor <a> tags and the new one has only 5.

2) Use a diffing library such as DiffPlex (http://diffplex.codeplex.com/) to determine word and line changes. Through analysis, you will need to come up with a baseline for the number of word and line changes that should trigger a valid 'change'.

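    using DiffPlex;
    using DiffPlex.DiffBuilder;
    using DiffPlex.DiffBuilder.Model;

    // The class and method names are illustrative wrappers around the diff logic.
    public static class PageDiff
    {
        public static bool HasChanged(string oldText, string newText)
        {
            var d = new Differ();
            var inlineBuilder = new InlineDiffBuilder(d);
            var result = inlineBuilder.BuildDiffModel(oldText, newText);

            // Count lines by change type.
            int inserted = 0, deleted = 0, modified = 0;
            foreach (var line in result.Lines)
            {
                if (line.Type == ChangeType.Inserted)
                    inserted++;
                else if (line.Type == ChangeType.Deleted)
                    deleted++;
                else if (line.Type == ChangeType.Modified)
                    modified++;
            }

            // some base line formula/threshold you come up with through analysis
            return deleted + inserted + modified > 10;
        }
    }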
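Idea 1 can be sketched in a similar way. This is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the `TagCountComparer` class, its method names, and the scoring formula (elements added or removed as a percentage of the old page's element count) are illustrative assumptions, not a prescribed formula.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class TagCountComparer
{
    // Count how many times each element name (a, div, p, ...) occurs in the HTML.
    public static Dictionary<string, int> CountTags(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element) // skip text and comment nodes
            .GroupBy(n => n.Name)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    // Sum of per-tag count differences, as a percentage of the old page's element count.
    // The value can exceed 100 when the new page has many more elements than the old one.
    public static double PercentChanged(IDictionary<string, int> oldCounts,
                                        IDictionary<string, int> newCounts)
    {
        int oldTotal = oldCounts.Values.Sum();
        if (oldTotal == 0)
            return newCounts.Values.Sum() == 0 ? 0.0 : 100.0;

        int delta = 0;
        foreach (var tag in oldCounts.Keys.Union(newCounts.Keys))
        {
            // A tag missing from one side counts as zero occurrences there.
            oldCounts.TryGetValue(tag, out int before);
            newCounts.TryGetValue(tag, out int after);
            delta += Math.Abs(before - after);
        }
        return 100.0 * delta / oldTotal;
    }
}
```

For the example above (8 anchors before, 5 after, nothing else on the page) this scores 37.5%. Counting tags ignores text changes entirely, so it is probably best combined with the diff-based check rather than used alone.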
