Find Similarities between Blocks of Text between Many HTML Documents?

Question

If I have, say, 20 HTML pages and I want to extract the shared/similar portions of the documents, what are some efficient ways to do that?

For example, comparing 10 StackOverflow pages, I'd find that the top bar and the main menu bar are the same on every page, so I could extract them.

It seems like I'd need either a diff program or some complex regexps, but assume I don't have any knowledge of the page/text/html structure beforehand.

Recommended answer

You should consider a clone detector such as CloneDR. Good ones compare the structure of thousands of files at once regardless of the formatting, and will tell you what the files have as common elements and how those common elements vary.

CloneDR has been applied to many programming languages. Its foundation, the DMS Software Reengineering Toolkit, already handles (dirty) HTML, so it would be pretty easy to build an HTML CloneDR.
