How to strip insignificant whitespace out of HTML


Question

I have to compare different versions of HTML pages for formatting and text changes. Unfortunately the guy/company who creates them uses some kind of HTML editor that re-wraps all the HTML every time (and adds tons of whitespace), which makes it hard to diff them. So I am looking for a tool (preferably a Java library) that can reformat my HTML so that all insignificant spaces and newlines are removed.

That means, in

<h1>First Headline</h1> <h2>Second headline</h2>

the space between </h1> and <h2> should be removed, but in

<b>formatted</b> <i>text</i>

the whitespace must not be removed. I do not care about <pre>, <textarea> or <script> blocks, nor about CSS whitespace properties that can change the behavior. I am just looking for a solution that strips most of the unnecessary whitespace (and it is better to leave too much whitespace in than too little).

(I am already collapsing runs of whitespace and re-adding newlines instead of spaces before tags to make the text more readable, but there are still too many cases where, for example, a new newline between headlines or table cells/rows breaks my simple "solution".)
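
To make the rule above concrete, here is a minimal regex-based sketch in Java. It is not a parser and not something the question itself proposes: the class name, the `strip` method and the list of block-level tags are all assumptions chosen for illustration. It collapses runs of whitespace, then drops whitespace that directly touches a block-level tag, and leaves whitespace between inline tags such as <b> and <i> alone.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names and the tag list are illustrative assumptions.
class BlockWhitespaceStripper {
    // Deliberately incomplete list of tags treated as block-level.
    private static final String BLOCK_TAGS =
        "html|head|body|title|div|p|h[1-6]|ul|ol|li|table|thead|tbody|tr|td|th|br|hr";

    // Whitespace directly after a block-level opening or closing tag.
    private static final Pattern AFTER_BLOCK = Pattern.compile(
        "(</?(?:" + BLOCK_TAGS + ")\\b[^>]*>)\\s+", Pattern.CASE_INSENSITIVE);

    // Whitespace directly before a block-level opening or closing tag.
    private static final Pattern BEFORE_BLOCK = Pattern.compile(
        "\\s+(</?(?:" + BLOCK_TAGS + ")\\b[^>]*>)", Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        // Step 1: collapse every run of whitespace (including newlines) to one space.
        // Note: this also touches whitespace inside attribute values and <pre> blocks.
        String collapsed = html.replaceAll("\\s+", " ");
        // Step 2: remove the remaining space wherever it touches a block-level tag.
        String afterRemoved = AFTER_BLOCK.matcher(collapsed).replaceAll("$1");
        return BEFORE_BLOCK.matcher(afterRemoved).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(strip("<h1>First Headline</h1> <h2>Second headline</h2>"));
        System.out.println(strip("<b>formatted</b> <i>text</i>"));
    }
}
```

Because only whitespace next to a listed tag is removed, an unknown or unlisted tag keeps its surrounding space, which errs on the side of leaving too much whitespace rather than too little. Attribute values containing ">" and comments are not handled.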

Answer

JTidy may be of use here. It's an HTML parser that parses the HTML (and is tolerant of ill-formed HTML) and presents the HTML as a DOM, and you can override the writing out of this to remove whatever you're not interested in.
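
As a rough illustration of the DOM-based idea, the sketch below walks a W3C DOM tree and removes whitespace-only text nodes that sit next to, or at the edge of, a block-level element. It does not call JTidy: the JDK's own XML parser stands in for it, so this version only accepts well-formed (XHTML-like) input. JTidy is described as exposing the parsed page as a DOM, so a similar walk should be applicable to its output, but that is an assumption here. The class name, the method names and the block-level tag list are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper; the tag list is an illustrative, incomplete assumption.
class DomWhitespaceCleaner {
    private static final Set<String> BLOCK = Set.of(
        "html", "head", "body", "div", "p", "h1", "h2", "h3", "h4", "h5", "h6",
        "ul", "ol", "li", "table", "thead", "tbody", "tr", "td", "th");

    static boolean isBlock(Node n) {
        return n != null && n.getNodeType() == Node.ELEMENT_NODE
            && BLOCK.contains(n.getNodeName().toLowerCase(Locale.ROOT));
    }

    // Recursively removes whitespace-only text nodes that touch a block-level element.
    static void clean(Node parent) {
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            boolean blankText = child.getNodeType() == Node.TEXT_NODE
                && child.getNodeValue().trim().isEmpty();
            if (blankText) {
                Node prev = child.getPreviousSibling();
                // First or last child of a block element: leading/trailing padding.
                boolean atEdge = isBlock(parent) && (prev == null || next == null);
                if (atEdge || isBlock(prev) || isBlock(next)) {
                    parent.removeChild(child);
                }
            } else {
                clean(child);
            }
            child = next;
        }
    }

    // Parses well-formed markup, cleans it, and serializes it without an XML declaration.
    static String cleanXhtml(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            clean(doc);
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cleanXhtml(
            "<div>\n <h1>A</h1>\n <h2>B</h2>\n <p><b>x</b> <i>y</i></p>\n</div>"));
    }
}
```

Working on the tree rather than on the raw text means re-wrapped lines and re-indented tags in the source no longer matter: two versions that differ only in such whitespace serialize to the same string, which is what makes the later diff useful. Whitespace inside text that also contains other characters is left untouched here and would still need collapsing.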
