Fix unclosed tags in HTML or parse with an HTML parser for XSLT transformation
Problem description
I have some HTML code that is the result of an XSLT transformation. (XML->HTML)
I want to run another XSLT transformation on the result HTML. (HTML->HTML)
My problem is that the first transformation may return unclosed tags like "<img>", which means that I can't parse the result HTML with DocumentBuilder, because it uses a SAX parser and of course my HTML file is not valid XML in all cases. (I get an exception saying that the XY tag must be closed.)
I guess there are two solutions:
- Either fix the result HTML by closing the unclosed tags.
- Use some kind of HTML parser to get a valid org.w3c.dom.Document and skip XML parsers like SAX.
I would really like to use mainly the same method I used for the first transformation, so I would prefer one of the solutions above. The problem is that I can't find any obvious 3rd party jars that can help. (Though I looked.) So basically I would like to know what my options are here. Are there any solutions to this problem?
Any help would be greatly appreciated.
TagSoup - Just Keep On Truckin'
You could use TagSoup to ensure that all of the documents are well-formed.
...a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short.
TagSoup is designed for people who have to process this stuff using some semblance of a rational application design.
By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
If you are using Saxon, you can make TagSoup your parser by adding the following option:
...you can use the standard Saxon -x org.ccil.cowan.tagsoup.Parser option, after making sure that TagSoup is on your Java classpath.
I have used this to parse and transform HTML documents in a single pass and have found that it works great. It will read the document as a well-formed XHTML document available to be manipulated and transformed through XML tools.
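For readers who drive the transformation from Java code through JAXP rather than the Saxon command line, the same idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name HtmlToXslt and the methods tagSoupReader, toDocument and transform are made up for this example, and the TagSoup parser is loaded by its class name so that the file compiles even when the TagSoup jar is absent. Any SAX XMLReader can be passed in; with TagSoup, the reader is what turns unclosed tags such as <img> into well-formed SAX events.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class HtmlToXslt {

    // Hypothetical helper: instantiates TagSoup's SAX parser reflectively.
    // Requires the TagSoup jar on the classpath at run time.
    static XMLReader tagSoupReader() throws Exception {
        return (XMLReader) Class.forName("org.ccil.cowan.tagsoup.Parser")
                .getDeclaredConstructor().newInstance();
    }

    // Builds an org.w3c.dom.Document by running an identity transform
    // from the reader's SAX events into a DOMResult.
    static Document toDocument(XMLReader reader, String html) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        identity.transform(
                new SAXSource(reader, new InputSource(new StringReader(html))),
                result);
        return (Document) result.getNode();
    }

    // Parses and transforms in one pass: the stylesheet consumes the
    // reader's SAX events directly, with no intermediate DOM.
    static String transform(XMLReader reader, String html, String xslt) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(
                new SAXSource(reader, new InputSource(new StringReader(html))),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample input with an unclosed <img> and <p>, as in the question.
        String html = "<html><body><img src='a.png'><p>hi</body></html>";
        try {
            Document doc = toDocument(tagSoupReader(), html);
            System.out.println("Root element: " + doc.getDocumentElement().getNodeName());
        } catch (ClassNotFoundException e) {
            System.out.println("TagSoup is not on the classpath: " + e.getMessage());
        }
    }
}
```

One thing to check when writing the second stylesheet: as far as I know, TagSoup reports elements in the XHTML namespace (http://www.w3.org/1999/xhtml) by default, so match patterns may need a namespace prefix bound to that URI.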
Also, Taggle, a TagSoup in C++, available now