Is there a validating HTML parser implemented in Java?


Question

I need to parse HTML 4 in Java. Ideally I'd like an implementation that is SAX compatible.

I'm aware that there are numerous HTML parsers for Java; however, they all seem to perform 'tidying'. In other words, they will correct badly formed HTML. I don't want this.

My requirements are:

  1. No tidying.
  2. If the input document is invalid, HTML parsing should fail.
  3. The document should be validatable against the HTML DTDs.
  4. The parser can produce SAX2 events.

Is there a library that meets these requirements?

Solution

I think the Jericho HTML Parser can deliver at least one of your core requirements ('If the input document is invalid, HTML parsing should fail'): it will at least tell you if there are mismatched tags or other poisonous HTML flaws, and you can choose to fail based on this information.

Try typing invalid html into this Jericho formatting demo, and note the 'Parser Log' at the bottom of the page:

http://jerichohtmlparser.appspot.com/samples/FormatSource.jsp

So yes, this is doing tag tidying, but it is at least telling you about it. You can grab this information by setting a net.htmlparser.jericho.Logger (e.g. a WriterLogger or something more specific of your own creation) on your Source, and then proceeding depending on what errors are logged. This is a small example:

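    // myListeningLogger is assumed to implement net.htmlparser.jericho.Logger
    Source source = new Source("<a>I forgot to close my link!");
    source.setLogger(myListeningLogger);

    // Format into a Writer that discards its output (e.g. NullWriter from
    // Apache Commons IO); only the log messages matter here
    source.getSourceFormatter().writeTo(new NullWriter());
    // myListeningLogger has now had all the HTML flaws written to it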

In the example above, your logger's info() method is called with the string 'StartTag at (r1,c1,p0) missing required end tag', which is relatively easy to parse. You can also simply reject any HTML that logs a message more severe than debug. In fact, Jericho logs almost all errors at 'info' level, with a couple at 'warn' level, so you might be tempted to create a small fork with the severities adjusted to match what you care about.
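To make the "reject anything worse than debug" idea concrete, here is a minimal sketch of such a listening logger. It is deliberately free of any Jericho dependency so it can be run on its own: the class names (FlawCollectingLogger, FlawPosition, FlawLogDemo) are hypothetical, the method names are assumed to mirror Jericho's Logger interface, and the reading of `(r1,c1,p0)` as row, column and character position is my interpretation of the sample message, not something confirmed above. To plug it into a real Source you would presumably declare `implements net.htmlparser.jericho.Logger` on the collector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical collector: remembers every message logged above debug level.
 * Method names are assumed to match Jericho's Logger interface.
 */
final class FlawCollectingLogger {
    private final List<String> flaws = new ArrayList<>();

    public void error(String message) { flaws.add(message); }
    public void warn(String message)  { flaws.add(message); }
    public void info(String message)  { flaws.add(message); } // most flaws arrive here
    public void debug(String message) { /* not treated as a flaw */ }

    public boolean isErrorEnabled() { return true; }
    public boolean isWarnEnabled()  { return true; }
    public boolean isInfoEnabled()  { return true; }
    public boolean isDebugEnabled() { return false; }

    boolean hasFlaws() { return !flaws.isEmpty(); }

    /** Returns a copy so callers cannot modify the collected messages. */
    List<String> flaws() { return new ArrayList<>(flaws); }
}

/** Hypothetical helper that extracts the "(r..,c..,p..)" fragment from a message. */
final class FlawPosition {
    private static final Pattern POSITION =
            Pattern.compile("\\(r(\\d+),c(\\d+),p(\\d+)\\)");

    final int row;
    final int column;
    final int offset;

    private FlawPosition(int row, int column, int offset) {
        this.row = row;
        this.column = column;
        this.offset = offset;
    }

    /** Returns the position found in the message, or empty if there is none. */
    static Optional<FlawPosition> parse(String message) {
        Matcher m = POSITION.matcher(message);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new FlawPosition(
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3))));
    }
}

final class FlawLogDemo {
    public static void main(String[] args) {
        FlawCollectingLogger logger = new FlawCollectingLogger();
        // Stand-in for the message quoted in the answer; with Jericho on the
        // classpath the parser would make this call itself.
        logger.info("StartTag at (r1,c1,p0) missing required end tag");

        if (logger.hasFlaws()) {
            for (String flaw : logger.flaws()) {
                String where = FlawPosition.parse(flaw)
                        .map(p -> "row " + p.row + ", column " + p.column)
                        .orElse("unknown position");
                System.out.println("Rejecting document (" + where + "): " + flaw);
            }
        }
    }
}
```

Treating everything except debug as a flaw avoids the fork mentioned above, at the cost of also rejecting documents for purely informational messages; filtering on the message text is a possible refinement if that turns out to be too strict.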

Jericho is available on Maven Central, which is always a good sign:

http://mvnrepository.com/artifact/net.htmlparser.jericho/jericho-html

Good luck!
