Parsing XML file containing HTML entities in Java without changing the XML


Question

I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as &mdash;, &gt; and so forth. I understand the correct way of dealing with this is to add suitable entity declarations to the XML file before parsing. However, I can't do that as I have no control over those XML files.
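For reference, the "add entity declarations" approach does not necessarily require touching the files on disk: the declarations can be prepended to the text in memory just before parsing. The following is only a minimal sketch under several assumptions: the input fits in a String, it has no DOCTYPE of its own, and only the two entities from the sample are needed. The class EntityPrologDemo and its helpers withEntityDeclarations and barText are hypothetical names made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EntityPrologDemo {
    // Hypothetical internal DTD subset declaring only the entities this sketch needs.
    // "foo" is the root element of the sample; a non-validating parser does not check it.
    static final String DOCTYPE =
        "<!DOCTYPE foo [<!ENTITY nbsp \"&#160;\"><!ENTITY mdash \"&#8212;\">]>";

    // Returns a copy of the XML text with the DOCTYPE placed right after the
    // XML declaration if there is one, otherwise at the very start.
    // Assumption: the input does not already contain a DOCTYPE.
    static String withEntityDeclarations(String xml) {
        int insertAt = xml.startsWith("<?xml") ? xml.indexOf("?>") + 2 : 0;
        return xml.substring(0, insertAt) + DOCTYPE + xml.substring(insertAt);
    }

    // Parses the patched text and returns the text content of the first <bar> element.
    static String barText(String xml) {
        try {
            DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = parser.parse(
                new InputSource(new StringReader(withEntityDeclarations(xml))));
            return doc.getElementsByTagName("bar").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<foo>\n"
                   + "    <bar>Some&nbsp;text &mdash; invalid!</bar>\n</foo>";
        System.out.println(barText(xml));
    }
}
```

The drawback of this sketch is that every entity that may occur has to be listed up front, and the whole document is held in memory as a String; it does not provide the per-entity callback asked about here.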

Is there some kind of callback I can override that is invoked whenever the Java XML parser encounters such an entity? I haven't been able to find one in the API.

I would like to use:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();

DocumentBuilder parser = dbf.newDocumentBuilder();
Document        doc    = parser.parse( stream );

I found that I can override resolveEntity in org.xml.sax.helpers.DefaultHandler, but how do I use this with the higher-level API?
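For what it is worth, the DOM-level counterpart of that hook is DocumentBuilder.setEntityResolver, which accepts an org.xml.sax.EntityResolver. The sketch below shows how it is wired in, with the caveat that resolveEntity is documented as resolving external entities (those with a public or system identifier), so in this sketch the parse of a document with an undeclared &nbsp; still fails. ResolverHookDemo and parseWithResolver are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class ResolverHookDemo {
    // Parses the given XML text with an EntityResolver installed.
    // Returns the parser's error message, or null if parsing succeeded.
    static String parseWithResolver(String xml) {
        try {
            DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DOM-level equivalent of overriding DefaultHandler.resolveEntity.
            parser.setEntityResolver((publicId, systemId) -> {
                System.out.println("resolveEntity called for " + systemId);
                return null; // null asks the parser to use its default resolution
            });
            parser.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            // Raised for well-formedness errors such as an undeclared entity reference.
            return e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException("unexpected failure", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseWithResolver("<foo><bar>Some&nbsp;text</bar></foo>"));
    }
}
```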

Here is a complete example:

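import java.io.FileInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class Main {
    public static void main( String [] args ) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder parser = dbf.newDocumentBuilder();
        // Fails on test.xml because &nbsp; and &mdash; are not declared
        Document        doc    = parser.parse( new FileInputStream( "test.xml" ));
    }
}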

with test.xml:

<?xml version="1.0" encoding="UTF-8"?>
<foo>
    <bar>Some&nbsp;text &mdash; invalid!</bar>
</foo>

produces:

[Fatal Error] :3:20: The entity "nbsp" was referenced, but not declared.
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 3; columnNumber: 20; The entity "nbsp" was referenced, but not declared.

Update: I have been poking around in the JDK source code with a debugger, and boy, what an amount of spaghetti. I have no idea what the design is there, or whether there is one. Just how many layers of an onion can one stack on top of each other?

The key class seems to be com.sun.org.apache.xerces.internal.impl.XMLEntityManager, but I cannot find any code that either lets me add stuff into it before it gets used, or that attempts to resolve entities without going through that class.

Recommended answer

I would use a library like Jsoup for this purpose. I tested the code below and it works. I don't know if this helps. Jsoup can be downloaded here: http://jsoup.org/download

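import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class JsoupExample {
    public static void main(String[] args) {
        String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>" +
                      "<bar>Some&nbsp;text &mdash; invalid!</bar></foo>";
        // Parse with Jsoup's XML parser instead of its default HTML parser
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());

        // Print every <bar> element found in the document
        for (Element e : doc.select("bar")) {
            System.out.println(e);
        }
    }
}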

Result:

<bar>
 Some&nbsp;text — invalid!
</bar>

Loading from a file is described here:

http://jsoup.org/cookbook/input/load-document-from-file
