Java - Read XML and leave all entities alone

Question

I want to read XHTML files using SAX or StAX, whatever works best. But I don't want entities to be resolved, replaced or anything like that. Ideally they should just remain as they are. I don't want to use DTDs.

Here's an (executable, using Scala 2.8.x) example:

import javax.xml.stream._
import javax.xml.stream.events._
import java.io._

println("StAX Test - "+args(0)+"\n")
val factory = XMLInputFactory.newInstance
factory.setProperty(XMLInputFactory.SUPPORT_DTD, false)
factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false)

println("------")
val xer = factory.createXMLEventReader(new FileReader(args(0)))
val entities = new collection.mutable.ArrayBuffer[String]
while (xer.hasNext) {
    val event = xer.nextEvent
    if (event.isCharacters) {
        print(event.asCharacters.getData)
    } else if (event.getEventType == XMLStreamConstants.ENTITY_REFERENCE) {
        entities += event.asInstanceOf[EntityReference].getName
    }
}
println("------")
println("Entities: " + entities.mkString(", "))

Given the following xhtml file ...

<html>
    <head>
        <title>StAX Test</title>
    </head>
    <body>
        <h1>Hallo StAX</h1>
        <p id="html">
            &lt;div class=&quot;header&quot;&gt;
        </p>
        <p id="stuff">
            &Uuml;berdies sollte das hier auch als Copyright sichtbar sein: &#169;
        </p>
        Das war's!
    </body>
</html>

... running scala stax-test.scala stax-test.xhtml will result in:

StAX Test - stax-test.xhtml

------


    StAX Test


    Hallo StAX

      <div class="header">


      berdies sollte das hier auch als Copyright sichtbar sein: ?

    Das war's!

------
Entities: Uuml

So all entities have been replaced, more or less successfully. What I expected, and what I want, is this:

StAX Test - stax-test.xhtml

------


    StAX Test


    Hallo StAX

      &lt;div class=&quot;header&quot;&gt;


      &Uuml;berdies sollte das hier auch als Copyright sichtbar sein: &#169;

    Das war's!

------
Entities: // well, or no entities above and instead:
// Entities: lt, quot, quot, gt, Uuml, #169

Is this even possible? I want to parse XHTML, do some modifications and then output it like that as XHTML again. So I really want the entities to remain in the result.

Also I don't get why Uuml is reported as an EntityReference event while the rest aren't.

Recommended answer

A bit of terminology: &#169; is a numeric character reference (not an entity), and &Uuml; is an entity reference (not an entity).

I don't think any XML parser will report numeric character references to the application - they will always be expanded. Really, your application shouldn't care about this any more than it cares about how much whitespace there is between attributes.
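To illustrate the point about numeric character references, here is a minimal Java sketch using the same StAX factory settings as the question. The class name `CharRefDemo` and the helper `readText` are made up for this illustration, and the behaviour shown is what I would expect from the StAX implementation bundled with the JDK; other implementations may differ in how they split text into events.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of the original question or answer.
public class CharRefDemo {

    /**
     * Concatenates all character data in the document. Fails if the parser
     * emits an ENTITY_REFERENCE event, which is not expected for a numeric
     * character reference such as &#169;.
     */
    static String readText(String xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Same settings as in the question's Scala script.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);

        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.CHARACTERS) {
                // Text may arrive in several chunks, so accumulate it.
                text.append(reader.getText());
            } else if (event == XMLStreamConstants.ENTITY_REFERENCE) {
                throw new IllegalStateException("Unexpected entity reference: "
                        + reader.getLocalName());
            }
        }
        reader.close();
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // The character reference is delivered as ordinary character data.
        System.out.println(readText("<p>a&#169;b</p>"));
    }
}
```

Even with `IS_REPLACING_ENTITY_REFERENCES` switched off, the reference reaches the application as a plain character, so there is nothing left to "leave alone" by the time the event arrives.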

As for entity references, low-level parse interfaces such as SAX will report the existence of an entity reference. At any rate, SAX reports them when they occur in element content, but not when they occur in attribute values. These are special events (startEntity and endEntity) that are delivered only to the LexicalHandler, not to the ContentHandler.
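A rough sketch of the LexicalHandler approach in Java follows. The class names `EntityRefDemo` and `Collector` are invented for the example. It also assumes the entity is declared in an internal DTD subset, because a reference to an undeclared entity in a document with no DTD at all is a well-formedness error that a SAX parser may reject outright. That goes against the "no DTDs" wish in the question, so treat this as a demonstration of the events rather than a complete solution.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class, not part of the original question or answer.
public class EntityRefDemo {

    /** Collects character data plus the names of entity references in content. */
    static class Collector extends DefaultHandler2 {
        final List<String> entityNames = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        @Override
        public void startEntity(String name) {
            // Ignore the "[dtd]" pseudo-entity and parameter entities ("%name").
            if (!name.startsWith("[") && !name.startsWith("%")) {
                entityNames.add(name);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The expanded replacement text still arrives as ordinary characters.
            text.append(ch, start, length);
        }
    }

    static Collector parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        Collector collector = new Collector();
        // A LexicalHandler is registered through this standard SAX2 property,
        // separately from the ContentHandler passed to parse().
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", collector);
        parser.parse(new InputSource(new StringReader(xml)), collector);
        return collector;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE p [<!ENTITY Uuml \"&#220;\">]>"
                   + "<p>&Uuml;ber &lt;b&gt; &#169;</p>";
        Collector c = parse(xml);
        System.out.println("Text: " + c.text);
        System.out.println("Entity references: " + c.entityNames);
    }
}
```

With the parser bundled in the JDK, I would expect only `Uuml` to show up in the list: the predefined references (`&lt;`, `&gt;`) and the numeric character reference are expanded silently. An application that wants to write `&Uuml;` back out could use the startEntity/endEntity pair to suppress the expanded text in between and emit the reference instead.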
