Java DOM transforming and parsing arbitrary strings with invalid XML characters?


Question

First of all, I want to mention that this is not a duplicate of "How to parse invalid (bad / not well-formed) XML?", because I am not given an invalid (or not well-formed) XML file but rather an arbitrary Java String that may or may not contain invalid XML characters. I want to create a DOM Document containing a Text node with the given String, then transform it to a file. When the file is parsed back into a DOM Document, I want to get a String that is equal to the initial String. I create the Text node with org.w3c.dom.Document#createTextNode(String data) and I read the String back with org.w3c.dom.Node#getTextContent().

As you can see in https://stackoverflow.com/a/28152666/3882565 there are some invalid characters for Text nodes in an XML file. Actually, there are two different types of "invalid" characters for Text nodes. First, there are the characters with predefined entities, such as ", &, ', < and >, which the DOM API automatically escapes as &quot;, &amp;, &apos;, &lt; and &gt; in the resulting file and unescapes again when the file is parsed. The problem is that this is not the case for other invalid characters such as '\u0000' or '\uffff'. An exception occurs when parsing the file because '\u0000' and '\uffff' are invalid characters.
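
The difference between the two kinds of characters can be checked with a small in-memory round trip. The following is only a sketch: the class name RoundTripDemo and the helper roundTrips are made up for illustration, and the document is serialized to a String rather than a file. It treats any exception during serialization or parsing as a failed round trip.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class RoundTripDemo {

    /** Returns true if the text survives a DOM serialize-and-parse cycle unchanged. */
    static boolean roundTrips(String text) {
        try {
            Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element element = document.createElement("element");
            element.appendChild(document.createTextNode(text));
            document.appendChild(element);

            // Serialize to an in-memory string instead of a file.
            StringWriter xml = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(document), new StreamResult(xml));

            // Parse the serialized form back and compare the text content.
            Document parsed = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml.toString())));
            return text.equals(parsed.getDocumentElement().getTextContent());
        } catch (Exception e) {
            // Serialization or parsing rejected the content.
            return false;
        }
    }

    public static void main(String[] args) {
        // Characters with predefined entities are escaped and restored by the API.
        System.out.println(roundTrips("a \" & ' < > b"));
        // A NUL character is not a legal XML character in any form.
        System.out.println(roundTrips("a" + '\u0000' + "b"));
    }
}
```

With the parser bundled in the JDK, the first call should report success and the second should not.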

Probably I have to implement a method that escapes those characters in the given String in a unique way before submitting it to the DOM API, and undo that later when I get the String back, right? Is there a better way to do this? Has anyone implemented these or similar methods in the past?

Edit: This question was marked as a duplicate of "Best way to encode text data for XML in Java?". I have now read all of the answers, but none of them solves my problem. All of the answers suggest:


  • Using an XML library such as the DOM API, which I already do; none of those libraries actually replaces invalid characters other than ", &, ', <, > and a few more.
  • Replacing all invalid characters with "&#number;", which results in an exception for invalid characters such as "&#0;" when parsing the file.
  • Using a third-party library with an XML encode method, which does not support illegal characters such as "&#0;" (they are skipped in some libraries).
  • Using a CDATA section, which doesn't support invalid characters either.
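
The second and fourth points can be reproduced directly against the parser. This is a minimal sketch; the class name InvalidCharParseDemo and the helper isWellFormed are hypothetical names chosen for the example, and the XML is parsed from a String rather than a file.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class InvalidCharParseDemo {

    /** Returns true if the DOM parser accepts the given XML text. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            // The parser rejected the input.
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A character reference to a legal character is accepted.
        System.out.println(isWellFormed("<e>&#65;</e>"));
        // A character reference to NUL is rejected.
        System.out.println(isWellFormed("<e>&#0;</e>"));
        // A raw NUL inside a CDATA section is rejected as well.
        System.out.println(isWellFormed("<e><![CDATA[" + '\u0000' + "]]></e>"));
    }
}
```

Neither a numeric character reference nor a CDATA section changes which characters are legal, so the string has to be encoded in some other way before it reaches the DOM API.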

Recommended answer

As @VGR and @kjhughes pointed out in the comments below the question, Base64 is indeed a possible answer to my question. I now have a further solution to my problem, which is based on escaping. I have written two functions, escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string), which can be used in the following way.

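    // Needs java.io.File, javax.xml.parsers.DocumentBuilderFactory, javax.xml.transform.TransformerFactory,
    // javax.xml.transform.dom.DOMSource, javax.xml.transform.stream.StreamResult, org.w3c.dom.Document, org.w3c.dom.Element
    String string = "text#text##text#0;text" + '\u0000' + "text<text&text#";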
    Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    Element element = document.createElement("element");
    element.appendChild(document.createTextNode(escapeInvalidXmlCharacters(string)));
    document.appendChild(element);
    TransformerFactory.newInstance().newTransformer().transform(new DOMSource(document), new StreamResult(new File("test.xml")));
    // creates <?xml version="1.0" encoding="UTF-8" standalone="no"?><element>text##text####text##0;text#0;text&lt;text&amp;text##</element>
    document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("test.xml"));
    System.out.println(unescapeInvalidXmlCharacters(document.getDocumentElement().getTextContent()).equals(string));
    // prints true
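
Before turning to the two escaping functions, the Base64 alternative mentioned above can be sketched as follows. The class name Base64TextCodec and its two methods are hypothetical, and UTF-8 is an assumption made for this example. The encoded form contains only letters, digits, '+', '/' and '=', so it is always legal Text node content, at the cost of a file that is no longer human-readable.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64TextCodec {

    /** Encodes arbitrary text as Base64, which contains only XML-safe characters. */
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    /** Restores the text produced by encode. */
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String string = "text" + '\u0000' + "text<text&text" + '\uffff';
        String encoded = encode(string);
        // Pass 'encoded' to createTextNode, and decode the result of getTextContent.
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(string));
    }
}
```

One caveat of this sketch: String.getBytes replaces unpaired surrogates, so a String containing them would not be restored exactly.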

escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string):

/**
 * Escapes invalid XML Unicode code points in a <code>{@link String}</code>. The
 * DOM API already escapes predefined entities, such as {@code "}, {@code &},
 * {@code '}, {@code <} and {@code >} for
 * <code>{@link org.w3c.dom.Text Text}</code> nodes. Therefore, these Unicode
 * code points are ignored by this function. However, there are some other
 * invalid XML Unicode code points, such as {@code '\u0000'}, which are even
 * invalid in their escaped form, such as {@code "&#0;"}.
 * <p>
 * This function replaces all {@code '#'} by {@code "##"} and all Unicode code
 * points which are not in the ranges #x9 | #xA | #xD | [#x20-#xD7FF] |
 * [#xE000-#xFFFD] | [#x10000-#x10FFFF] by the <code>{@link String}</code>
 * {@code "#c;"}, where <code>c</code> is the Unicode code point.
 * 
 * @param string the <code>{@link String}</code> to be escaped
 * @return the escaped <code>{@link String}</code>
 * @see <code>{@link #unescapeInvalidXmlCharacters(String)}</code>
 */
public static String escapeInvalidXmlCharacters(String string) {
    StringBuilder stringBuilder = new StringBuilder();

    for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(i);

        if (codePoint == '#') {
            stringBuilder.append("##");
        } else if (codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD || codePoint >= 0x20 && codePoint <= 0xD7FF || codePoint >= 0xE000 && codePoint <= 0xFFFD || codePoint >= 0x10000 && codePoint <= 0x10FFFF) {
            stringBuilder.appendCodePoint(codePoint);
        } else {
            stringBuilder.append('#').append(codePoint).append(';');
        }
    }

    return stringBuilder.toString();
}

/**
 * Unescapes invalid XML Unicode code points in a <code>{@link String}</code>.
 * Reverses the effect of <code>{@link #escapeInvalidXmlCharacters(String)}</code>.
 * 
 * @param string the <code>{@link String}</code> to be unescaped
 * @return the unescaped <code>{@link String}</code>
 * @see <code>{@link #escapeInvalidXmlCharacters(String)}</code>
 */
public static String unescapeInvalidXmlCharacters(String string) {
    StringBuilder stringBuilder = new StringBuilder();
    boolean escaped = false;

    for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(i);

        if (escaped) {
            stringBuilder.appendCodePoint(codePoint);
            escaped = false;
        } else if (codePoint == '#') {
            StringBuilder intBuilder = new StringBuilder();
            int j;

            for (j = i + 1; j < string.length(); j += Character.charCount(codePoint)) {
                codePoint = string.codePointAt(j);

                if (codePoint == ';') {
                    escaped = true;
                    break;
                }

                if (codePoint >= '0' && codePoint <= '9') {
                    intBuilder.appendCodePoint(codePoint);
                } else {
                    break;
                }
            }

            if (escaped) {
                try {
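                    // codePoint still holds ';' here, so the loop increment steps past the terminator only,
                    // even if the decoded code point is a supplementary one
                    stringBuilder.appendCodePoint(Integer.parseInt(intBuilder.toString()));
                    escaped = false;
                    i = j;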
                } catch (IllegalArgumentException e) {
                    codePoint = '#';
                    escaped = true;
                }
            } else {
                codePoint = '#';
                escaped = true;
            }
        } else {
            stringBuilder.appendCodePoint(codePoint);
        }
    }

    return stringBuilder.toString();
}

Note that these functions are probably very inefficient and could be written in a better way. Feel free to post suggestions for improving the code in the comments.
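
As one possible direction, here is a more compact sketch of the same escaping scheme. It is not a drop-in replacement: the class name XmlCharEscaper and its method names are made up for illustration, it assumes Java 9 or later (for Matcher.replaceAll with a function argument), and it has not been benchmarked. On input that was not produced by escape, for example a lone '#', it may also behave differently from the functions above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XmlCharEscaper {

    // Matches "##" or "#<decimal digits>;"; group 1 holds the digits if present.
    private static final Pattern ESCAPE = Pattern.compile("#(?:#|(\\d+);)");

    /** Same character ranges as in escapeInvalidXmlCharacters. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        text.codePoints().forEach(cp -> {
            if (cp == '#') {
                out.append("##");
            } else if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Invalid code point: write it as "#<decimal value>;".
                out.append('#').append(cp).append(';');
            }
        });
        return out.toString();
    }

    static String unescape(String text) {
        return ESCAPE.matcher(text).replaceAll(match -> {
            String digits = match.group(1);
            String replacement = digits == null
                    ? "#"
                    : new String(Character.toChars(Integer.parseInt(digits)));
            // Quote the replacement so '$' and '\' are taken literally.
            return Matcher.quoteReplacement(replacement);
        });
    }

    public static void main(String[] args) {
        String string = "text#text##text#0;text" + '\u0000' + "text<text&text#";
        System.out.println(unescape(escape(string)).equals(string));
    }
}
```

Because the regular expression consumes the escaped text from left to right, "##0;" is read as an escaped '#' followed by the literal "0;", which is what keeps the scheme unambiguous.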
