Characters really allowed in XML


Problem description

Due to some parser error when parsing certain (supposedly) XML data, I had a look at the XML standard to figure out what is really allowed. My current qualms are with regard to what is allowed to go into the content of a tag <bla>some content</bla>, i.e. what some content is allowed to contain.

I have in section 2.4:

CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)

which means "every sequence of characters that does not contain <, &, or ]]>". But on which character set does the negation [^<&] actually operate? Is it the full Unicode range (afaik #0x0000 up to whatever), or is it rather the Char definition from section 2.2:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

in which case quite a bunch of characters would need to be escaped in the content?

Solution

Our friendly Wikipedia has a section devoted to this, and I think it explains the rules in much simpler terms: http://en.wikipedia.org/wiki/XML#Escaping

Valid Characters

Unicode code points in the following ranges are valid in XML 1.0 documents:[9]

  • U+0009, U+000A, U+000D: these are the only C0 controls accepted in XML 1.0;
  • U+0020–U+D7FF, U+E000–U+FFFD: this excludes some (not all) non-characters in the BMP (all surrogates, U+FFFE and U+FFFF are forbidden);
  • U+10000–U+10FFFF: this includes all code points in supplementary planes, including non-characters.

XML 1.1[10] extends the set of allowed characters to include all the above, plus the remaining characters in the range U+0001–U+001F. At the same time, however, it restricts the use of C0 and C1 control characters other than U+0009, U+000A, U+000D, and U+0085 by requiring them to be written in escaped form (for example U+0001 must be written as &#x01; or its equivalent). In the case of C1 characters, this restriction is a backwards incompatibility; it was introduced to allow common encoding errors to be detected.

The code point U+0000 is the only character that is not permitted in any XML 1.0 or 1.1 document.
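As a rough illustration of the XML 1.0 ranges listed above, here is a minimal Python sketch that tests single characters against the Char production and strips anything outside it. The helper names `is_xml10_char` and `strip_invalid_xml10` are made up for this example; silently dropping characters is only one possible policy, and rejecting the input may be more appropriate.

```python
import re

# Matches any single character OUTSIDE the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NOT_XML10_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_xml10_char(ch: str) -> bool:
    """Return True if the single character ch is allowed in an XML 1.0 document."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_invalid_xml10(text: str) -> str:
    """Drop every character that XML 1.0 forbids (hypothetical cleanup policy)."""
    return _NOT_XML10_CHAR.sub("", text)
```

Note that this only answers "may this character appear in the document at all"; `<` and `&` pass the check but still have to be escaped in element content.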

Escaping

XML provides escape facilities for including characters which are problematic to include directly. For example:

  • The characters "<" and "&" are key syntax markers and may never appear in content outside a CDATA section.[13]
  • Some character encodings support only a subset of Unicode. For example, it is legal to encode an XML document in ASCII, but ASCII lacks code points for Unicode characters such as "é".
  • It might not be possible to type the character on the author's machine.
  • Some characters have glyphs that cannot be visually distinguished from other characters: examples are

    • non-breaking space (&#xa0;) " "

    • compare space (&#x20;) " "

    • Cyrillic Capital Letter A (&#x410;) "А"

    • compare Latin Capital Letter A (&#x41;) "A"

There are five predefined entities:

  • &lt; represents "<"
  • &gt; represents ">"
  • &amp; represents "&"
  • &apos; represents '
  • &quot; represents "
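For what it is worth, Python's standard library can apply these predefined entities for you. The sketch below uses `xml.sax.saxutils.escape`, which handles `&`, `<` and `>` by default and accepts an extra mapping for the quote characters. The wrapper names `escape_text` and `escape_attr_value` are only illustrative.

```python
from xml.sax.saxutils import escape, unescape


def escape_text(text: str) -> str:
    """Escape &, < and > so the text can be used as element content."""
    return escape(text)


def escape_attr_value(text: str) -> str:
    """Additionally escape both quote characters for use inside an attribute value."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})


print(escape_text("I <3 Jörg & co"))
print(escape_attr_value('say "hi"'))
print(unescape("a &lt; b"))
```

Escaping does not make forbidden characters such as U+0000 legal; it only protects the syntax markers.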

All permitted Unicode characters may be represented with a numeric character reference. Consider the Chinese character "中", whose numeric code in Unicode is hexadecimal 4E2D, or decimal 20,013. A user whose keyboard offers no method for entering this character could still insert it in an XML document encoded either as &#20013; or &#x4e2d;. Similarly, the string "I <3 Jörg" could be encoded for inclusion in an XML document as "I &lt;3 J&#xF6;rg".
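A small sketch of numeric character references in Python, assuming the target is an ASCII-only document: the built-in `xmlcharrefreplace` error handler emits decimal references for anything that does not fit in ASCII, and `xml.etree.ElementTree` decodes both decimal and hexadecimal forms when parsing. The helper names `char_refs` and `to_ascii_xml` are hypothetical.

```python
import xml.etree.ElementTree as ET


def char_refs(ch: str) -> tuple[str, str]:
    """Return the decimal and hexadecimal character references for one character."""
    cp = ord(ch)
    return f"&#{cp};", f"&#x{cp:X};"


def to_ascii_xml(text: str) -> str:
    """Replace non-ASCII characters with decimal references.

    This does NOT escape < or &; do that separately before calling this.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


decimal_ref, hex_ref = char_refs("中")
print(decimal_ref, hex_ref)

# Both forms parse back to the same character.
print(ET.fromstring(f"<t>{decimal_ref}</t>").text == ET.fromstring(f"<t>{hex_ref}</t>").text)
```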

"&#0;" is not permitted, however, because the null character is one of the control characters excluded from XML, even when using a numeric character reference.[14] An alternative encoding mechanism such as Base64 is needed to represent such characters.
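To see this in practice, the following sketch checks that Python's expat-based `xml.etree.ElementTree` parser refuses `&#0;`, and then carries a payload containing a null byte as Base64 text instead. The element name `blob` and the helper names are invented for this example.

```python
import base64
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def wrap_binary(data: bytes) -> str:
    """Serialize arbitrary bytes as Base64 text inside a <blob> element."""
    elem = ET.Element("blob")
    elem.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(elem, encoding="unicode")


def unwrap_binary(xml_text: str) -> bytes:
    """Recover the original bytes from a <blob> element."""
    return base64.b64decode(ET.fromstring(xml_text).text)


print(is_well_formed("<t>&#0;</t>"))  # the null reference is rejected
payload = wrap_binary(b"a\x00b")
print(payload)
print(unwrap_binary(payload) == b"a\x00b")
```

The receiver has to know that the element holds Base64; XML itself attaches no such meaning to the text.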
