XML Spec and UTF-16

Question

Section 4.3.3 and Appendix F of the XML 1.0 spec discuss UTF-16, the byte order mark (BOM) in UTF-16 encoded data streams, and the XML encoding declaration. From those sections, it would seem that a byte order mark is required in UTF-16 documents. But the summary chart in Appendix F includes a scenario where UTF-16 input has no byte order mark, although in that scenario an XML declaration is present. According to Section 4.3.3, a UTF-16 encoded document does not require an encoding declaration (and the XML declaration itself is optional in such a case).

Given this information, is a UTF-16 XML document with neither a BOM nor an XML declaration, and with no externally provided encoding information, considered well-formed if the rest of the document is?

Recommended answer

From the Unicode 6.2 specification (page 99):

The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.

So a BOM is not required in a UTF-16 document as far as Unicode is concerned. But there may be a "higher-level protocol", such as the XML specification, that states what must be done for UTF-16 XML documents without a BOM.
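
The Unicode rule quoted above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `decode_utf16_scheme` is hypothetical, and no higher-level protocol is assumed to supply the byte order. The explicit `utf-16-be` and `utf-16-le` codecs are used because Python's generic `utf-16` codec, as far as I know, falls back to the platform's native byte order when no BOM is present, which is not the Unicode default.

```python
import codecs


def decode_utf16_scheme(data: bytes) -> str:
    """Decode bytes in the UTF-16 encoding scheme (hypothetical helper).

    Assumes there is no higher-level protocol that dictates the byte order.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        # BOM FE FF: big-endian; strip the two BOM bytes before decoding.
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        # BOM FF FE: little-endian; strip the two BOM bytes before decoding.
        return data[2:].decode("utf-16-le")
    # No BOM and no higher-level protocol: Unicode says big-endian.
    return data.decode("utf-16-be")
```

For example, `decode_utf16_scheme(b"\x00<\x00a\x00/\x00>")` treats the BOM-less input as big-endian. Whether an XML processor is allowed to do the same is a separate question, which the XML specification answers.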

Section 4.3.3 in the XML 1.0 specification says:

Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000], section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF).

Let's get back to the above later. Appendix F describes approaches for detecting the character encoding when a BOM isn't present. But I don't think that section is relevant to your question, as you're asking whether a UTF-16 XML document without a BOM and without an XML declaration is "well-formed", and Appendix F is a non-normative part of the specification.
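
To make the Appendix F approach concrete, here is a minimal sketch of its byte-prefix sniffing, reduced to the UTF-8 and UTF-16 rows of the chart. The function name `sniff_xml_encoding` and the returned labels are assumptions made for illustration. The UCS-4 and EBCDIC rows are deliberately left out, so this is not a complete implementation of the appendix.

```python
def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding family from the first bytes (hypothetical helper).

    Simplified from the Appendix F chart: UCS-4 and EBCDIC rows are omitted.
    """
    prefix = data[:4]
    # Rows with a byte order mark.
    if prefix.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if prefix.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if prefix.startswith(b"\xff\xfe"):
        return "utf-16-le"
    # Rows without a BOM: these rely on the entity starting with "<?".
    if prefix == b"\x00<\x00?":
        return "utf-16-be"
    if prefix == b"<\x00?\x00":
        return "utf-16-le"
    if prefix == b"<?xm":
        # ASCII-compatible encoding; the declaration must be read to know more.
        return "ascii-compatible"
    # "Other" row: UTF-8 without an encoding declaration (or bad input).
    return "utf-8"
```

Note what happens to the document in the question: `"<a/>".encode("utf-16-be")` starts with `00 3C 00 61`, which matches none of the BOM-less UTF-16 rows because there is no `<?`, so the sketch falls through to the "other" row and reports UTF-8.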

So, going back to the specification, a document is well-formed if "Taken as a whole, it matches the production labeled document." (Section 2.1). Reviewing the document production shows that the XML declaration is optional (this is also mentioned in Section 2.8). So it's possible to have a well-formed document without an XML declaration; this answers half of your question.

The other half is whether a UTF-16 XML document that has neither an XML declaration nor a BOM can still be well-formed. Section 4.3.3 says (the final clause is the relevant one):

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.

Based on this, in the absence of external information, a UTF-16 XML document without a BOM and without an encoding declaration (which is part of the XML declaration) is not a well-formed document, because a fatal error violates well-formedness (see the definition of well-formedness constraint in Section 1.2). This also matches what Section 4.3.3 said earlier about a BOM being required for UTF-16.
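
The fatal-error rule from Section 4.3.3 quoted above can be written down as a small predicate. This is a sketch of the rule's logic only, not of any real parser: the function name, its parameters, and the naive comparison of encoding names are all assumptions made for illustration.

```python
from typing import Optional


def is_fatal_encoding_error(
    actual_encoding: str,
    has_bom: bool,
    declared_encoding: Optional[str],
    external_encoding: Optional[str] = None,
) -> bool:
    """Apply the Section 4.3.3 fatal-error rule (hypothetical helper).

    Encoding names are compared naively, case-insensitively.
    """
    if external_encoding is not None:
        # The rule only applies when no external protocol supplies information.
        return False
    actual = actual_encoding.lower()
    if declared_encoding is not None:
        # Entity is presented in an encoding other than the one it declares.
        return declared_encoding.lower() != actual
    if not has_bom:
        # Neither BOM nor encoding declaration: only UTF-8 is allowed.
        return actual != "utf-8"
    return False
```

For the document in the question, `is_fatal_encoding_error("utf-16", has_bom=False, declared_encoding=None)` is true, which is the conclusion reached above. Adding a BOM, an encoding declaration, or external encoding information makes the predicate false.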
