How default is the default encoding (UTF-8) in the XML Declaration?


Question

I know that the default encoding of XML is UTF-8, that all XML consumers MUST support it, and so on. So this is not just a question of whether or not XML has a default encoding.

I also know that the XML declaration <?xml version="1.0" ... ?> at the beginning of the document is optional, and that specifying the encoding in it is optional as well.

So I ask myself whether the following two XML declarations are two expressions of exactly the same thing:

<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8"?>

From my current understanding I would say they are equivalent, but I do not know for sure. Has the equivalence of these two declarations been specified somewhere?

(Consider each of these two example lines to be the first line of an XML document, preceded by zero bytes and encoded as UTF-8.)

Recommended answer

Short answer

Under the very specific circumstances of a UTF-8 encoded document with no external encoding information (which I understand from the comments is what you're interested in), there is no difference between the two declarations.
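As a rough illustration of that equivalence, the sketch below feeds both forms of the declaration to one particular parser, Python's standard `xml.etree.ElementTree`. The sample element name and text are made up for the example, and the result only shows how this one parser behaves; it is not a statement about every XML processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample content containing non-ASCII characters.
body = "<greeting>h\u00e9llo \u4e16\u754c</greeting>"

# Same UTF-8 bytes, with and without an explicit encoding declaration.
without_decl = ('<?xml version="1.0"?>' + body).encode("utf-8")
with_decl = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")

text_a = ET.fromstring(without_decl).text
text_b = ET.fromstring(with_decl).text

# Both documents decode to the same character data.
print(text_a == text_b, text_a)
```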

The long answer is more interesting, though.

What the specification says

Appendix F.1 of the XML specification explains the process that should be followed to determine the encoding when there is no external encoding information.

If the document is encoded as one of the UTF variants, the parser should be able to detect the encoding within the first 4 bytes, either from the Byte Order Mark or from the start of the XML declaration.
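The detection step can be sketched as a lookup on the first four bytes. This is a simplified, non-normative reading of the Appendix F.1 tables: the function name `sniff_encoding` and the returned labels are invented for this example, and the unusual UCS-4 octet orders listed in the appendix are left out.

```python
# Byte patterns for documents that start with a Byte Order Mark.
# Longer patterns come first so UCS-4 LE is not mistaken for UTF-16 LE.
_BOM_TABLE = [
    (b"\x00\x00\xfe\xff", "UCS-4 BE"),
    (b"\xff\xfe\x00\x00", "UCS-4 LE"),
    (b"\xfe\xff", "UTF-16 BE"),
    (b"\xff\xfe", "UTF-16 LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
]

# Byte patterns for documents without a BOM: how "<?xm" (or "<?")
# looks in each encoding family.
_NO_BOM_TABLE = [
    (b"\x00\x00\x00\x3c", "UCS-4 BE"),
    (b"\x3c\x00\x00\x00", "UCS-4 LE"),
    (b"\x00\x3c\x00\x3f", "UTF-16 BE"),
    (b"\x3c\x00\x3f\x00", "UTF-16 LE"),
    (b"\x3c\x3f\x78\x6d", "ASCII-compatible"),  # read the declaration next
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),
]


def sniff_encoding(data: bytes) -> str:
    """Guess the encoding family from the first four bytes of a document."""
    head = data[:4]
    for prefix, label in _BOM_TABLE + _NO_BOM_TABLE:
        if head.startswith(prefix):
            return label
    # No BOM and no recognisable declaration start: UTF-8 is assumed.
    return "UTF-8 (default)"


print(sniff_encoding('<?xml version="1.0"?><a/>'.encode("utf-8")))
print(sniff_encoding('<?xml version="1.0"?><a/>'.encode("utf-16-be")))
```

Note that "ASCII-compatible" only narrows things down to a family (UTF-8, ISO 8859 variants, and so on); the encoding declaration still has to be read to pick the exact one.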

However, according to the spec, it should still read the encoding declaration.

In cases above which do not require reading the encoding declaration to determine the encoding, section 4.3.3 still requires that the encoding declaration, if present, be read and that the encoding name be checked to match the actual encoding of the entity.

If they don't match, according to section 4.3.3:

...it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration
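The consistency check described above might look roughly like the following. Everything here is a hypothetical sketch: `check_declaration`, `declared_encoding`, and `EncodingMismatch` are names made up for the example, the regular expression is a loose approximation of the XML declaration grammar, and encoding names are compared through Python's codec registry, which a real XML processor would not necessarily do.

```python
import codecs
import re
from typing import Optional

# Loose approximation of the XMLDecl production (assumption, not the
# exact grammar): captures the encoding name if one is declared.
_DECL = re.compile(
    r'''^<\?xml\s+version\s*=\s*["'][^"']*["']'''
    r'''(?:\s+encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["'])?'''
)


class EncodingMismatch(ValueError):
    """Stands in for the 'fatal error' of section 4.3.3."""


def declared_encoding(text: str) -> Optional[str]:
    """Return the declared encoding name, or None if there is none."""
    match = _DECL.match(text)
    return match.group(1) if match else None


def check_declaration(data: bytes, actual: str) -> None:
    """Raise if the declaration names a codec other than `actual`.

    `actual` is the Python codec the bytes were really written in,
    for example "utf-8" or "utf-16".
    """
    text = data.decode(actual).lstrip("\ufeff")  # drop a leading BOM
    declared = declared_encoding(text)
    if declared is None:
        return  # nothing declared, nothing to contradict
    if codecs.lookup(declared).name != codecs.lookup(actual).name:
        raise EncodingMismatch(f"declared {declared!r}, actual {actual!r}")


doc = '<?xml version="1.0" encoding="UTF-8"?><a/>'
check_declaration(doc.encode("utf-8"), "utf-8")  # consistent, no error
try:
    check_declaration(doc.encode("utf-16"), "utf-16")
except EncodingMismatch as exc:
    print("fatal:", exc)
```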

Encoded UTF-16, declared UTF-8

Let's see what happens in reality when we create an XML document encoded as UTF-16 but with the encoding declaration set to UTF-8.

Opera, Firefox and Chrome all interpret the document as UTF-16, ignoring the encoding declaration. Internet Explorer (version 9 at least) displays a blank document, but no actual error.

So if you include a UTF-8 encoding declaration on your UTF-8 document and someone later converts it to UTF-16, it will work in most browsers but fail in IE (and, I assume, in most Microsoft XML APIs). If you had left the encoding declaration off, you would have been fine.
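To repeat this experiment outside a browser, a mismatched document can be produced in a few lines. The sketch below builds the bytes and hands them to Python's `xml.etree.ElementTree` as one more data point; the sample content is invented, and the code deliberately reports whatever the parser does rather than assuming an outcome.

```python
import codecs
import xml.etree.ElementTree as ET

# The declaration claims UTF-8 ...
doc = '<?xml version="1.0" encoding="UTF-8"?><note>caf\u00e9</note>'

# ... but the bytes are UTF-16 (Python's "utf-16" codec prepends a BOM).
mismatched = doc.encode("utf-16")
print(mismatched[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# Hand the bytes to a parser and report what it does with them.
try:
    result = ET.fromstring(mismatched).text
except ET.ParseError as exc:
    result = f"rejected: {exc}"
print(result)
```

Writing `mismatched` to a file with `open(path, "wb")` gives a test document that can be opened in any browser or XML tool.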

Technically, I think IE is the most accurate. The fact that it doesn't display an error as such might be because the error occurs at the encoding level rather than the XML level. It is presumably doing its best to interpret the UTF-16 bytes as UTF-8, failing to find any characters that decode, and ending up passing an empty character sequence to the XML parser.

Encoded UTF-8, declared otherwise

You might now think that Firefox, Chrome and Opera are just ignoring the encoding declaration altogether, but that's not always the case.

If you encode a document as UTF-8 (with a byte order mark, so it can't be mistaken for anything else) but set the encoding declaration to Latin1, all of the browsers will successfully decode the content as Latin1, ignoring the UTF-8 BOM.

Again, this seems right to me. The BOM bytes are not meaningful as Latin1 text, so presumably they are silently dropped at the character decoding level.

This doesn't work for all declared encodings on a UTF-8 document, though. If the declared encoding is UTF-16, we're back to Opera, Firefox and Chrome ignoring the declared encoding, while Internet Explorer returns a blank document.

Essentially, anything that makes IE return a blank document is going to make the other browsers ignore the declared encoding.

Other inconsistencies

It's also worth mentioning the importance of the Byte Order Mark. According to section 4.3.3 of the spec:

Entities encoded in UTF-16 MUST [...] begin with the Byte Order Mark

However, if you try to read a UTF-16 encoded XML document without a BOM, most browsers will nevertheless accept it as valid. Only Firefox reports it as an XML Parsing Error.
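For anyone who wants to reproduce the BOM-less case, Python's codecs make the difference easy to generate: the generic `utf-16` codec writes a BOM, while the endian-specific `utf-16-le` codec does not. The helper `has_utf16_bom` and the sample document are illustrative only.

```python
import codecs

doc = '<?xml version="1.0" encoding="UTF-16"?><a/>'

with_bom = doc.encode("utf-16")        # BOM followed by UTF-16 code units
without_bom = doc.encode("utf-16-le")  # same text, little-endian, no BOM


def has_utf16_bom(data: bytes) -> bool:
    """True if the data starts with either UTF-16 byte order mark."""
    return data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


print(has_utf16_bom(with_bom), has_utf16_bom(without_bom))
```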

External encoding information

Up to now, we've been considering what happens when there is no external encoding information. But, as others have mentioned, if the document is received via HTTP or enclosed in a MIME envelope of some sort, the encoding information from those sources should take precedence over the document encoding.

Most of the details for the various XML MIME types are described in RFC 3023. However, the reality is somewhat different from what is specified.

First of all, text/xml with an omitted charset parameter should use a charset of US-ASCII, but that requirement has almost always been ignored. Browsers will typically use the value of the XML encoding declaration, or default to UTF-8 if there is none.

Second, if there is a UTF-8 BOM on the document, and the XML encoding declaration is either UTF-8 or not included, the document will be interpreted as UTF-8, regardless of the charset used in the Content-Type.

The only time the encoding from the Content-Type seems to take precedence is when there is no BOM and an explicit charset is specified in the Content-Type.
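The three observations above can be condensed into a small decision function. This is only a summary of the browser behaviour described in this answer, not what RFC 3023 prescribes; the function name and parameters are hypothetical, and the final fallback for combinations the answer does not cover (for example, a UTF-8 BOM with a non-UTF-8 declaration and an explicit charset) is an assumption.

```python
from typing import Optional


def observed_browser_encoding(
    charset: Optional[str],   # charset parameter of Content-Type, if any
    has_utf8_bom: bool,       # document starts with EF BB BF
    declared: Optional[str],  # encoding from the XML declaration, if any
) -> str:
    """Encoding browsers appeared to use, per the observations above."""
    utf8_or_absent = declared is None or declared.lower() in ("utf-8", "utf8")

    # A UTF-8 BOM wins over Content-Type when the declaration agrees
    # with it or is missing.
    if has_utf8_bom and utf8_or_absent:
        return "utf-8"

    # Without a BOM, an explicit charset takes precedence.
    if not has_utf8_bom and charset is not None:
        return charset.lower()

    # Otherwise fall back to the declaration, then to the UTF-8 default.
    # (Assumption for cases the observations do not cover.)
    return (declared or "utf-8").lower()


print(observed_browser_encoding("iso-8859-1", True, None))
print(observed_browser_encoding("iso-8859-1", False, "UTF-8"))
print(observed_browser_encoding(None, False, None))
```

For a UTF-8 document, passing `declared="UTF-8"` or `declared=None` gives the same result for every combination of the other two arguments.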

In any event, there are no cases (involving Content-Type) where including a UTF-8 XML encoding declaration on a UTF-8 document is any different from not having an encoding declaration at all.
