How default is the default encoding (UTF-8) in the XML Declaration?


Question

I know that the default encoding of XML is UTF-8, that all XML consumers MUST support it, and so on. So this is not just a question of whether or not XML has a default encoding.

I also know that the XML declaration <?xml version="1.0" ... ?> at the beginning of the document is optional, and that specifying the encoding in it is optional as well.

So I ask myself whether the following two XML declarations are two expressions of exactly the same thing:

<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8"?>

From my current understanding I would say they are equivalent, but I do not know for certain. Has the equivalence of these two declarations been specified somewhere?

(Consider each of these two example lines to be the first line of an XML document, preceded by zero bytes and encoded as UTF-8.)

Recommended Answer

Short Answer

Under the very specific circumstances of a UTF-8 encoded document with no external encoding information (which I understand from the comments is what you're interested in), there is no difference between the two declarations.
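As a quick sanity check of the short answer, here is a minimal sketch using Python's standard-library parser (xml.etree.ElementTree, which is backed by expat). The element name and sample text are made up for illustration. It only shows how one particular parser behaves, not what the specification mandates.

```python
import xml.etree.ElementTree as ET

# The same UTF-8 encoded body, containing a non-ASCII character (U+00E9).
body = "<greeting>caf\u00e9</greeting>".encode("utf-8")

# Declaration without and with an explicit encoding.
without_encoding = b'<?xml version="1.0"?>' + body
with_encoding = b'<?xml version="1.0" encoding="UTF-8"?>' + body

text_a = ET.fromstring(without_encoding).text
text_b = ET.fromstring(with_encoding).text

# Both documents yield the same decoded text.
print(text_a == text_b)
```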

The long answer is, of course, more interesting.

What the spec says

Appendix F.1 of the XML specification explains the process that should be followed to determine the encoding when there is no external encoding information.

If the document is encoded as one of the UTF variants, the parser should be able to detect the encoding within the first 4 bytes, either from the Byte Order Mark or from the start of the XML declaration.
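The detection step can be sketched roughly as follows. This is a simplified reading of Appendix F.1 that covers only the UTF variants; the function name sniff_xml_encoding is hypothetical, and a real parser would go on to read the encoding declaration in the ASCII-compatible case rather than assume UTF-8.

```python
import codecs


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding family from the first 4 bytes (UTF variants only)."""
    head = data[:4]

    # Byte Order Marks. UTF-32 is checked first because the UTF-32LE BOM
    # (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    if head.startswith(codecs.BOM_UTF32_BE):
        return "UTF-32BE"
    if head.startswith(codecs.BOM_UTF32_LE):
        return "UTF-32LE"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"
    if head.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"
    if head.startswith(codecs.BOM_UTF8):
        return "UTF-8"

    # No BOM: look at how the leading "<?" of the XML declaration is laid out.
    patterns = {
        b"\x00\x00\x00<": "UTF-32BE",
        b"<\x00\x00\x00": "UTF-32LE",
        b"\x00<\x00?": "UTF-16BE",
        b"<\x00?\x00": "UTF-16LE",
        # ASCII-compatible start: provisional answer only, since the
        # declaration may name another ASCII-compatible encoding.
        b"<?xm": "UTF-8",
    }
    # Anything else falls back to the default, UTF-8.
    return patterns.get(head, "UTF-8")


print(sniff_xml_encoding(b'<?xml version="1.0"?><a/>'))
```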

However, according to the spec, the parser should still read the encoding declaration:

"In cases above which do not require reading the encoding declaration to determine the encoding, section 4.3.3 still requires that the encoding declaration, if present, be read and that the encoding name be checked to match the actual encoding of the entity."

If they don't match, then according to section 4.3.3:

"...it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration"
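A strict reading of that rule could be sketched like this. The function resolve_encoding and the regular expression are hypothetical helpers written for this illustration; the sketch only recognises UTF-8 and UTF-16 BOMs and is deliberately stricter than the browser behaviour described later in this answer.

```python
import codecs
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
DECL_RE = re.compile(
    r'<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\']'
)


def resolve_encoding(data: bytes) -> str:
    """Return the encoding to use; raise ValueError on a BOM/declaration clash."""
    if data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        detected = "utf-16"
        head = data.decode("utf-16", errors="replace")[:200]
    elif data.startswith(codecs.BOM_UTF8):
        detected = "utf-8"
        head = data[3:203].decode("latin-1")  # the declaration itself is ASCII
    else:
        detected = None
        head = data[:200].decode("latin-1")

    match = DECL_RE.match(head)
    # Normalise the declared name so "UTF-8" and "utf8" compare equal.
    declared = codecs.lookup(match.group(1)).name if match else None

    if detected is None:
        # Nothing to check against: trust the declaration, else default to UTF-8.
        return declared or "utf-8"
    if declared is not None and declared != detected:
        raise ValueError(f"declared {declared!r} but document is {detected!r}")
    return detected


print(resolve_encoding(b'<?xml version="1.0"?><a/>'))
```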

Encoded as UTF-16, declared as UTF-8

Let's see what happens in reality when we create an XML document encoded as UTF-16 but with the encoding declaration set to UTF-8.
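One way to produce such a test document is sketched below. The helper name and the sample body are made up for illustration, and little-endian byte order is an arbitrary choice.

```python
import codecs


def utf16_doc_declared_utf8(body: str = "<root>text</root>") -> bytes:
    """Build a UTF-16LE document (with BOM) whose declaration claims UTF-8."""
    text = '<?xml version="1.0" encoding="UTF-8"?>\n' + body
    # The explicit "utf-16-le" codec writes no BOM, so prepend one by hand.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


data = utf16_doc_declared_utf8()
print(data[:8])
```

Writing data to a file in binary mode gives a document that can be opened in a browser to repeat the experiment.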

Opera, Firefox and Chrome all interpret the document as UTF-16, ignoring the encoding declaration. Internet Explorer (version 9 at least) displays a blank document, but reports no actual error.

So if you include a UTF-8 encoding declaration in your UTF-8 document and someone later converts it to UTF-16, it will work in most browsers but fail in IE (and, I assume, in most Microsoft XML APIs). If you had left the encoding declaration off, you would have been fine.

Technically I think IE is the most accurate. The fact that it doesn't display an error as such might be explained by the error occurring at the encoding level rather than the XML level. It is presumably doing its best to interpret the UTF-16 bytes as UTF-8, failing to find any characters that decode, and ending up passing an empty character sequence to the XML parser.
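The encoding-level failure is easy to reproduce outside a browser. The sketch below says nothing about what IE does internally; it only shows that UTF-16 bytes with a BOM cannot be strictly decoded as UTF-8, because the BOM bytes 0xFF and 0xFE never occur in valid UTF-8.

```python
text = '<?xml version="1.0" encoding="UTF-8"?><root/>'
utf16_bytes = text.encode("utf-16")  # BOM followed by native-order code units

try:
    utf16_bytes.decode("utf-8")  # strict decoding, as the declaration suggests
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print("decode failed:", exc.reason)
```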

Encoded as UTF-8, declared as something else

You might now think that Firefox, Chrome and Opera are just ignoring the encoding declaration altogether, but that's not always the case.

If you encode a document as UTF-8 (with a Byte Order Mark, so it can't be mistaken for anything else) but set the encoding declaration to Latin1, all of the browsers will successfully decode the content as Latin1, ignoring the UTF-8 BOM.

Again, this seems right to me. The fact that the BOM characters aren't valid in Latin1 just means they are silently dropped at the character decoding level.

This doesn't work for all declared encodings on a UTF-8 document, though. If the declared encoding is UTF-16, we're back to Opera, Firefox and Chrome ignoring the declared encoding, while Internet Explorer returns a blank document.

Essentially, anything that makes IE return a blank document is going to make the other browsers ignore the declared encoding.

Other inconsistencies

It's also worth mentioning the importance of the Byte Order Mark. According to section 4.3.3 of the spec:

"Entities encoded in UTF-16 MUST [...] begin with a Byte Order Mark"

However, if you try to read a UTF-16 encoded XML document without a BOM, most browsers will nevertheless accept it as valid. Only Firefox reports it as an XML Parsing Error.

External encoding information

Up to now, we've been considering what happens when there is no external encoding information. But, as others have mentioned, if the document is received via HTTP or enclosed in a MIME envelope of some sort, the encoding information from those sources should take precedence over the encoding declared in the document.

Most of the details for the various XML MIME types are described in RFC 3023. However, the reality is somewhat different from what is specified.

First of all, text/xml with an omitted charset parameter should be treated as US-ASCII, but that requirement has almost always been ignored. Browsers will typically use the value of the XML encoding declaration, or default to UTF-8 if there is none.

Second, if there is a UTF-8 BOM on the document, and the XML encoding declaration is either UTF-8 or absent, the document will be interpreted as UTF-8, regardless of the charset used in the Content-Type.

The only time the encoding from the Content-Type seems to take precedence is when there is no BOM and an explicit charset is specified in the Content-Type.
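The observed precedence can be summarised as a small decision function. This is a sketch of the browser behaviour described above, not of what RFC 3023 requires; the function name is hypothetical, and the branch for a UTF-8 BOM combined with a non-UTF-8 declaration is an assumption carried over from the Latin1 experiment earlier in this answer.

```python
from email.message import Message
from typing import Optional


def observed_encoding(
    content_type: str, has_utf8_bom: bool, declared: Optional[str]
) -> str:
    """Model which encoding browsers appeared to use in the tests above."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lower-cased charset, or None
    declared = declared.lower() if declared else None

    # A UTF-8 BOM wins when the declaration is UTF-8 or missing.
    if has_utf8_bom and declared in (None, "utf-8"):
        return "utf-8"
    # Without a BOM, an explicit Content-Type charset takes precedence.
    if not has_utf8_bom and charset:
        return charset
    # Otherwise fall back to the declaration, then to the UTF-8 default.
    # (Assumption: a BOM plus a non-UTF-8 declaration follows the declaration.)
    return declared or "utf-8"


print(observed_encoding("text/xml; charset=ISO-8859-1", False, "UTF-8"))
```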

In any event, there are no cases (involving Content-Type) where including a UTF-8 XML encoding declaration on a UTF-8 document is any different from not having an encoding declaration at all.
