Spec justification for &#x80; to &#x9F; in UTF-8 documents browser behaviour wanted

Problem description

The HTML 4.01 spec says, for hexadecimal character references:

Numeric character references specify the code position of a character in the document character set.

So if the document character set encoding is UTF-8, the numeric references should specify a Unicode code point.

The HTML5 spec says, for hexadecimal character references:

The ampersand must be followed by a U+0023 NUMBER SIGN character (#), which must be followed by either a U+0078 LATIN SMALL LETTER X character (x) or a U+0058 LATIN CAPITAL LETTER X character (X), which must then be followed by one or more digits in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0061 LATIN SMALL LETTER A to U+0066 LATIN SMALL LETTER F, and U+0041 LATIN CAPITAL LETTER A to U+0046 LATIN CAPITAL LETTER F, representing a base-sixteen integer that corresponds to a Unicode code point that is allowed according to the definition below. The digits must then be followed by a U+003B SEMICOLON character (;).
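
The syntax quoted above can be checked mechanically. Below is a minimal sketch in Python; the names `HEX_REFERENCE` and `parse_hex_reference` are hypothetical and come from neither spec. It only validates the syntax and returns the integer, and does not check whether the resulting code point is allowed.

```python
import re
from typing import Optional

# '&', '#', then 'x' or 'X', one or more hex digits, then ';'
HEX_REFERENCE = re.compile(r"&#[xX]([0-9a-fA-F]+);")


def parse_hex_reference(ref: str) -> Optional[int]:
    """Return the base-sixteen integer in `ref`, or None if the syntax is wrong."""
    match = HEX_REFERENCE.fullmatch(ref)
    if match is None:
        return None
    return int(match.group(1), 16)


print(parse_hex_reference("&#x20AC;"))  # well-formed reference
print(parse_hex_reference("&#x80"))     # missing semicolon, so None
```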

No mention is made of the document character set, and it simply says that the numeric value identifies a Unicode code point.

But it seems that all the modern browsers (I haven't tested older ones) treat &#x80; through &#x9F; as if they were referencing Windows-1252.

For example, &#x80; displays €, but U+0080 isn't the code point for €; U+20AC is. And the Unicode code point U+0080 is defined as PAD.

&#x20AC; also (correctly) displays €.
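
The difference between the two readings of the number 0x80 can be reproduced outside a browser. The sketch below uses Python's `cp1252` codec as a stand-in for Windows-1252; the variable names are illustrative only.

```python
import unicodedata

# Byte 0x80 interpreted as Windows-1252 (Python codec name "cp1252").
from_cp1252 = bytes([0x80]).decode("cp1252")

# The number 0x80 interpreted directly as a Unicode code point.
from_unicode = chr(0x80)

for label, ch in (("Windows-1252 byte 0x80", from_cp1252),
                  ("Unicode code point U+0080", from_unicode)):
    # C1 control characters have no formal Unicode name, hence the default.
    name = unicodedata.name(ch, "<no formal name>")
    print(f"{label}: U+{ord(ch):04X} {name} (category {unicodedata.category(ch)})")
```

The first reading yields the euro sign at U+20AC; the second yields a control character in category Cc.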

Is this simply pragmatic behaviour by browsers or is there a justification in a specification that I'm missing?

[Note that decimal character references have the same behaviour. I've just used the hexadecimal ones for clarity and consistency.]

Solution

I found the answer to my question. It's in the tokenization section of the HTML5 parsing algorithm, under "consume a character reference", which defines the mapping for these characters.
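
Python's standard-library `html.unescape` is documented as applying the HTML 5 rules to both valid and invalid character references, so it can illustrate the effect of that mapping. This is only a sketch of the observable result, not the spec's tokenizer; the names `REFERENCES` and `decoded` are illustrative.

```python
import html

# Hex and decimal forms of the same reference, the top of the remapped
# range, and the "proper" euro reference for comparison.
REFERENCES = ("&#x80;", "&#128;", "&#x9F;", "&#x20AC;")

decoded = {ref: html.unescape(ref) for ref in REFERENCES}

for ref, ch in decoded.items():
    print(f"{ref} -> U+{ord(ch):04X}")
```

Under these rules, &#x80; and &#x20AC; both decode to U+20AC, which matches what the browsers display.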
