Spec justification for &#x80; to &#x9F; in UTF-8 documents browser behaviour wanted

Views: 148 | Posted: 2018/6/22 19:34:57 | Tags: html utf-8 windows-1252 character-reference


Question


The HTML 4.01 spec says this about numeric character references:

Numeric character references specify the code position of a character in the document character set.

So if the document character set encoding is UTF-8, the numeric references should specify a Unicode code point.

The HTML5 spec says this about hexadecimal character references:

The ampersand must be followed by a U+0023 NUMBER SIGN character (#), which must be followed by either a U+0078 LATIN SMALL LETTER X character (x) or a U+0058 LATIN CAPITAL LETTER X character (X), which must then be followed by one or more digits in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0061 LATIN SMALL LETTER A to U+0066 LATIN SMALL LETTER F, and U+0041 LATIN CAPITAL LETTER A to U+0046 LATIN CAPITAL LETTER F, representing a base-sixteen integer that corresponds to a Unicode code point that is allowed according to the definition below. The digits must then be followed by a U+003B SEMICOLON character (;).

No mention is made of the document character set, and it simply says that the numeric value identifies a Unicode code point.
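The syntax rule quoted above can be sketched as a small pattern match. This is only an illustration of the quoted grammar: the names `HEX_REF` and `hex_reference_value` are made up for this example, and the sketch does not check the spec's further restriction to "allowed" code points.

```python
import re

# Hypothetical pattern for the quoted rule:
# "&#", then "x" or "X", then one or more hex digits, then ";".
HEX_REF = re.compile(r"&#[xX]([0-9a-fA-F]+);")

def hex_reference_value(ref: str) -> int:
    """Return the base-sixteen integer named by a hexadecimal character reference."""
    match = HEX_REF.fullmatch(ref)
    if match is None:
        raise ValueError(f"not a hexadecimal character reference: {ref!r}")
    return int(match.group(1), 16)

print(hex(hex_reference_value("&#x80;")))    # 0x80
print(hex(hex_reference_value("&#X20AC;")))  # 0x20ac
```

Read literally, the rule makes &#x80; name the integer 0x80, which is why the browser behaviour described next looks surprising.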

But it seems that all the modern browsers (I haven't tested older ones) treat &#x80; through &#x9F; as if they were referencing Windows-1252.

For example, &#x80; displays €, but U+0080 isn't the code point for €; U+20AC is. The Unicode code point U+0080 is defined as PAD, a control character.

&#x20AC; also (correctly) displays €.

Is this simply pragmatic behaviour by browsers or is there a justification in a specification that I'm missing?

[Note that decimal character references have the same behaviour. I've just used the hexadecimal ones for clarity and consistency.]

Solution

I found the answer to my question. It's in the tokenization section of the HTML5 parsing algorithm, under "consume a character reference", which defines a mapping for these numeric values onto the corresponding Windows-1252 characters.
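One way to see that mapping without opening a browser is Python's standard-library `html.unescape`, which is documented as following the HTML5 rules for character references. The snippet below is a sketch, not a copy of the spec's table: `C1_OVERRIDES` and `resolve` are hypothetical names, and the table is only a four-entry excerpt chosen for illustration.

```python
import html

# Illustrative excerpt of the override table (not the complete list):
# a numeric reference value in 0x80-0x9F and the code point used instead.
C1_OVERRIDES = {
    0x80: 0x20AC,  # EURO SIGN
    0x85: 0x2026,  # HORIZONTAL ELLIPSIS
    0x99: 0x2122,  # TRADE MARK SIGN
    0x9F: 0x0178,  # LATIN CAPITAL LETTER Y WITH DIAERESIS
}

def resolve(code: int) -> str:
    """Apply the excerpted overrides, otherwise use the code point as given."""
    return chr(C1_OVERRIDES.get(code, code))

# Cross-check the excerpt against the standard library's HTML5-style unescaping.
for code in C1_OVERRIDES:
    assert html.unescape(f"&#x{code:X};") == resolve(code)

print(html.unescape("&#x80;") == "\u20ac")    # True: hexadecimal form
print(html.unescape("&#128;") == "\u20ac")    # True: decimal form behaves the same
print(html.unescape("&#x20AC;") == "\u20ac")  # True: the direct code point
```

The decimal case matches the note in the question: the override is applied to the numeric value, however it was written.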
