UTF-8 - contradictory definitions


Question

My understanding of UTF-8 encoding is that the first byte of a UTF-8 char carries one of the following:

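  1. data in the lower 7 bits (0-6), with the high bit (7) clear, for single-byte ASCII-range code points
  2. data in the lower 5 bits (0-4), with high bits 7-5 = 110 to indicate a 2-byte char
  3. data in the lower 4 bits (0-3), with high bits 7-4 = 1110 to indicate a 3-byte char
  4. data in the lower 3 bits (0-2), with high bits 7-3 = 11110 to indicate a 4-byte char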

noting that in cases 2-4 bit 7 is always set, and this tells UTF-8 parsers that this is a multi-byte char.
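The lead-byte patterns listed above can be checked with a few bit masks. The following is a minimal Python sketch, not a full validator; the helper name `utf8_sequence_length` is made up for illustration, and it deliberately ignores overlong forms and lead bytes that the UTF-8 specification forbids.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length implied by a UTF-8 lead byte.

    Hypothetical helper: it only inspects the high bits and does not
    reject lead bytes that are disallowed for other reasons.
    """
    if (first_byte & 0b1000_0000) == 0b0000_0000:  # 0xxxxxxx: single-byte ASCII
        return 1
    if (first_byte & 0b1110_0000) == 0b1100_0000:  # 110xxxxx: 2-byte char
        return 2
    if (first_byte & 0b1111_0000) == 0b1110_0000:  # 1110xxxx: 3-byte char
        return 3
    if (first_byte & 0b1111_1000) == 0b1111_0000:  # 11110xxx: 4-byte char
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a valid lead byte.
    raise ValueError(f"0x{first_byte:02X} is not a UTF-8 lead byte")


# 'e', the lead byte of é, the lead byte of a 3-byte char, a 4-byte lead byte
print([utf8_sequence_length(b) for b in (0x65, 0xC3, 0xEC, 0xF0)])
```

A byte such as 0xA9 matches none of the four patterns because its high bits are 10, which marks it as a continuation byte.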

This means that any Unicode code point in the range 128-255 has to be encoded in 2 or more bytes, because the high bit that would be required to encode it in one byte is reserved in UTF-8 for the 'multi-byte indicator bit'. So e.g. the character é (e-acute, which is Unicode code point \u00E9, 233 decimal) is encoded in UTF-8 as a two-byte character \xC3A9.
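To make the arithmetic concrete, here is a small Python sketch that builds the two bytes for é by hand and compares the result with Python's built-in encoder. The function name `encode_two_byte` is an assumption made for this example, and it only covers the 2-byte range.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes (illustrative only)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the 2-byte UTF-8 range")
    lead = 0b1100_0000 | (code_point >> 6)                   # 110xxxxx: top 5 data bits
    continuation = 0b1000_0000 | (code_point & 0b0011_1111)  # 10xxxxxx: low 6 data bits
    return bytes([lead, continuation])


encoded = encode_two_byte(0xE9)   # U+00E9, é
print(encoded.hex())              # c3a9
print(encoded == "é".encode("utf-8"))
```

The value 233 fits in one byte as a number, but its UTF-8 form needs two because a single byte with bit 7 set is never a complete character.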

The following table from here shows how the code point \u00E9 is encoded in UTF-8 as \xC3A9.

However, this is not how it works in a web page, it seems. I have recently had some contradictory behavior in the rendering of Unicode chars, and in my exploratory reading came across this:

  • "For values from 160 to 255, UTF-8 is identical to ANSI and 8859-1." (w3schools)

which clearly contradicts the above.

And if I render these various values in jsfiddle I get

So HTML is rendering the Unicode code point as é, not the UTF-8 2-byte encoding of that code point. In fact HTML renders the UTF-8 char \xC3A9 as the Hangul syllable that has the code point \xC3A9:

W3schools has a table that explicitly defines the UTF-8 of é as Decimal 233 (\xE9):

So HTML is rendering code points, not UTF-8 chars.

Am I missing something here? Can anyone explain to me why, in a supposedly UTF-8 HTML document, it seems like there is no UTF-8 parsing going on at all?

Recommended answer

Your understanding of the encoding of UTF-8 bytes is correct.

Your jsfiddle example is using UTF-8 only as a byte encoding for the HTML file (hence the use of the <meta charset="UTF-8"> HTML tag), but not as an encoding of the HTML itself. HTML only uses ASCII characters for its markup, but that markup can represent Unicode characters.

UTF-8 is a byte encoding for Unicode codepoints. It is commonly used for transmission of Unicode data, such as an HTML file over HTTP. But HTML itself is defined in terms of Unicode codepoints only, not in UTF-8 specifically. A web browser would receive the raw UTF-8 bytes over the wire and decode them to Unicode codepoints before processing them in the context of the HTML.
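The decode-then-process order can be imitated in a few lines of Python. This is only a sketch of the idea, not of how any particular browser is implemented; the byte string `wire_bytes` is made-up sample data.

```python
# Bytes as they might arrive over the wire: é is the two bytes 0xC3 0xA9.
wire_bytes = b"<p>\xc3\xa9</p>"

# Step 1: decode the UTF-8 bytes into a sequence of Unicode codepoints.
text = wire_bytes.decode("utf-8")

# Step 2: everything after this point sees codepoints, not bytes.
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(len(wire_bytes), len(text))   # 9 bytes, 8 codepoints
print(code_points[3])               # U+00E9

# The codepoint number 233 happens to equal the ISO-8859-1 byte value for é,
# even though the UTF-8 byte sequence is different.
print(text[3].encode("latin-1"), text[3].encode("utf-8"))
```

The last line suggests one reading of the quoted w3schools claim: it is true of codepoint numbers in the range 160-255, not of the bytes that UTF-8 produces for them.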

HTML entities deal in Unicode codepoints only, not in code units such as those used in UTF-8.

HTML entities in &#<xxx>; format represent Unicode codepoints by their numeric values directly.

  • &#233; (é) and &#xE9; (é) represent integer 233 in decimal and hex formats, respectively. 233 is the numeric value of Unicode codepoint U+00E9 LATIN SMALL LETTER E WITH ACUTE, which is encoded in UTF-8 bytes as 0xC3 0xA9.

  • &#xc3a9; (쎩) represents integer 50089 in hex format (0xC3A9). 50089 is the numeric value of Unicode codepoint U+C3A9 HANGUL SYLLABLE SSYEOLG, which is encoded in UTF-8 as bytes 0xEC 0x8E 0xA9.
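The numeric forms above can be reproduced with Python's standard `html.unescape`, which resolves character references to codepoints in a way that is close to, though not guaranteed identical to, what a browser does. The snippet is a sketch for checking the numbers, nothing more.

```python
import html
import unicodedata

e_acute = html.unescape("&#233;")     # decimal reference
same_char = html.unescape("&#xE9;")   # hex reference to the same codepoint
hangul = html.unescape("&#xc3a9;")    # a different codepoint: U+C3A9

print(e_acute == same_char, ord(e_acute))        # True 233
print(ord(hangul), unicodedata.name(hangul))     # 50089 and the character name
print(e_acute.encode("utf-8").hex())             # c3a9
print(hangul.encode("utf-8").hex())              # ec8ea9
```

Writing &#xc3a9; therefore asks for codepoint U+C3A9, not for the two UTF-8 bytes C3 A9.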

HTML entities in &<name>; format represent Unicode codepoints by a human-readable name defined by HTML.

  • &eacute; (é) represents Unicode codepoint U+00E9, same as &#233; and &#xE9; do.
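Named entities can be checked the same way. Python ships a name-to-codepoint table in `html.entities`; the sketch below assumes that table agrees with the HTML definition for `eacute`, which it does for this long-standing name.

```python
import html
from html.entities import name2codepoint

# The name maps straight to a codepoint number, with no UTF-8 involved.
print(hex(name2codepoint["eacute"]))   # 0xe9

# All three spellings resolve to the same single character.
forms = [html.unescape(s) for s in ("&eacute;", "&#233;", "&#xE9;")]
print(forms, len(set(forms)))
```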
