UTF-8 - contradictory definitions

Views: 100 Published: 2020/7/13 6:19:34 Tags: html unicode encoding utf-8

Question

My understanding of UTF-8 encoding is that the first byte of a UTF-8 char carries either

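data in the lower 7 bits (0-6), with the high bit (7) clear, for single-byte ASCII-range code-points
data in the lower 5 bits (0-4), with high bits 7-5 = 110 to indicate a 2-byte char
data in the lower 4 bits (0-3), with high bits 7-4 = 1110 to indicate a 3-byte char
data in the lower 3 bits (0-2), with high bits 7-3 = 11110 to indicate a 4-byte char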

noting that in the three multi-byte cases bit 7 is always set, and this tells UTF-8 parsers that this is a multi-byte char.
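The four lead-byte patterns listed above can be sketched as a small classifier. This is only an illustration of the bit layout as described in the question; the function name `lead_byte_info` is made up for this example, and it does not reject lead bytes that real decoders treat as invalid (such as 0xC0, 0xC1, or 0xF5 and above).

```python
def lead_byte_info(byte: int) -> tuple[int, int]:
    """Return (sequence_length, payload_bits) for a UTF-8 lead byte.

    Hypothetical helper for illustration; it only checks the prefix
    bits and does not validate overlong or out-of-range sequences.
    """
    if byte & 0b1000_0000 == 0b0000_0000:   # 0xxxxxxx: single-byte ASCII
        return 1, byte & 0b0111_1111
    if byte & 0b1110_0000 == 0b1100_0000:   # 110xxxxx: 2-byte sequence
        return 2, byte & 0b0001_1111
    if byte & 0b1111_0000 == 0b1110_0000:   # 1110xxxx: 3-byte sequence
        return 3, byte & 0b0000_1111
    if byte & 0b1111_1000 == 0b1111_0000:   # 11110xxx: 4-byte sequence
        return 4, byte & 0b0000_0111
    # 10xxxxxx is a continuation byte, never a lead byte.
    raise ValueError(f"0x{byte:02X} is not a UTF-8 lead byte")


if __name__ == "__main__":
    for b in (0x41, 0xC3, 0xE2, 0xF0):
        print(f"0x{b:02X} ->", lead_byte_info(b))
```

Note that a byte such as 0xA9 (pattern 10xxxxxx) has bit 7 set but is a continuation byte, so the sketch raises `ValueError` for it.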

This means that any Unicode code-point in the range 128-255 has to be encoded in 2 or more bytes, because the high bit that would be required to encode it in one byte is reserved in UTF-8 for the 'multi-byte indicator bit'. So e.g. the character é (e-acute, which is Unicode code-point \u00E9, 233 decimal) is encoded in UTF-8 as a two-byte character \xC3A9.
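The arithmetic behind the é example can be worked through by hand. The following is a minimal sketch, assuming only the 2-byte layout (110xxxxx followed by a continuation byte 10xxxxxx); `encode_two_byte` is a hypothetical helper name, and the result is compared against Python's built-in encoder.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as a 2-byte UTF-8 sequence.

    Hypothetical helper for illustration only.
    """
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point does not fit the 2-byte form")
    lead = 0b1100_0000 | (code_point >> 6)             # top 5 payload bits
    cont = 0b1000_0000 | (code_point & 0b0011_1111)    # low 6 payload bits
    return bytes([lead, cont])


if __name__ == "__main__":
    manual = encode_two_byte(0xE9)
    builtin = "\u00e9".encode("utf-8")
    print(manual.hex(), builtin.hex())
```

For 0xE9 (binary 11101001) the top bits 00011 go into the lead byte, giving 0xC3, and the low bits 101001 go into the continuation byte, giving 0xA9.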

The following table from here shows how the code-point \u00E9 is encoded in UTF-8 as \xC3A9.

However, it seems this is not how it works in a web page. I have recently seen some contradictory behavior in the rendering of Unicode chars, and in my exploratory reading came across this:

"UTF-8 is identical to both ANSI and 8859-1 for the values from 160 to 255." (w3schools)

This clearly contradicts the above.
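One way to probe the quoted statement is to compare, for each value from 160 to 255, the bytes produced by ISO-8859-1 (Latin-1) with the bytes produced by UTF-8. This is a small sketch, not a statement about what w3schools intended; `encodings_for` is a hypothetical helper name.

```python
def encodings_for(code_point: int) -> tuple[bytes, bytes]:
    """Return (latin-1 bytes, utf-8 bytes) for one code point.

    Hypothetical helper for illustration only.
    """
    ch = chr(code_point)
    return ch.encode("latin-1"), ch.encode("utf-8")


if __name__ == "__main__":
    for cp in (160, 233, 255):
        latin1, utf8 = encodings_for(cp)
        print(cp, latin1.hex(), utf8.hex())
```

In this range the Latin-1 byte value equals the code-point number, while the UTF-8 form is always two bytes, so the two agree on numbering but not on the encoded bytes.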

And if I render these various values in jsfiddle I get

So HTML is rendering the Unicode code-point as é, not the UTF-8 2-byte encoding of that code-point. In fact HTML renders the UTF-8 char \xC3A9 as the Hangul syllable that has the code-point \xC3A9:
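The difference between the number 0xC3A9 read as a code-point and the two bytes C3 A9 read as UTF-8 can be reproduced outside the browser. The sketch below assumes the jsfiddle test used numeric character references such as `&#233;` and `&#xC3A9;`; the rendered output is not preserved in the text, so that is a guess.

```python
import html
import unicodedata

# The two bytes C3 A9, decoded as UTF-8, give one character.
from_bytes = bytes([0xC3, 0xA9]).decode("utf-8")

# The number 0xC3A9, taken as a code point, gives a different character.
from_code_point = chr(0xC3A9)

# Numeric character references name code points (assumed test input).
ref_233 = html.unescape("&#233;")
ref_c3a9 = html.unescape("&#xC3A9;")

if __name__ == "__main__":
    for label, ch in (
        ("bytes C3 A9 as UTF-8", from_bytes),
        ("code point U+C3A9", from_code_point),
        ("&#233;", ref_233),
        ("&#xC3A9;", ref_c3a9),
    ):
        print(f"{label}: U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Under that assumption, a numeric reference bypasses the byte-level encoding entirely, which would match the behavior described here.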

W3schools has a table that explicitly defines the UTF-8 of é as Decimal 233 (\xE9):

So HTML is rendering code-points, not UTF-8 chars.

Am I missing something here? Can anyone explain to me why, in a supposedly UTF-8 HTML document, it seems like there is no UTF-8 parsing going on at all?
