How many characters can UTF-8 encode?


Question

If UTF-8 is 8 bits, doesn't that mean there can be at most 256 different characters?

The first 128 code points are the same as in ASCII. But it is said that UTF-8 can support over a million characters.

How does this work?

Answer

UTF-8 does not always use one byte per character; it uses 1 to 4 bytes. The "8" refers to the 8-bit code unit, not to the size of a whole character.
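To see the variable length in practice, here is a minimal Python sketch. The helper name `utf8_length` and the four sample characters are chosen for illustration only; the sketch relies on nothing beyond the built-in `str.encode`.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


# One sample character per length class (illustrative choices).
samples = [
    "A",           # U+0041, Latin capital letter A
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u4e2d",      # U+4E2D, a common CJK ideograph
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

for ch in samples:
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s)")
```

The four samples should report 1, 2, 3 and 4 bytes respectively.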

The first 128 characters (US-ASCII) need one byte.

The next 1,920 characters need two bytes to encode. This covers the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets, as well as Combining Diacritical Marks.

Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use, including most Chinese, Japanese and Korean (CJK) characters.

Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).

source: Wikipedia
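The byte-length classes above can be turned into a rough count. The sketch below assumes the standard Unicode range U+0000 to U+10FFFF and excludes the surrogate block U+D800 to U+DFFF, which is reserved for UTF-16 and is not valid in UTF-8. It counts encodable code points, not assigned characters, since many code points are still unassigned. The names `RANGES` and `count_by_length` are illustrative.

```python
# (first code point, last code point, bytes per character)
RANGES = [
    (0x0000, 0x007F, 1),
    (0x0080, 0x07FF, 2),
    (0x0800, 0xFFFF, 3),
    (0x10000, 0x10FFFF, 4),
]

# Surrogate code points are reserved for UTF-16 and cannot be encoded in UTF-8.
SURROGATES = range(0xD800, 0xE000)


def count_by_length() -> dict:
    """Count encodable code points for each UTF-8 byte length."""
    counts = {}
    for first, last, n_bytes in RANGES:
        total = last - first + 1
        # Subtract the surrogate block from the range that contains it.
        if first <= SURROGATES.start and SURROGATES.stop - 1 <= last:
            total -= len(SURROGATES)
        counts[n_bytes] = total
    return counts


counts = count_by_length()
print(counts)                # {1: 128, 2: 1920, 3: 61440, 4: 1048576}
print(sum(counts.values()))  # 1112064
```

Under these assumptions, UTF-8 can encode 1,112,064 distinct code points, which is where the "over a million" figure comes from.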
