Why does UTF-8 use more than one byte to represent some characters?

Question

I recently went through an article on character encoding, and I have a question about a certain point mentioned there.

In the first figure, the author shows the characters, their code points in various character sets and how they are encoded in various encoding formats. For example the code point of é is E9. In ISO-8859-1 encoding it is represented as E9. In UTF-16 it is represented as 00 E9. But in UTF-8 it is represented using 2 bytes, C3 A9.

My question is: why is this required? The code point E9 fits in 1 byte, so why are two bytes used? Can you please let me know?

Recommended answer

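UTF-8 uses the high bits of each byte to indicate its role in a sequence. A byte of the form 0xxxxxxx is a complete one-byte character, which covers code points 00 to 7F with 7 data bits. A multi-byte character starts with a lead byte whose high bits give the length (110xxxxx for 2 bytes, 1110xxxx for 3, 11110xxx for 4), and every following continuation byte has the form 10xxxxxx, so only its low 6 bits are used for actual character data. That means that any character over 7F requires (at least) 2 bytes. For é, E9 is 11101001 in binary; its 8 bits are split as 110 00011 and 10 101001, which gives C3 A9. A lone E9 byte cannot stand for é in UTF-8, because its high bits (1110) mark it as the start of a 3-byte sequence.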
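As a rough illustration of how the two bytes for é come about, here is a minimal Python sketch. The helper name `utf8_encode_2byte` is hypothetical and made up for this example; it only handles code points in the two-byte range (80 to 7FF) and is checked against Python's built-in `str.encode`.

```python
def utf8_encode_2byte(code_point: int) -> bytes:
    """Hand-encode a code point in U+0080..U+07FF as two UTF-8 bytes.

    Hypothetical helper for illustration only; real code should use str.encode.
    """
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only handles the two-byte range")
    # Lead byte: marker bits 110, followed by the top 5 bits of the code point.
    lead = 0b11000000 | (code_point >> 6)
    # Continuation byte: marker bits 10, followed by the low 6 bits.
    continuation = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, continuation])


e_acute = "\u00e9"  # é, code point E9

print(utf8_encode_2byte(0xE9).hex())       # c3a9
print(e_acute.encode("utf-8").hex())       # c3a9
print(e_acute.encode("latin-1").hex())     # e9   (ISO-8859-1)
print(e_acute.encode("utf-16-be").hex())   # 00e9 (big-endian, no BOM)
```

The hand-rolled result matches the built-in encoder, and the last two lines reproduce the ISO-8859-1 and UTF-16 values from the question.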
