Are there bytes that are not used in the UTF-8 encoding?

Question

As I understand it, UTF-8 is a superset of ASCII and therefore includes the control characters, which are not used to represent printable characters.

My question is: are there any bytes (of the 256 different values) that are not used by the UTF-8 encoding?

I wondered whether you could convert/encode UTF-8 text to binary.

Here is my thought process:

I have no idea how the UTF-8 text encoding works or how it can represent so many characters (only that it uses multiple bytes for characters not in ASCII (Latin-1??)), but I know that ASCII text is valid UTF-8. So the control characters (bytes 0-30) are not used differently by the UTF-8 encoding, but at the same time they are not used for displaying characters, right?

So of the 256 different byte values only ~230 are used. For a Unicode text that is 1000 bytes (binary) long there are only 1000^230 different texts, right?

If that is true, you could convert it to binary data that is smaller than 1000 bytes.

Wolfram Alpha: 1000 bytes of Unicode (assuming Unicode uses only 230 of the 256 different byte values) -> 496 bytes

Recommended answer

You have to distinguish between characters, Unicode, and the UTF-8 encoding:

In encodings like ASCII, Latin-1, etc. there is a one-to-one relation between one character and one number between 0 and 255 (0 to 127 for ASCII), so a character can be encoded by exactly one byte (e.g. "A" -> 65). To decode such a text you need to know which encoding was used (does 65 really mean "A"?).
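
To illustrate why the decoder has to know the encoding, here is a minimal Python sketch. The choice of byte 0xE9 and of the two code pages (`latin-1` and `cp437`) is mine, not part of the original answer; any pair of single-byte encodings that disagree above 127 would do.

```python
# One byte value, read under two different single-byte encodings.
raw = bytes([0xE9])

as_latin1 = raw.decode("latin-1")  # U+00E9, "e" with acute accent
as_cp437 = raw.decode("cp437")     # a different character in the old IBM PC code page

print(repr(as_latin1), repr(as_cp437))

# Plain ASCII values such as 65 agree in both encodings.
ascii_agrees = bytes([65]).decode("latin-1") == bytes([65]).decode("cp437") == "A"
print(ascii_agrees)
```

The same byte yields two different characters, while 65 decodes to "A" either way because both code pages share the ASCII range.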

To overcome this situation, Unicode assigns every character (including all kinds of special things like control characters, diacritic marks, etc.) a unique number in the range from 0 to 0x10FFFF (the so-called Unicode code point). As this range does not fit into one byte, the question is how to encode it. There are several ways to do this; the simplest would always use 4 bytes for each character (as UTF-32 does). Since this consumes a lot of space, UTF-8 is a more efficient encoding: here every Unicode code point (= character) is encoded in one, two, three or four bytes. (In this encoding not all byte values from 0 to 255 are used, but this is only a technical detail.)
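
As a rough check of the last two points, the following Python sketch encodes a few sample characters to show the one-to-four-byte lengths, then brute-forces every encodable code point to see which byte values never occur. The function names and sample characters are my own choices for illustration; surrogates (U+D800 to U+DFFF) are skipped because they cannot be encoded in UTF-8.

```python
def utf8_length(ch):
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))

# Samples: U+0041 (A), U+00E9, U+20AC (euro sign), U+1F600 (an emoji).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [utf8_length(ch) for ch in samples]
print(lengths)  # [1, 2, 3, 4]

def unused_utf8_bytes():
    """Byte values that never appear in the UTF-8 encoding of any code point."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        used.update(chr(cp).encode("utf-8"))
    return [b for b in range(256) if b not in used]

unused = unused_utf8_bytes()
print([hex(b) for b in unused])
```

The second part should report 13 byte values: 0xC0, 0xC1 and 0xF5 through 0xFF. Note that all control bytes below 0x20 do occur, since they encode the corresponding control characters.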
