How many bytes does one Unicode character take?


Question

I am a bit confused about encodings. As far as I know, old ASCII characters took one byte per character. How many bytes does a Unicode character require?

I assume that one Unicode character can contain every possible character from any language. Am I correct? If so, how many bytes does it need per character?

And what do UTF-7, UTF-6, UTF-16 etc. mean? Are they different versions of Unicode?

I read the Wikipedia article about Unicode, but it is quite difficult for me. I am looking forward to seeing a simple answer.

Recommended answer

You won't see a simple answer because there isn't one.

First, Unicode doesn't contain "every character from every language", although it sure does try.

Unicode itself is a mapping: it defines code points, and a code point is a number that is usually associated with one character. I say "usually" because there are concepts like combining characters. You may be familiar with things like accents or umlauts. Those can be attached to another character, such as an a or a u, to create a new logical character. A character can therefore consist of one or more code points.
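To make the code point versus character distinction concrete, here is a minimal Python sketch. It assumes Python 3, where `len()` on a `str` counts code points; the two sample strings and the `describe` helper are illustrative choices, not part of the original answer.

```python
import unicodedata

# Two ways to write the same logical character (illustrative strings).
decomposed = "a\u0308"   # U+0061 followed by U+0308 COMBINING DIAERESIS
precomposed = "\u00e4"   # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, text in (("decomposed", decomposed), ("precomposed", precomposed)):
    # len() on a Python str counts code points, not logical characters.
    print(label, len(text), describe(text))
```

Both strings display as a single "ä", yet the first holds two code points and the second holds one.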

To be useful in computing systems, we need to choose a representation for this information. Those are the various Unicode encodings, such as UTF-8, UTF-16LE, UTF-32 etc. They are distinguished largely by the size of their code units. UTF-32 is the simplest encoding: it has a 32-bit code unit, which means an individual code point fits comfortably into one code unit. The other encodings have situations where a code point needs multiple code units, or where a particular code point can't be represented in the encoding at all (this is a problem with UCS-2, for instance).
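As a rough illustration of how the byte count depends on the encoding, the following Python sketch encodes a few sample code points and measures the result. The sample characters, the `SAMPLES` and `ENCODINGS` names, and the `encoded_sizes` helper are assumptions chosen for this example; the little-endian codec names are used so that no byte order mark is added to the output.

```python
# Sample code points chosen for illustration (not from the original answer).
SAMPLES = {
    "latin capital A": "\u0041",
    "e with acute": "\u00e9",
    "euro sign": "\u20ac",
    "grinning face": "\U0001F600",  # outside the Basic Multilingual Plane
}

# The "-le" codec variants do not prepend a byte order mark.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(ch):
    """Return the number of bytes ch occupies in each encoding (hypothetical helper)."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


for name, ch in SAMPLES.items():
    print(f"U+{ord(ch):04X} {name}: {encoded_sizes(ch)}")
```

In UTF-8 the four samples take 1, 2, 3 and 4 bytes respectively. In UTF-16 the first three take one 16-bit code unit each, while the last needs two (a surrogate pair). UTF-32 always uses 4 bytes per code point.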

Because of the flexibility of combining characters, even within a given encoding the number of bytes per character can vary depending on the character and the normalization form. Normalization is a convention for dealing with characters that have more than one representation: you can say "an 'a' with an accent", which is two code points, one of which is a combining character, or "accented 'a'", which is one code point.
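The effect of the normalization form on size can be sketched with Python's standard `unicodedata.normalize`. The accented "a" below is an illustrative choice, and NFC and NFD are only two of the defined normalization forms; the variable names are hypothetical.

```python
import unicodedata

# 'a' followed by U+0301 COMBINING ACUTE ACCENT (illustrative input).
accented = "a\u0301"

# NFC composes to a single code point where one exists; NFD decomposes again.
nfc = unicodedata.normalize("NFC", accented)
nfd = unicodedata.normalize("NFD", nfc)

for label, text in (("NFC", nfc), ("NFD", nfd)):
    print(
        label,
        "code points:", len(text),
        "utf-8 bytes:", len(text.encode("utf-8")),
        "utf-16 bytes:", len(text.encode("utf-16-le")),
    )
```

The same logical character takes 2 bytes in UTF-8 when composed and 3 bytes when decomposed, so "bytes per character" depends on the normalization form as well as on the encoding.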
