How many bytes does one Unicode character take?

Views: 212 | Published: 2017/8/16 19:10:35 | Tags: string, language-agnostic, unicode, encoding

Question

I am a bit confused about encodings. As far as I know old ASCII characters took one byte per character. How many bytes does a Unicode character require?

I assume that one Unicode character can contain every possible character from any language - am I correct? So how many bytes does it need per character?

And what do UTF-7, UTF-6, UTF-16 etc. mean? Are they different versions of Unicode?

I read the Wikipedia article about Unicode but it is quite difficult for me. I am looking forward to seeing a simple answer.

Answer

You won't see a simple answer because there isn't one.

First, Unicode doesn't contain "every character from every language", although it sure does try.

Unicode itself is a mapping: it defines codepoints, and a codepoint is a number that is usually associated with a character. I say usually because there are concepts like combining characters. You may be familiar with things like accents or umlauts. Those can be combined with another character, such as an a or a u, to create a new logical character. A character can therefore consist of one or more codepoints.
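To make the codepoint idea concrete, here is a minimal Python sketch. It assumes Python 3, where a `str` is a sequence of codepoints; the helper name `describe` is made up for this illustration.

```python
import unicodedata

def describe(text):
    """Return (codepoint, official Unicode name) pairs for each codepoint in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# An 'a' followed by a combining diaeresis: displayed as one character (ä),
# but stored as two codepoints.
a_umlaut_combined = "a\u0308"

print(len(a_umlaut_combined))       # number of codepoints, not visible characters
print(describe(a_umlaut_combined))
```

Here `len` reports 2 even though a reader sees a single letter, which is the "one or more codepoints per character" point in practice.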

To be useful in computing systems we need to choose a representation for this information. Those are the various Unicode encodings, such as UTF-8, UTF-16LE, UTF-32 etc. They are distinguished largely by the size of their codeunits. UTF-32 is the simplest encoding: it has a codeunit that is 32 bits, which means an individual codepoint fits comfortably into a single codeunit. The other encodings will have situations where a codepoint needs multiple codeunits, or where a particular codepoint can't be represented in the encoding at all (this is a problem with UCS-2, for instance).
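As a rough illustration of how the byte count depends on the encoding, the following Python sketch encodes a few single codepoints and counts the resulting bytes. The function name `bytes_per_encoding` is invented for this example, and the little-endian codec variants are chosen only so that Python does not prepend a byte order mark.

```python
def bytes_per_encoding(ch, encodings=("utf-8", "utf-16-le", "utf-32-le")):
    """Map each encoding name to the number of bytes needed to encode ch."""
    return {enc: len(ch.encode(enc)) for enc in encodings}

# One ASCII letter, one Latin-1 letter, one symbol from the Basic Multilingual
# Plane, and one emoji from outside it.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", bytes_per_encoding(ch))
```

UTF-32 uses 4 bytes for every sample, UTF-8 ranges from 1 to 4 bytes, and UTF-16 uses 2 bytes until the codepoint falls outside the Basic Multilingual Plane, where it needs two codeunits (4 bytes).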

Because of the flexibility of combining characters, even within a given encoding the number of bytes per character can vary depending on the character and the normalization form. Normalization is a protocol for dealing with characters that have more than one representation (you can say "an 'a' with an accent", which is 2 codepoints, one of which is a combining char, or "accented 'a'", which is one codepoint).
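The effect of the normalization form on size can be sketched in Python with the standard `unicodedata` module. The helper `utf8_len` is a name made up for this example; NFC is the composed form and NFD the decomposed one.

```python
import unicodedata

def utf8_len(text, form):
    """Normalize text to the given form, then count its UTF-8 bytes."""
    return len(unicodedata.normalize(form, text).encode("utf-8"))

accented_a = "\u00e1"  # 'a' with acute accent as a single codepoint

print(utf8_len(accented_a, "NFC"))  # composed: one codepoint
print(utf8_len(accented_a, "NFD"))  # decomposed: 'a' plus a combining accent
```

The same visible character takes 2 bytes in UTF-8 when composed and 3 bytes when decomposed, so "bytes per character" is not fixed even once the encoding is chosen.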
